§1. UTF-8 is a system of variable-length character encoding used extensively on the Internet and elsewhere for representing the characters of Unicode. It serves as a lossless compression scheme, useful because a Unicode code point can require as many as 21 bits (which may be increased in the future), but many of the most-used characters can feasibly be represented in as few as 8 bits.
In contrast is UTF-32, which allocates 32 bits for every code point, and performs no compression. Although UTF-32 is simple, it is not space-efficient. Besides that, there are compression schemes other than UTF-8, most notably UTF-16.
A four-byte version of UTF-8 is detailed at RFC 3629; four bytes (indeed, three) are sufficient for the current size of the Unicode character set. An earlier six-byte version is discussed at RFC 2279. Beyond that, there is an obvious way to extend UTF-8 to eight bytes, maintaining full compatibility, in order to allow encoding as many as 242 characters. That is the subject of this report. A secondary purpose of this presentation is to reveal, in bit-by-bit detail, how UTF-8 sequences are structured.
The main tables below are written using binary digits, arranged in eight-bit bytes, most significant bit first. The symbols '0' and '1' have their customary meaning. However, two additional symbols require an explanation:
An apostrophe is inserted at the middle of each byte for ease of reading.
§2. In this exension of UTF-8, each character is represented by a sequence of bytes, quantity one to eight. The first byte indicates the total number of bytes in the sequence, and may also contain some or all of the bits that indicate the value of the character being encoded (written 'd'). Any remaining bytes are in the format of an extension byte.
Table one shows the format of a byte:
table one | ||
first byte of sequence | total number of bytes in sequence | number of extension bytes |
---|---|---|
zddd'dddd | 1 | 0 |
wwzd'dddd | 2 | 1 |
wwwz'dddd | 3 | 2 |
wwww'zddd | 4 | 3 |
wwww'wzdd | 5 | 4 |
wwww'wwzd | 6 | 5 |
wwww'wwwz | 7 | 6 |
wwww'wwww | 8 | 7 |
The format of any extension byte is wzdd'dddd. |
§3. Table two contains the encoding patterns for eight-byte UTF-8. Underscored and in boldface are the bits that are added at each level; they usually number five, but they can be four or six. The rightmost column shows the number of bits that the pattern can hold -- in other words, the number of significant figures. Patterns marked "overlong" are prohibited for any use. The rows of the table are lettered for reference.
table two | ||||
eight-byte UTF-8 | ref. | code points | net bits | gross bits |
---|---|---|---|---|
z000'0000
thru z111'1111 | A | 0000'0000
thru 0111'1111 | 7 | 8 |
wwz0'0000 wz00'0000
thru wwz0'0001 wz11'1111 | B | overlong | ||
wwz0'0010 wz00'0000
thru wwz1'1111 wz11'1111 | C | 0000'0000 1000'0000
thru 0000'0111 1111'1111 | 11 | 16 |
wwwz'0000 wz00'0000 wz00'0000
thru wwwz'0000 wz01'1111 wz11'1111 | D | overlong | ||
wwwz'0000 wz10'0000 wz00'0000
thru wwwz'1111 wz11'1111 wz11'1111 | E | 0000'1000 0000'0000
thru 1111'1111 1111'1111 | 16 | 16 |
wwww'z000 wz00'0000 wz00'0000 wz00'0000
thru wwww'z000 wz00'1111 wz11'1111 wz11'1111 | F | overlong | ||
wwww'z000 wz01'0000 wz00'0000 wz00'0000
thru wwww'z111 wz11'1111 wz11'1111 wz11'1111 | G | 0000'0001 0000'0000 0000'0000
thru 0001'1111 1111'1111 1111'1111 | 21 | 24 |
four-byte UTF-8 stops here | ||||
wwww'wz00 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wz00 wz00'0111 wz11'1111 wz11'1111 wz11'1111 | H | overlong | ||
wwww'wz00 wz00'1000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wz11 wz11'1111 wz11'1111 wz11'1111 wz11'1111 | I | 0000'0000 0010'0000 0000'0000 0000'0000
thru 0000'0011 1111'1111 1111'1111 1111'1111 | 26 | 32 |
wwww'wwz0 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwz0 wz00'0011 wz11'1111 wz11'1111 wz11'1111 wz11'1111 | J | overlong | ||
wwww'wwz0 wz00'0100 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwz1 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 | K | 0000'0100 0000'0000 0000'0000 0000'0000
thru 0111'1111 1111'1111 1111'1111 1111'1111 | 31 | 32 |
six-byte UTF-8 stops here | ||||
wwww'wwwz wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwwz wz00'0001 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 | L | overlong | ||
wwww'wwwz wz00'0010 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwwz wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 | M | 0000'0000 1000'0000 0000'0000 0000'0000 0000'0000
thru 0000'1111 1111'1111 1111'1111 1111'1111 1111'1111 | 36 | 40 |
wwww'wwww wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwww wz00'0000 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 | N | overlong | ||
wwww'wwww wz00'0001 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwww wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 | O | 0000'0000 0001'0000 0000'0000 0000'0000 0000'0000 0000'0000
thru 0000'0011 1111'1111 1111'1111 1111'1111 1111'1111 1111'1111 | 42 | 48 |
Table three is similar to table two, but written in hexadecimal. Therefore the symbols 'z' and 'w' cannot be used.
table three | ||
eight-byte UTF-8 | ref. | code points |
---|---|---|
00
thru 7F | A | 00
thru 7F |
C0 80
thru C1 BF | B | overlong |
C2 80
thru DF BF | C | 00 80
thru 07 FF |
E0 80 80
thru E0 9F BF | D | overlong |
E0 A0 80
thru EF BF BF | E | 08 00
thru FF FF |
F0 80 80 80
thru F0 8F BF BF | F | overlong |
F0 90 80 80
thru F7 BF BF BF | G | 01 00 00
thru 1F FF FF |
four-byte UTF-8 stops here | ||
F8 80 80 80 80
thru F8 87 BF BF BF | H | overlong |
F8 88 80 80 80
thru FB BF BF BF BF | I | 00 20 00 00
thru 03 FF FF FF |
FC 80 80 80 80 80
thru FC 83 BF BF BF BF | J | overlong |
FC 84 80 80 80 80
thru FD BF BF BF BF BF | K | 04 00 00 00
thru 7F FF FF FF |
six-byte UTF-8 stops here | ||
FE 80 80 80 80 80 80
thru FE 81 BF BF BF BF BF | L | overlong |
FE 82 80 80 80 80 80
thru FE BF BF BF BF BF BF | M | 00 80 00 00 00
thru 0F FF FF FF FF |
FF 80 80 80 80 80 80 80
thru FF 80 BF BF BF BF BF BF | N | overlong |
FF 81 80 80 80 80 80 80
thru FF BF BF BF BF BF BF BF | O | 00 10 00 00 00 00
thru 03 FF FF FF FF FF |
§4. Table four gives an example, using row C, of how the significant bits (eleven in this case) of the raw code point are merely copied -- never changed -- into the bits of the UTF-8 form. The bits remain in their original sequence, as indicated by the subscripts. A hash mark is written between bytes for clarity.
table four | ||
eight-byte UTF-8 | ref. | code points |
---|---|---|
wwz0'0010 # wz00'0000
thru wwz1'1111 # wz11'1111 | C as above | 0000'0000 # 1000'0000
thru 0000'0111 # 1111'1111 |
wwz 0 '
0 0 1 0 #
wz 0 0 '
0 0 0 0
thru wwz 1 ' 1 1 1 1 # wz 1 1 ' 1 1 1 1 | C expanded |
0000 '
0 0 0 0 #
1 0 0 0 '
0 0 0 0
thru 0000 ' 0 1 1 1 # 1 1 1 1 ' 1 1 1 1 |
wwz dA ' d9d8d7d6 # wz d5d4 ' d3d2d1d0 | C subscripted | 0000 ' 0 dAd9d8 # d7d6d5d4 ' d3d2d1d0 |
This scheme facilitates use of bit-shifting and bit-masking operations, for which typical computer hardware offers speedy instructions.
§5. The Unicode standard does not use all of row G in the charts above. The highest code point that could be supported in the four-byte version is (in hex) 1F FF FF, but Unicode stops at 10 FF FF, because that leaves plenty of room for all the currently defined characters. Tables five and six are two ways of showing the split:
table five | ||
eight-byte UTF-8 | ref. | code points |
---|---|---|
wwww'z000 wz01'0000 wz00'0000 wz00'0000
thru wwww'z111 wz11'1111 wz11'1111 wz11'1111 | G complete as above | 0000'0001 0000'0000 0000'0000
thru 0001'1111 1111'1111 1111'1111 |
wwww'z000 wz01'0000 wz00'0000 wz00'0000
thru wwww'z100 wz00'1111 wz11'1111 wz11'1111 | G lower Unicode | 0000'0001 0000'0000 0000'0000
thru 0001'0000 1111'1111 1111'1111 |
wwww'z100 wz01'0000 wz00'0000 wz00'0000
thru wwww'z111 wz11'1111 wz11'1111 wz11'1111 | G upper not Unicode | 0001'0001 0000'0000 0000'0000
thru 0001'1111 1111'1111 1111'1111 |
table six | ||
eight-byte UTF-8 | ref. | code points |
---|---|---|
F0 90 80 80
thru F7 BF BF BF | G complete as above | 01 00 00
thru 1F FF FF |
F0 90 80 80
thru F4 8F BF BF | G lower Unicode | 01 00 00
thru 10 FF FF |
F4 90 80 80
thru F7 BF BF BF | G upper not Unicode | 11 00 00
thru 1F FF FF |
§6. Within any encoding, each bit pattern can appear in only certain places, if at all. This fact aids in detecting errors, and in salvaging as much as possible from a detective series of encodings.
table seven | ||
Any sequence of length one has the format 0ddd'dddd.
The extension byte, whose format is 10dd'dddd, will appear in any sequence of at least two bytes. Further … | ||
this bit pattern cannot appear | anywhere in a sequence of under | ref. |
---|---|---|
110d'dddd | 2 bytes | C |
1110'dddd | 3 bytes | E |
1111'0ddd | 4 bytes | G |
1111'10dd | 5 bytes | I |
1111'110d | 6 bytes | K |
1111'1110 | 7 bytes | M |
1111'1111 | 8 bytes | O |
If a communications channel summarily intercepts 1111'1111 for a special purpose, the encoding will be limited to seven bytes.
§7. The UTF-8 patterns marked "overlong" are comprehensively prohibited because any use of them would greatly complicate the algorithms for comparison. When the prohibition is observed, two sequences of UTF-8-encoded Unicode characters can be lexicographically compared byte by byte for the relations equal-to, less-than, and greater-than. The comparison algorithm does not have to know the structure of a UTF-8 encoding; instead it need merely examine the raw byte values one after another.
Also, a naïve use of the prohibited patterns could result in multiple encodings for the same character. The probable errors and possible security risks require that software reading UTF-8 sequences must always detect and reject overlong encodings. Table eight gives examples of misuse, employing the percent sign (%), whose Unicode code point is 0010'0101.
table eight | ||
UTF-8 for % | ref. | comment |
---|---|---|
z010'0101 | A | valid |
wwz0'0000 wz10'0101 | B | overlong |
wwwz'0000 wz00'0000 wz10'0101 | D | overlong |
wwww'z000 wz00'0000 wz00'0000 wz10'0101 | F | overlong |
§8. The usual definition of UTF-8 assumes that computer memory is organized into memory units of eight bits. However, UTF-8 can be adapted to other sizes of memory units, here called hunks. Some examples are presented in table nine, starting with nine bits per hunk and working down. The practical minumum is around five bits.
table nine | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
§9. There is a variant of UTF-8 that will support codes up to 65 bits, which may be helpful if Unicode is ever extended from its current 32-bit format to a 64-bit format. Table ten explains it, including regular UTF-8 for comparison. To make the pattern clearer, superscript notation is used for repeated bits, and the extension byte is shown at its place in numerical order.
65-bit variant | table ten | ordinary UTF-8 | ||
---|---|---|---|---|
total bits | first byte of sequence | number of extension bytes | first byte of sequence | total bits |
7 | z d7 | 0 | z d7 | 7 |
— | w1zz d5 | (extension byte itself) | w1z d6 | — |
5 + 1 × 5 = 10 | w1zw d5 | 1 | w2z d5 | 5 + 1 × 6 = 11 |
4 + 2 × 5 = 14 | w2zz d4 | 2 | w3z d4 | 4 + 2 × 6 = 16 |
4 + 3 × 5 = 19 | w2zw d4 | 3 | w4z d3 | 3 + 3 × 6 = 21 |
3 + 4 × 5 = 23 | w3zz d3 | 4 | w5z d2 | 2 + 4 × 6 = 26 |
3 + 5 × 5 = 28 | w3zw d3 | 5 | w6z d1 | 1 + 5 × 6 = 31 |
2 + 6 × 5 = 32 | w4zz d2 | 6 | w7z | 6 × 6 = 36 |
2 + 7 × 5 = 37 | w4zw d2 | 7 | w8 | 7 × 6 = 42 |
1 + 8 × 5 = 41 | w5zz d1 | 8 | ||
1 + 9 × 5 = 46 | w5zw d1 | 9 | ||
10 × 5 = 50 | w6zz | 10 | ||
11 × 5 = 55 | w6zw | 11 | ||
12 × 5 = 60 | w7z | 12 | ||
13 × 5 = 65 | w7w | 13 |