Extending UTF-8 to eight bytes.
Version of Sunday 22 January 2020.
Dave Barber's other pages.
§1. UTF-8 is a system of variable-length character encoding used extensively on the Internet and elsewhere for representing the characters of Unicode. It serves as a lossless compression scheme, useful because a Unicode code point can require as many as 21 bits (which may be increased in the future), but many of the most-used characters can feasibly be represented in as few as 8 bits.

In contrast is UTF-32, which allocates 32 bits for every code point, and performs no compression. Although UTF-32 is simple, it is not space-efficient. Also, there are compression schemes other than UTF-8, most notably UTF-16.

A four-byte version of UTF-8 is detailed at RFC 3629; four bytes are sufficient for the current size of the Unicode character set. An earlier six-byte version is discussed at RFC 2279. Beyond that, there is an obvious way to extend UTF-8 to eight bytes, maintaining full compatibility, in order to allow encoding as many as 242 characters. That is the subject of this report. A secondary purpose of this presentation is to reveal, in bit-by-bit detail, how UTF-8 sequences are structured.

The main tables below are written using binary digits, arranged in eight-bit bytes, most significant bit first. The symbols '0' and '1' have their customary meaning. However, two additional symbols require an explanation:

Symbols 'z' and 'w' are used when a bit's role is forming the structure of the UTF-8 encoding, in contrast to representing the value being encoded. This substitution may help the reader understand the system. Finally, 'b' stands for a value bit of either '0' or '1' when distinction need not be made.

An apostrophe is inserted at the middle of each byte for ease of reading.


§2. In this exension of UTF-8, each character is represented by a sequence of bytes, quantity one to eight. The first byte indicates the total number of bytes in the sequence, and may also contain some or all of the bits that indicate the value of the character being encoded (written 'b'). Any remaining bytes are in the format of an extension byte.

Table one shows the format of a byte:

table one
first byte
of sequence
total number of
bytes in sequence
number of
extension bytes
zbbb'bbbb10
wwzb'bbbb21
wwwz'bbbb32
wwww'zbbb43
wwww'wzbb54
wwww'wwzb65
wwww'wwwz76
wwww'wwww87
The format of any extension byte is wzbb'bbbb.


§3. Table two contains the encoding patterns for eight-byte UTF-8. Underscored and in boldface are the bits that are added at each level; they usually number five, but they can be four or six. The rightmost column shows the number of bits that the pattern can hold -- in other words, the number of significant figures. Patterns marked "overlong" are prohibited for any use. The rows of the table are lettered for convenient reference.

table two
eight-byte UTF-8ref.code pointsbits
z000'0000
thru z111'1111
A 0000'0000
thru 0111'1111
7
wwz0'0000 wz00'0000
thru wwz0'0001 wz11'1111
B overlong 
wwz0'0010 wz00'0000
thru wwz1'1111 wz11'1111
C 0000'0000 1000'0000
thru 0000'0111 1111'1111
11
wwwz'0000 wz00'0000 wz00'0000
thru wwwz'0000 wz01'1111 wz11'1111
D overlong 
wwwz'0000 wz10'0000 wz00'0000
thru wwwz'1111 wz11'1111 wz11'1111
E 0000'1000 0000'0000
thru 1111'1111 1111'1111
16
wwww'z000 wz00'0000 wz00'0000 wz00'0000
thru wwww'z000 wz00'1111 wz11'1111 wz11'1111
F overlong 
wwww'z000 wz01'0000 wz00'0000 wz00'0000
thru wwww'z111 wz11'1111 wz11'1111 wz11'1111
G 0000'0001 0000'0000 0000'0000
thru 0001'1111 1111'1111 1111'1111
21
four-byte UTF-8 stops here
wwww'wz00 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wz00 wz00'0111 wz11'1111 wz11'1111 wz11'1111
H overlong 
wwww'wz00 wz00'1000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wz11 wz11'1111 wz11'1111 wz11'1111 wz11'1111
I 0000'0000 0010'0000 0000'0000 0000'0000
thru 0000'0011 1111'1111 1111'1111 1111'1111
26
wwww'wwz0 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwz0 wz00'0011 wz11'1111 wz11'1111 wz11'1111 wz11'1111
J overlong 
wwww'wwz0 wz00'0100 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwz1 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111
K 0000'0100 0000'0000 0000'0000 0000'0000
thru 0111'1111 1111'1111 1111'1111 1111'1111
31
six-byte UTF-8 stops here
wwww'wwwz wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwwz wz00'0001 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111
L overlong 
wwww'wwwz wz00'0010 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwwz wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111
M 0000'0000 1000'0000 0000'0000 0000'0000 0000'0000
thru 0000'1111 1111'1111 1111'1111 1111'1111 1111'1111
36
wwww'wwww wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwww wz00'0000 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111
N overlong 
wwww'wwww wz00'0001 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwww wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111
O 0000'0000 0001'0000 0000'0000 0000'0000 0000'0000 0000'0000
thru 0000'0011 1111'1111 1111'1111 1111'1111 1111'1111 1111'1111
42

Table three is similar to table two, but written in hexadecimal. Symbols 'z' and 'w' are therefore not used.

table three
eight-byte UTF-8ref.code points
00
thru 7F
A 00
thru 7F
C0 80
thru C1 BF
B overlong
C2 80
thru DF BF
C 00 80
thru 07 FF
E0 80 80
thru E0 9F BF
D overlong
E0 A0 80
thru EF BF BF
E 08 00
thru FF FF
F0 80 80 80
thru F0 8F BF BF
F overlong
F0 90 80 80
thru F7 BF BF BF
G 01 00 00
thru 1F FF FF
four-byte UTF-8 stops here
F8 80 80 80 80
thru F8 87 BF BF BF
H overlong
F8 88 80 80 80
thru FB BF BF BF BF
I 00 20 00 00
thru 03 FF FF FF
FC 80 80 80 80 80
thru FC 83 BF BF BF BF
J overlong
FC 84 80 80 80 80
thru FD BF BF BF BF BF
K 04 00 00 00
thru 7F FF FF FF
six-byte UTF-8 stops here
FE 80 80 80 80 80 80
thru FE 81 BF BF BF BF BF
L overlong
FE 82 80 80 80 80 80
thru FE BF BF BF BF BF BF
M 00 80 00 00 00
thru 0F FF FF FF FF
FF 80 80 80 80 80 80 80
thru FF 80 BF BF BF BF BF BF
N overlong
FF 81 80 80 80 80 80 80
thru FF BF BF BF BF BF BF BF
O 00 10 00 00 00 00
thru 03 FF FF FF FF FF


§4. Table four gives an example, using row C, of how the significant bits (eleven in this case) of the raw code point are merely copied -- never changed -- into the bits of the UTF-8 form. The bits remain in their original sequence, as indicated by the subscripts. A hash mark is written between bytes for clarity.

table four
eight-byte UTF-8ref.code points
wwz0'0010 # wz00'0000
thru wwz1'1111 # wz11'1111
C
as above
0000'0000 # 1000'0000
thru 0000'0111 # 1111'1111
wwz 0  ' 0 0 1 0  # wz 0 0  ' 0 0 0 0 
thru wwz 1  ' 1 1 1 1  # wz 1 1  ' 1 1 1 1 
C
expanded
0000 ' 0 0 0 0  # 1 0 0 0  ' 0 0 0 0 
thru 0000 ' 0 1 1 1  # 1 1 1 1  ' 1 1 1 1 
wwz bA ' b9b8b7b6 # wz b5b4 ' b3b2b1b0 C
subscripted
0000 ' 0 bAb9b8 # b7b6b5b4 ' b3b2b1b0

This scheme facilitates use of bit-shifting operations, for which typical computer hardware offers speedy instructions.


§5. The Unicode standard does not use all of row G in the charts above. The highest code point that could be supported in the four-byte version is (in hex) 1F FF FF, but Unicode stops at 10 FF FF, because that leaves plenty of room for all the currently defined characters. Tables five and six are two ways of showing the split:

table five
eight-byte UTF-8ref.code points
wwww'z000 wz01'0000 wz00'0000 wz00'0000
thru wwww'z111 wz11'1111 wz11'1111 wz11'1111
G complete
as above
0000'0001 0000'0000 0000'0000
thru 0001'1111 1111'1111 1111'1111
wwww'z000 wz01'0000 wz00'0000 wz00'0000
thru wwww'z100 wz00'1111 wz11'1111 wz11'1111
G lower
Unicode
0000'0001 0000'0000 0000'0000
thru 0001'0000 1111'1111 1111'1111
wwww'z100 wz01'0000 wz00'0000 wz00'0000
thru wwww'z111 wz11'1111 wz11'1111 wz11'1111
G upper
not Unicode
0001'0001 0000'0000 0000'0000
thru 0001'1111 1111'1111 1111'1111

table six
eight-byte UTF-8ref.code points
F0 90 80 80
thru F7 BF BF BF
G complete
as above
01 00 00
thru 1F FF FF
F0 90 80 80
thru F4 8F BF BF
G lower
Unicode
01 00 00
thru 10 FF FF
F4 90 80 80
thru F7 BF BF BF
G upper
not Unicode
11 00 00
thru 1F FF FF


§6. Within any encoding, each bit pattern can appear in only certain places, if at all. This fact aids in detecting errors, and in salvaging as much as possible from a detective series of encodings.

table seven
Any sequence of length one has the format 0bbb'bbbb.

The extension byte, whose format is 10bb'bbbb, will appear in any sequence of at least two bytes.

Further …

this bit pattern cannot appear anywhere in a sequence of underref.
110b'bbbb2 bytesC
1110'bbbb3 bytesE
1111'0bbb4 bytesG
four-byte UTF-8 stops here
1111'10bb5 bytesI
1111'110b6 bytesK
six-byte UTF-8 stops here
1111'11107 bytesM
1111'11118 bytesO

If a communications channel summarily intercepts 1111'1111 for a special purpose, the encoding will be limited to seven bytes.


§7. The UTF-8 patterns marked "overlong" are comprehensively prohibited because any use of them would greatly complicate the algorithms for comparison. When the prohibition is observed, two sequences of UTF-8-encoded Unicode characters can be lexicographically compared byte by byte for the relations equal-to, less-than, and greater-than. The comparison algorithm does not have to know the structure of a UTF-8 encoding; instead it need merely examine the raw byte values one after another.

Also, a naïve use of the prohibited patterns could result in multiple encodings for the same character. The probable errors and possible security risks require that software reading UTF-8 sequences must always detect and reject overlong encodings. Table eight gives examples of misuse, employing the percent sign (%), whose Unicode code point is 0010'0101.

table eight
UTF-8 for %ref.comment
z010'0101A valid
wwz0'0000 wz10'0101B overlong
wwwz'0000 wz00'0000 wz10'0101D overlong
wwww'z000 wz00'0000 wz00'0000 wz10'0101F overlong


§8. UTF-8 can be extended to storage units larger or smaller than the eight-bit byte, as discussed on another page.