Extending UTF-8 to eight bytes.
Version of Saturday 18 February 2020.
Dave Barber's other pages.

§1. UTF-8 is a system of variable-length character encoding used extensively on the Internet and elsewhere for representing the characters of Unicode. It serves as a lossless compression scheme, useful because a Unicode code point can require as many as 21 bits (which may be increased in the future), but many of the most-used characters can feasibly be represented in as few as 8 bits.

In contrast is UTF-32, which allocates 32 bits for every code point, and performs no compression. Although UTF-32 is simple, it is not space-efficient. Besides that, there are compression schemes other than UTF-8, most notably UTF-16.

A four-byte version of UTF-8 is detailed at RFC 3629; four bytes (indeed, three) are sufficient for the current size of the Unicode character set. An earlier six-byte version is discussed at RFC 2279. Beyond that, there is an obvious way to extend UTF-8 to eight bytes, maintaining full compatibility, in order to allow encoding as many as 242 characters. That is the subject of this report. A secondary purpose of this presentation is to reveal, in bit-by-bit detail, how UTF-8 sequences are structured.

The main tables below are written using binary digits, arranged in eight-bit bytes, most significant bit first. The symbols '0' and '1' have their customary meaning. However, two additional symbols require an explanation:

Symbols 'z' and 'w' are used when a bit's role is forming the structure of the UTF-8 encoding, in contrast to representing the value being encoded. This substitution may help the reader understand the system. Finally, 'd' (for "data") stands for a value bit of either '0' or '1' when distinction need not be made.

An apostrophe is inserted at the middle of each byte for ease of reading.


§2. In this exension of UTF-8, each character is represented by a sequence of bytes, quantity one to eight. The first byte indicates the total number of bytes in the sequence, and may also contain some or all of the bits that indicate the value of the character being encoded (written 'd'). Any remaining bytes are in the format of an extension byte.

Table one shows the format of a byte:

table one
first byte
of sequence
total number of
bytes in sequence
number of
extension bytes
zddd'dddd10
wwzd'dddd21
wwwz'dddd32
wwww'zddd43
wwww'wzdd54
wwww'wwzd65
wwww'wwwz76
wwww'wwww87
The format of any extension byte is wzdd'dddd.


§3. Table two contains the encoding patterns for eight-byte UTF-8. Underscored and in boldface are the bits that are added at each level; they usually number five, but they can be four or six. The rightmost column shows the number of bits that the pattern can hold -- in other words, the number of significant figures. Patterns marked "overlong" are prohibited for any use. The rows of the table are lettered for reference.

table two
eight-byte
UTF-8
ref. code
points
net
bits
gross
bits
z000'0000
thru z111'1111
A 0000'0000
thru 0111'1111
78
wwz0'0000 wz00'0000
thru wwz0'0001 wz11'1111
B overlong  
wwz0'0010 wz00'0000
thru wwz1'1111 wz11'1111
C 0000'0000 1000'0000
thru 0000'0111 1111'1111
1116
wwwz'0000 wz00'0000 wz00'0000
thru wwwz'0000 wz01'1111 wz11'1111
D overlong  
wwwz'0000 wz10'0000 wz00'0000
thru wwwz'1111 wz11'1111 wz11'1111
E 0000'1000 0000'0000
thru 1111'1111 1111'1111
1616
wwww'z000 wz00'0000 wz00'0000 wz00'0000
thru wwww'z000 wz00'1111 wz11'1111 wz11'1111
F overlong  
wwww'z000 wz01'0000 wz00'0000 wz00'0000
thru wwww'z111 wz11'1111 wz11'1111 wz11'1111
G 0000'0001 0000'0000 0000'0000
thru 0001'1111 1111'1111 1111'1111
2124
four-byte UTF-8 stops here
wwww'wz00 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wz00 wz00'0111 wz11'1111 wz11'1111 wz11'1111
H overlong  
wwww'wz00 wz00'1000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wz11 wz11'1111 wz11'1111 wz11'1111 wz11'1111
I 0000'0000 0010'0000 0000'0000 0000'0000
thru 0000'0011 1111'1111 1111'1111 1111'1111
2632
wwww'wwz0 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwz0 wz00'0011 wz11'1111 wz11'1111 wz11'1111 wz11'1111
J overlong  
wwww'wwz0 wz00'0100 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwz1 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111
K 0000'0100 0000'0000 0000'0000 0000'0000
thru 0111'1111 1111'1111 1111'1111 1111'1111
3132
six-byte UTF-8 stops here
wwww'wwwz wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwwz wz00'0001 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111
L overlong  
wwww'wwwz wz00'0010 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwwz wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111
M 0000'0000 1000'0000 0000'0000 0000'0000 0000'0000
thru 0000'1111 1111'1111 1111'1111 1111'1111 1111'1111
3640
wwww'wwww wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwww wz00'0000 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111
N overlong  
wwww'wwww wz00'0001 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000
thru wwww'wwww wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111
O 0000'0000 0001'0000 0000'0000 0000'0000 0000'0000 0000'0000
thru 0000'0011 1111'1111 1111'1111 1111'1111 1111'1111 1111'1111
4248

Table three is similar to table two, but written in hexadecimal. Therefore the symbols 'z' and 'w' cannot be used.

table three
eight-byte UTF-8ref.code points
00
thru 7F
A 00
thru 7F
C0 80
thru C1 BF
B overlong
C2 80
thru DF BF
C 00 80
thru 07 FF
E0 80 80
thru E0 9F BF
D overlong
E0 A0 80
thru EF BF BF
E 08 00
thru FF FF
F0 80 80 80
thru F0 8F BF BF
F overlong
F0 90 80 80
thru F7 BF BF BF
G 01 00 00
thru 1F FF FF
four-byte UTF-8 stops here
F8 80 80 80 80
thru F8 87 BF BF BF
H overlong
F8 88 80 80 80
thru FB BF BF BF BF
I 00 20 00 00
thru 03 FF FF FF
FC 80 80 80 80 80
thru FC 83 BF BF BF BF
J overlong
FC 84 80 80 80 80
thru FD BF BF BF BF BF
K 04 00 00 00
thru 7F FF FF FF
six-byte UTF-8 stops here
FE 80 80 80 80 80 80
thru FE 81 BF BF BF BF BF
L overlong
FE 82 80 80 80 80 80
thru FE BF BF BF BF BF BF
M 00 80 00 00 00
thru 0F FF FF FF FF
FF 80 80 80 80 80 80 80
thru FF 80 BF BF BF BF BF BF
N overlong
FF 81 80 80 80 80 80 80
thru FF BF BF BF BF BF BF BF
O 00 10 00 00 00 00
thru 03 FF FF FF FF FF


§4. Table four gives an example, using row C, of how the significant bits (eleven in this case) of the raw code point are merely copied -- never changed -- into the bits of the UTF-8 form. The bits remain in their original sequence, as indicated by the subscripts. A hash mark is written between bytes for clarity.

table four
eight-byte UTF-8ref.code points
wwz0'0010 # wz00'0000
thru wwz1'1111 # wz11'1111
C
as above
0000'0000 # 1000'0000
thru 0000'0111 # 1111'1111
wwz 0  ' 0 0 1 0  # wz 0 0  ' 0 0 0 0 
thru wwz 1  ' 1 1 1 1  # wz 1 1  ' 1 1 1 1 
C
expanded
0000 ' 0 0 0 0  # 1 0 0 0  ' 0 0 0 0 
thru 0000 ' 0 1 1 1  # 1 1 1 1  ' 1 1 1 1 
wwz dA ' d9d8d7d6 # wz d5d4 ' d3d2d1d0 C
subscripted
0000 ' 0 dAd9d8 # d7d6d5d4 ' d3d2d1d0

This scheme facilitates use of bit-shifting and bit-masking operations, for which typical computer hardware offers speedy instructions.


§5. The Unicode standard does not use all of row G in the charts above. The highest code point that could be supported in the four-byte version is (in hex) 1F FF FF, but Unicode stops at 10 FF FF, because that leaves plenty of room for all the currently defined characters. Tables five and six are two ways of showing the split:

table five
eight-byte UTF-8ref.code points
wwww'z000 wz01'0000 wz00'0000 wz00'0000
thru wwww'z111 wz11'1111 wz11'1111 wz11'1111
G complete
as above
0000'0001 0000'0000 0000'0000
thru 0001'1111 1111'1111 1111'1111
wwww'z000 wz01'0000 wz00'0000 wz00'0000
thru wwww'z100 wz00'1111 wz11'1111 wz11'1111
G lower
Unicode
0000'0001 0000'0000 0000'0000
thru 0001'0000 1111'1111 1111'1111
wwww'z100 wz01'0000 wz00'0000 wz00'0000
thru wwww'z111 wz11'1111 wz11'1111 wz11'1111
G upper
not Unicode
0001'0001 0000'0000 0000'0000
thru 0001'1111 1111'1111 1111'1111

table six
eight-byte UTF-8ref.code points
F0 90 80 80
thru F7 BF BF BF
G complete
as above
01 00 00
thru 1F FF FF
F0 90 80 80
thru F4 8F BF BF
G lower
Unicode
01 00 00
thru 10 FF FF
F4 90 80 80
thru F7 BF BF BF
G upper
not Unicode
11 00 00
thru 1F FF FF


§6. Within any encoding, each bit pattern can appear in only certain places, if at all. This fact aids in detecting errors, and in salvaging as much as possible from a detective series of encodings.

table seven
Any sequence of length one has the format 0ddd'dddd.

The extension byte, whose format is 10dd'dddd, will appear in any sequence of at least two bytes.

Further …

this bit pattern cannot appear anywhere in a sequence of underref.
110d'dddd2 bytesC
1110'dddd3 bytesE
1111'0ddd4 bytesG
1111'10dd5 bytesI
1111'110d6 bytesK
1111'11107 bytesM
1111'11118 bytesO

If a communications channel summarily intercepts 1111'1111 for a special purpose, the encoding will be limited to seven bytes.


§7. The UTF-8 patterns marked "overlong" are comprehensively prohibited because any use of them would greatly complicate the algorithms for comparison. When the prohibition is observed, two sequences of UTF-8-encoded Unicode characters can be lexicographically compared byte by byte for the relations equal-to, less-than, and greater-than. The comparison algorithm does not have to know the structure of a UTF-8 encoding; instead it need merely examine the raw byte values one after another.

Also, a naïve use of the prohibited patterns could result in multiple encodings for the same character. The probable errors and possible security risks require that software reading UTF-8 sequences must always detect and reject overlong encodings. Table eight gives examples of misuse, employing the percent sign (%), whose Unicode code point is 0010'0101.

table eight
UTF-8 for %ref.comment
z010'0101A valid
wwz0'0000 wz10'0101B overlong
wwwz'0000 wz00'0000 wz10'0101D overlong
wwww'z000 wz00'0000 wz00'0000 wz10'0101F overlong


§8. The usual definition of UTF-8 assumes that computer memory is organized into memory units of eight bits. However, UTF-8 can be adapted to other sizes of memory units, here called hunks. Some examples are presented in table nine, starting with nine bits per hunk and working down. The practical minumum is around five bits.

 
table nine
 
9 bits
per
hunk
total bits
8z'dddd'dddd
13w'wzdd'ddddw'zddd'dddd
19w'wwzd'ddddw'zddd'dddd × 2
25w'wwwz'ddddw'zddd'dddd × 3
31w'wwww'zdddw'zddd'dddd × 4
37w'wwww'wzddw'zddd'dddd × 5
43w'wwww'wwzdw'zddd'dddd × 6
49w'wwww'wwwzw'zddd'dddd × 7
56w'wwww'wwwww'zddd'dddd × 8
8 bits
per
hunk
total bits same as the system above
7zddd'dddd
11wwzd'ddddwzdd'dddd
16wwwz'ddddwzdd'dddd × 2
21wwww'zdddwzdd'dddd × 3
26wwww'wzddwzdd'dddd × 4
31wwww'wwzdwzdd'dddd × 5
36wwww'wwwzwzdd'dddd × 6
42wwww'wwwwwzdd'dddd × 7
7 bits
per
hunk
total bits
6zdd'dddd
9wwz'ddddwzd'dddd
13www'zdddwzd'dddd × 2
17www'wzddwzd'dddd × 3
21www'wwzdwzd'dddd × 4
25www'wwwzwzd'dddd × 5
30www'wwwwwzd'dddd × 6
6 bits
per
hunk
total bits
5zd'dddd
7ww'zdddwz'dddd
10ww'wzddwz'dddd × 2
13ww'wwzdwz'dddd × 3
16ww'wwwzwz'dddd × 4
20ww'wwwwwz'dddd × 5
5 bits
per
hunk
total bits
4z'dddd
5w'wzddw'zddd
7w'wwzdw'zddd × 2
9w'wwwzw'zddd × 3
12w'wwwww'zddd × 4
4 bits
per
hunk
total bits
3zddd
3wwzdwzdd
4wwwzwzdd × 2
6wwwwwzdd × 3
3 bits
per
hunk
total bits
2zdd
1wwzwzd
2wwwwzd × 2
2 bits
per
hunk
total bits
1zd
0wwwz
1 bit
per
hunk
total bits
0z


§9. There is a variant of UTF-8 that will support codes up to 65 bits, which may be helpful if Unicode is ever extended from its current 32-bit format to a 64-bit format. Table ten explains it, including regular UTF-8 for comparison. To make the pattern clearer, superscript notation is used for repeated bits, and the extension byte is shown at its place in numerical order.

65-bit
variant
table ten ordinary
UTF-8
total bits first byte
of sequence
number of
extension bytes
first byte
of sequence
total bits
7 z d7 0 z d7 7
w1zz d5 (extension
byte itself)
w1z d6
5 + 1 × 5 = 10 w1zw d5 1 w2z d5 5 + 1 × 6 = 11
4 + 2 × 5 = 14 w2zz d4 2 w3z d4 4 + 2 × 6 = 16
4 + 3 × 5 = 19 w2zw d4 3 w4z d3 3 + 3 × 6 = 21
3 + 4 × 5 = 23 w3zz d3 4 w5z d2 2 + 4 × 6 = 26
3 + 5 × 5 = 28 w3zw d3 5 w6z d1 1 + 5 × 6 = 31
2 + 6 × 5 = 32 w4zz d2 6 w7z 6 × 6 = 36
2 + 7 × 5 = 37 w4zw d2 7 w8 7 × 6 = 42
1 + 8 × 5 = 41 w5zz d1 8
1 + 9 × 5 = 46 w5zw d1 9
10 × 5 = 50 w6zz 10
11 × 5 = 55 w6zw 11
12 × 5 = 60 w7z 12
13 × 5 = 65 w7w 13