8-byte UTF-8

Extending UTF-8 to eight bytes.
Version of Saturday 18 February 2020.
Dave Barber's other pages.

§1. UTF-8 is a system of variable-length character encoding used extensively on the Internet and elsewhere for representing the characters of Unicode. It serves as a lossless compression scheme, useful because a Unicode code point can require as many as 21 bits (which may be increased in the future), but many of the most-used characters can feasibly be represented in as few as 8 bits.

In contrast is UTF-32, which allocates 32 bits for every code point, and performs no compression. Although UTF-32 is simple, it is not space-efficient. Besides that, there are compression schemes other than UTF-8, most notably UTF-16.

A four-byte version of UTF-8 is detailed in RFC 3629; four bytes are sufficient for the current size of the Unicode character set. An earlier six-byte version is discussed in RFC 2279. Beyond that, there is an obvious way to extend UTF-8 to eight bytes, maintaining full compatibility, in order to allow encoding as many as 2⁴² characters. That is the subject of this report. A secondary purpose of this presentation is to reveal, in bit-by-bit detail, how UTF-8 sequences are structured.

The main tables below are written using binary digits, arranged in eight-bit bytes, most significant bit first. The symbols '0' and '1' have their customary meaning. However, two additional symbols require an explanation:

'z' = '0' ("zero");
'w' = '1' ("won" is a homophone of "one").

Symbols 'z' and 'w' are used when a bit's role is forming the structure of the UTF-8 encoding, in contrast to representing the value being encoded. This substitution may help the reader understand the system. Finally, 'd' (for "data") stands for a value bit of either '0' or '1' when distinction need not be made.

An apostrophe is inserted at the middle of each byte for ease of reading.

§2. In this extension of UTF-8, each character is represented by a sequence of one to eight bytes. The first byte indicates the total number of bytes in the sequence, and may also contain some or all of the bits that indicate the value of the character being encoded (written 'd'). Any remaining bytes are in the format of an extension byte.

Table one shows the format of the first byte:

table one
first byte of sequence	total number of bytes in sequence	number of extension bytes
`zddd'dddd`	1	0
`wwzd'dddd`	2	1
`wwwz'dddd`	3	2
`wwww'zddd`	4	3
`wwww'wzdd`	5	4
`wwww'wwzd`	6	5
`wwww'wwwz`	7	6
`wwww'wwww`	8	7
The format of any extension byte is `wzdd'dddd`.
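
Table one can be read off mechanically by counting the leading one-bits ('w') of the first byte. The following is a minimal Python sketch of that reading; the function name `sequence_length` is an illustrative choice, not part of any standard.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total number of bytes announced by a first byte (table one).

    Raises ValueError for an extension byte (wzdd'dddd), which cannot
    begin a sequence.
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte")
    # Count the leading 'w' (one) bits, most significant bit first.
    leading_ones = 0
    mask = 0x80
    while mask and first_byte & mask:
        leading_ones += 1
        mask >>= 1
    if leading_ones == 0:
        return 1                      # zddd'dddd
    if leading_ones == 1:
        raise ValueError("extension byte cannot begin a sequence")
    return leading_ones               # wwz..., wwwz..., up to wwww'wwww
```

The number of extension bytes is simply the result minus one.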

§3. Table two contains the encoding patterns for eight-byte UTF-8. Underscored and in boldface are the bits that are added at each level; they usually number five, but they can be four or six. The "net bits" column shows the number of bits that the pattern can hold, in other words the number of significant figures; the "gross bits" column shows the width of the code-point field as written, which is always a whole number of bytes. Patterns marked "overlong" are prohibited for any use. The rows of the table are lettered for reference.

table two
eight-byte UTF-8	ref.	code points	net bits	gross bits
`z000'0000 thru z111'1111`	A	`0000'0000 thru 0111'1111`	7	8
`wwz0'0000 wz00'0000 thru wwz0'0001 wz11'1111`	B	overlong
`wwz0'0010 wz00'0000 thru wwz1'1111 wz11'1111`	C	`0000'0000 1000'0000 thru 0000'0111 1111'1111`	11	16
`wwwz'0000 wz00'0000 wz00'0000 thru wwwz'0000 wz01'1111 wz11'1111`	D	overlong
`wwwz'0000 wz10'0000 wz00'0000 thru wwwz'1111 wz11'1111 wz11'1111`	E	`0000'1000 0000'0000 thru 1111'1111 1111'1111`	16	16
`wwww'z000 wz00'0000 wz00'0000 wz00'0000 thru wwww'z000 wz00'1111 wz11'1111 wz11'1111`	F	overlong
`wwww'z000 wz01'0000 wz00'0000 wz00'0000 thru wwww'z111 wz11'1111 wz11'1111 wz11'1111`	G	`0000'0001 0000'0000 0000'0000 thru 0001'1111 1111'1111 1111'1111`	21	24
four-byte UTF-8 stops here
`wwww'wz00 wz00'0000 wz00'0000 wz00'0000 wz00'0000 thru wwww'wz00 wz00'0111 wz11'1111 wz11'1111 wz11'1111`	H	overlong
`wwww'wz00 wz00'1000 wz00'0000 wz00'0000 wz00'0000 thru wwww'wz11 wz11'1111 wz11'1111 wz11'1111 wz11'1111`	I	`0000'0000 0010'0000 0000'0000 0000'0000 thru 0000'0011 1111'1111 1111'1111 1111'1111`	26	32
`wwww'wwz0 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 thru wwww'wwz0 wz00'0011 wz11'1111 wz11'1111 wz11'1111 wz11'1111`	J	overlong
`wwww'wwz0 wz00'0100 wz00'0000 wz00'0000 wz00'0000 wz00'0000 thru wwww'wwz1 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111`	K	`0000'0100 0000'0000 0000'0000 0000'0000 thru 0111'1111 1111'1111 1111'1111 1111'1111`	31	32
six-byte UTF-8 stops here
`wwww'wwwz wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 thru wwww'wwwz wz00'0001 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111`	L	overlong
`wwww'wwwz wz00'0010 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 thru wwww'wwwz wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111`	M	`0000'0000 1000'0000 0000'0000 0000'0000 0000'0000 thru 0000'1111 1111'1111 1111'1111 1111'1111 1111'1111`	36	40
`wwww'wwww wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 thru wwww'wwww wz00'0000 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111`	N	overlong
`wwww'wwww wz00'0001 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 wz00'0000 thru wwww'wwww wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111 wz11'1111`	O	`0000'0000 0001'0000 0000'0000 0000'0000 0000'0000 0000'0000 thru 0000'0011 1111'1111 1111'1111 1111'1111 1111'1111 1111'1111`	42	48
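
The rows of table two translate into a short encoder. What follows is a sketch in Python under the assumption that the value to encode is a plain non-negative integer; the names `NET_BITS` and `encode` are hypothetical. Choosing the shortest length that can hold the value is what keeps the result out of the overlong rows.

```python
# Net bits available in a sequence of 1..8 bytes (rows A, C, E, G, I, K, M, O).
NET_BITS = [7, 11, 16, 21, 26, 31, 36, 42]

def encode(value: int) -> bytes:
    """Encode a non-negative integer below 2**42 in its shortest form."""
    if not 0 <= value < 1 << 42:
        raise ValueError("value out of range for eight-byte UTF-8")
    # Shortest length whose net bits can hold the value.
    length = next(n for n, bits in enumerate(NET_BITS, start=1)
                  if value < 1 << bits)
    if length == 1:
        return bytes([value])         # zddd'dddd
    # Extension bytes carry six value bits each; collect them low bits first.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))
        value >>= 6
    # First byte: 'length' one-bits, a zero bit when there is room for it,
    # then whatever value bits remain.
    prefix = (0xFF << (8 - length)) & 0xFF
    return bytes([prefix | value] + tail[::-1])
```

For values within Unicode's range the output should agree with an ordinary UTF-8 encoder; for example, `encode(0x20AC)` gives the same three bytes as `"€".encode("utf-8")`.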

Table three is similar to table two, but written in hexadecimal. Therefore the symbols 'z' and 'w' cannot be used.

table three
eight-byte UTF-8	ref.	code points
`00 thru 7F`	A	`00 thru 7F`
`C0 80 thru C1 BF`	B	overlong
`C2 80 thru DF BF`	C	`00 80 thru 07 FF`
`E0 80 80 thru E0 9F BF`	D	overlong
`E0 A0 80 thru EF BF BF`	E	`08 00 thru FF FF`
`F0 80 80 80 thru F0 8F BF BF`	F	overlong
`F0 90 80 80 thru F7 BF BF BF`	G	`01 00 00 thru 1F FF FF`
four-byte UTF-8 stops here
`F8 80 80 80 80 thru F8 87 BF BF BF`	H	overlong
`F8 88 80 80 80 thru FB BF BF BF BF`	I	`00 20 00 00 thru 03 FF FF FF`
`FC 80 80 80 80 80 thru FC 83 BF BF BF BF`	J	overlong
`FC 84 80 80 80 80 thru FD BF BF BF BF BF`	K	`04 00 00 00 thru 7F FF FF FF`
six-byte UTF-8 stops here
`FE 80 80 80 80 80 80 thru FE 81 BF BF BF BF BF`	L	overlong
`FE 82 80 80 80 80 80 thru FE BF BF BF BF BF BF`	M	`00 80 00 00 00 thru 0F FF FF FF FF`
`FF 80 80 80 80 80 80 80 thru FF 80 BF BF BF BF BF BF`	N	overlong
`FF 81 80 80 80 80 80 80 thru FF BF BF BF BF BF BF BF`	O	`00 10 00 00 00 00 thru 03 FF FF FF FF FF`

§4. Table four gives an example, using row C, of how the significant bits (eleven in this case) of the raw code point are merely copied, never changed, into the bits of the UTF-8 form. The bits remain in their original sequence, as indicated by the subscripts, which run from 0 for the least significant bit up to A (hexadecimal for ten) for the most significant. A hash mark is written between bytes for clarity.

table four
eight-byte UTF-8	ref.	code points
`wwz0'0010 # wz00'0000 thru wwz1'1111 # wz11'1111`	C as above	`0000'0000 # 1000'0000 thru 0000'0111 # 1111'1111`
`wwz0 ' 0010 # wz00 ' 0000 thru wwz1 ' 1111 # wz11 ' 1111`	C expanded	`0000 ' 0000 # 1000 ' 0000 thru 0000 ' 0111 # 1111 ' 1111`
`wwzd_A ' d₉d₈d₇d₆ # wzd₅d₄ ' d₃d₂d₁d₀`	C subscripted	`0000 ' 0d_Ad₉d₈ # d₇d₆d₅d₄ ' d₃d₂d₁d₀`

This scheme facilitates use of bit-shifting and bit-masking operations, for which typical computer hardware offers speedy instructions.
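
As an illustration of those shifts and masks, here is a small Python sketch restricted to row C. The function names are hypothetical, and the decoder assumes its input is already a well-formed row C sequence; it performs no validation.

```python
def encode_row_c(code_point: int) -> bytes:
    """Encode an 11-bit code point (0x80 through 0x7FF) as two bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("row C covers 0x80 through 0x7FF only")
    first = 0b1100_0000 | (code_point >> 6)            # wwz + bits A..6
    second = 0b1000_0000 | (code_point & 0b0011_1111)  # wz  + bits 5..0
    return bytes([first, second])

def decode_row_c(data: bytes) -> int:
    """Recover the code point from a well-formed two-byte sequence."""
    first, second = data
    # Mask away the structural bits, then put the value bits back in order.
    return ((first & 0b0001_1111) << 6) | (second & 0b0011_1111)
```

The value bits pass through unchanged: one shift and one mask in each direction are all that is needed.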

§5. The Unicode standard does not use all of row G in the charts above. The highest code point that could be supported in the four-byte version is (in hex) 1F FF FF, but Unicode stops at 10 FF FF, because that is the highest code point that UTF-16 can represent. Tables five and six are two ways of showing the split:

table five
eight-byte UTF-8	ref.	code points
`wwww'z000 wz01'0000 wz00'0000 wz00'0000 thru wwww'z111 wz11'1111 wz11'1111 wz11'1111`	G complete as above	`0000'0001 0000'0000 0000'0000 thru 0001'1111 1111'1111 1111'1111`
`wwww'z000 wz01'0000 wz00'0000 wz00'0000 thru wwww'z100 wz00'1111 wz11'1111 wz11'1111`	G lower Unicode	`0000'0001 0000'0000 0000'0000 thru 0001'0000 1111'1111 1111'1111`
`wwww'z100 wz01'0000 wz00'0000 wz00'0000 thru wwww'z111 wz11'1111 wz11'1111 wz11'1111`	G upper not Unicode	`0001'0001 0000'0000 0000'0000 thru 0001'1111 1111'1111 1111'1111`

table six
eight-byte UTF-8	ref.	code points
`F0 90 80 80 thru F7 BF BF BF`	G complete as above	`01 00 00 thru 1F FF FF`
`F0 90 80 80 thru F4 8F BF BF`	G lower Unicode	`01 00 00 thru 10 FF FF`
`F4 90 80 80 thru F7 BF BF BF`	G upper not Unicode	`11 00 00 thru 1F FF FF`
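
The split shown in table six can be tested directly on the bytes, because within row G a larger code point always yields a lexicographically larger sequence. A minimal Python sketch follows; the constant and function names are illustrative assumptions.

```python
G_FIRST = bytes.fromhex("F0908080")       # code point 01 00 00
G_LOWER_LAST = bytes.fromhex("F48FBFBF")  # code point 10 FF FF
G_LAST = bytes.fromhex("F7BFBFBF")        # code point 1F FF FF

def row_g_part(seq: bytes) -> str:
    """Return 'lower' (Unicode) or 'upper' (not Unicode) for a row G sequence."""
    well_formed = (
        len(seq) == 4
        and G_FIRST <= seq <= G_LAST
        and all(b & 0xC0 == 0x80 for b in seq[1:])  # extension bytes only
    )
    if not well_formed:
        raise ValueError("not a row G sequence")
    return "lower" if seq <= G_LOWER_LAST else "upper"
```

For example, `row_g_part(bytes.fromhex("F4908080"))` falls in the upper part, one code point past the end of Unicode.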

§6. Within any encoding, each bit pattern can appear in only certain places, if at all. This fact aids in detecting errors, and in salvaging as much as possible from a defective series of encodings.

table seven
Any sequence of length one has the format `0ddd'dddd`. The extension byte, whose format is `10dd'dddd`, will appear in any sequence of at least two bytes. Further …
this bit pattern cannot appear	anywhere in a sequence of under	ref.
`110d'dddd`	2 bytes	C
`1110'dddd`	3 bytes	E
`1111'0ddd`	4 bytes	G
`1111'10dd`	5 bytes	I
`1111'110d`	6 bytes	K
`1111'1110`	7 bytes	M
`1111'1111`	8 bytes	O

If a communications channel summarily intercepts 1111'1111 for a special purpose, the encoding will be limited to seven bytes.
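
One way to exploit these placement rules when salvaging a damaged byte stream is sketched below in Python. The function `split_sequences` is a hypothetical helper: it cuts the stream at every byte that can begin a sequence and flags each piece as complete or not. It checks structure only and does not test for overlong forms.

```python
def split_sequences(data: bytes):
    """Yield (chunk, ok) pairs; ok is False for a structurally broken chunk."""
    i = 0
    while i < len(data):
        # Number of leading one-bits in the byte at position i.
        leading_ones = 8 - (data[i] ^ 0xFF).bit_length()
        if leading_ones == 1:
            yield data[i:i + 1], False   # extension byte with no first byte
            i += 1
            continue
        length = leading_ones or 1       # zero leading ones: one-byte sequence
        # Accept extension bytes only, and no more than were announced.
        j = i + 1
        while j < i + length and j < len(data) and data[j] & 0xC0 == 0x80:
            j += 1
        yield data[i:j], j - i == length
        i = j
```

Because a first byte can never be mistaken for an extension byte, a single damaged sequence does not prevent the sequences after it from being recovered.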

§7. The UTF-8 patterns marked "overlong" are comprehensively prohibited because any use of them would greatly complicate the algorithms for comparison. When the prohibition is observed, two sequences of UTF-8-encoded Unicode characters can be lexicographically compared byte by byte for the relations equal-to, less-than, and greater-than. The comparison algorithm does not have to know the structure of a UTF-8 encoding; instead it need merely examine the raw byte values one after another.
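
The ordering property can be checked with a few lines of Python. This sketch relies on Python's built-in UTF-8 encoder, which covers only the four-byte form, and the sample code points are arbitrary choices; the last assertion shows how a single overlong form would spoil the ordering.

```python
def utf8(code_point: int) -> bytes:
    # Python's built-in encoder implements the four-byte form of UTF-8.
    return chr(code_point).encode("utf-8")

samples = [0x25, 0x26, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
encoded = [utf8(cp) for cp in samples]

# Plain byte-string comparison, with no knowledge of UTF-8 structure,
# puts the encodings in the same order as the code points.
assert sorted(encoded) == encoded

# An overlong form breaks this: the two-byte form of '%' (C0 A5) sorts after
# the valid one-byte '&' (26), although '%' precedes '&' as a code point.
overlong_percent = bytes([0xC0, 0xA5])
assert 0x25 < 0x26 and overlong_percent > utf8(0x26)
```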

Also, a naïve use of the prohibited patterns could result in multiple encodings for the same character. The probable errors and possible security risks require that software reading UTF-8 sequences always detect and reject overlong encodings. Table eight gives examples of misuse, employing the percent sign (%), whose Unicode code point is 0010'0101.

table eight
UTF-8 for `%`	ref.	comment
`z010'0101`	A	valid
`wwz0'0000 wz10'0101`	B	overlong
`wwwz'0000 wz00'0000 wz10'0101`	D	overlong
`wwww'z000 wz00'0000 wz00'0000 wz10'0101`	F	overlong
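
A decoder can reject every overlong form with a single comparison once the value has been assembled: the value must be at least the smallest one that genuinely needs that many bytes. The Python sketch below assumes the caller passes exactly one sequence; `MIN_VALUE` and `decode_one` are hypothetical names.

```python
# Smallest value that legitimately needs 1..8 bytes (rows A, C, E, G, I, K, M, O).
MIN_VALUE = [0, 1 << 7, 1 << 11, 1 << 16, 1 << 21, 1 << 26, 1 << 31, 1 << 36]

def decode_one(seq: bytes) -> int:
    """Decode exactly one sequence, rejecting malformed and overlong forms."""
    if not seq:
        raise ValueError("empty input")
    leading_ones = 8 - (seq[0] ^ 0xFF).bit_length()
    if leading_ones == 1:
        raise ValueError("extension byte cannot begin a sequence")
    length = leading_ones or 1
    if len(seq) != length:
        raise ValueError("wrong number of bytes")
    # Value bits of the first byte are whatever follows its structural prefix.
    value = seq[0] & (0xFF >> (length + 1)) if length > 1 else seq[0]
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("expected an extension byte")
        value = (value << 6) | (b & 0x3F)
    if value < MIN_VALUE[length - 1]:
        raise ValueError("overlong encoding")
    return value
```

Of the four rows in table eight, only the first is accepted; the other three raise `ValueError`.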

§8. The usual definition of UTF-8 assumes that computer memory is organized into memory units of eight bits. However, UTF-8 can be adapted to other sizes of memory units, here called hunks. Some examples are presented in table nine, starting with nine bits per hunk and working down. The practical minimum is around five bits.

table nine
total bits	9 bits per hunk
8	`z'dddd'dddd`
13	`w'wzdd'dddd` `w'zddd'dddd`
19	`w'wwzd'dddd` `w'zddd'dddd` × 2
25	`w'wwwz'dddd` `w'zddd'dddd` × 3
31	`w'wwww'zddd` `w'zddd'dddd` × 4
37	`w'wwww'wzdd` `w'zddd'dddd` × 5
43	`w'wwww'wwzd` `w'zddd'dddd` × 6
49	`w'wwww'wwwz` `w'zddd'dddd` × 7
56	`w'wwww'wwww` `w'zddd'dddd` × 8

total bits	8 bits per hunk (same as the system above)
7	`zddd'dddd`
11	`wwzd'dddd` `wzdd'dddd`
16	`wwwz'dddd` `wzdd'dddd` × 2
21	`wwww'zddd` `wzdd'dddd` × 3
26	`wwww'wzdd` `wzdd'dddd` × 4
31	`wwww'wwzd` `wzdd'dddd` × 5
36	`wwww'wwwz` `wzdd'dddd` × 6
42	`wwww'wwww` `wzdd'dddd` × 7

total bits	7 bits per hunk
6	`zdd'dddd`
9	`wwz'dddd` `wzd'dddd`
13	`www'zddd` `wzd'dddd` × 2
17	`www'wzdd` `wzd'dddd` × 3
21	`www'wwzd` `wzd'dddd` × 4
25	`www'wwwz` `wzd'dddd` × 5
30	`www'wwww` `wzd'dddd` × 6

total bits	6 bits per hunk
5	`zd'dddd`
7	`ww'zddd` `wz'dddd`
10	`ww'wzdd` `wz'dddd` × 2
13	`ww'wwzd` `wz'dddd` × 3
16	`ww'wwwz` `wz'dddd` × 4
20	`ww'wwww` `wz'dddd` × 5

total bits	5 bits per hunk
4	`z'dddd`
5	`w'wzdd` `w'zddd`
7	`w'wwzd` `w'zddd` × 2
9	`w'wwwz` `w'zddd` × 3
12	`w'wwww` `w'zddd` × 4

total bits	4 bits per hunk
3	`zddd`
3	`wwzd` `wzdd`
4	`wwwz` `wzdd` × 2
6	`wwww` `wzdd` × 3

total bits	3 bits per hunk
2	`zdd`
1	`wwz` `wzd`
2	`www` `wzd` × 2

total bits	2 bits per hunk
1	`zd`
0	`ww` `wz`

total bits	1 bit per hunk
0	`z`
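
The totals in table nine follow one rule: a lone hunk gives up one bit to structure, every extension hunk gives up two, and the first hunk of a longer sequence gives up one bit per hunk in the sequence plus a terminating zero when there is room for it. A Python sketch of that arithmetic, with `total_bits` as a hypothetical name:

```python
def total_bits(hunk_bits: int, hunks: int) -> int:
    """Value bits carried by a sequence of `hunks` hunks of `hunk_bits` bits."""
    if not 1 <= hunks <= hunk_bits:
        raise ValueError("a sequence holds between 1 and hunk_bits hunks")
    if hunks == 1:
        return hunk_bits - 1                   # z followed by value bits
    # First hunk: `hunks` one-bits, then a zero bit if it fits, then value bits.
    first = max(hunk_bits - hunks - 1, 0)
    # Each extension hunk: 'wz' followed by value bits.
    return first + (hunks - 1) * (hunk_bits - 2)
```

With eight bits per hunk this reproduces the net bits of table two, from 7 for one byte up to 42 for eight bytes.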

§9. There is a variant of UTF-8 that will support codes up to 65 bits, which may be helpful if Unicode is ever extended from its current 21-bit code space to a 64-bit format. Table ten explains it, including regular UTF-8 for comparison. To make the pattern clearer, superscript notation is used for repeated bits, and the extension byte is shown at its place in numerical order.

table ten
65-bit variant			ordinary UTF-8	
total bits	first byte of sequence	number of extension bytes	first byte of sequence	total bits
7	`z d⁷`	0	`z d⁷`	7
—	`w¹zz d⁵`	(extension byte itself)	`w¹z d⁶`	—
5 + 1 × 5 = 10	`w¹zw d⁵`	1	`w²z d⁵`	5 + 1 × 6 = 11
4 + 2 × 5 = 14	`w²zz d⁴`	2	`w³z d⁴`	4 + 2 × 6 = 16
4 + 3 × 5 = 19	`w²zw d⁴`	3	`w⁴z d³`	3 + 3 × 6 = 21
3 + 4 × 5 = 23	`w³zz d³`	4	`w⁵z d²`	2 + 4 × 6 = 26
3 + 5 × 5 = 28	`w³zw d³`	5	`w⁶z d¹`	1 + 5 × 6 = 31
2 + 6 × 5 = 32	`w⁴zz d²`	6	`w⁷z`	6 × 6 = 36
2 + 7 × 5 = 37	`w⁴zw d²`	7	`w⁸`	7 × 6 = 42
1 + 8 × 5 = 41	`w⁵zz d¹`	8
1 + 9 × 5 = 46	`w⁵zw d¹`	9
10 × 5 = 50	`w⁶zz`	10
11 × 5 = 55	`w⁶zw`	11
12 × 5 = 60	`w⁷z`	12
13 × 5 = 65	`w⁷w`	13
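
The "total bits" columns of table ten can be reproduced from the number of extension bytes alone. The Python sketch below is only a restatement of the table's arithmetic, with hypothetical function names: in the variant each extension byte carries five value bits and the first byte loses a value bit for every two extension bytes, whereas in ordinary UTF-8 each extension byte carries six and the first byte loses one per extension byte.

```python
def variant_total_bits(extension_bytes: int) -> int:
    """Total value bits in the 65-bit variant (left side of table ten)."""
    k = extension_bytes
    if not 0 <= k <= 13:
        raise ValueError("the variant allows 0 to 13 extension bytes")
    if k == 0:
        return 7
    return max(5 - k // 2, 0) + 5 * k

def ordinary_total_bits(extension_bytes: int) -> int:
    """Total value bits in ordinary eight-byte UTF-8 (right side of table ten)."""
    k = extension_bytes
    if not 0 <= k <= 7:
        raise ValueError("ordinary UTF-8 allows 0 to 7 extension bytes")
    if k == 0:
        return 7
    return max(6 - k, 0) + 6 * k
```

The variant holds slightly less than ordinary UTF-8 at every length the two share, but it continues for six more lengths.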