(Actual technical content starts ~10:30.) I am going into this talk expecting it to be roughly equivalent to Joel Spolsky’s classic “There Ain’t No Such Thing As Plain Text”.
In ASCII, you can transform a character that represents a digit into its integer value by masking off the top four bits, which leaves only the low four.
ASCII value | Binary | Digit |
---|---|---|
48 | 0b0011 0000 | 0 |
49 | 0b0011 0001 | 1 |
50 | 0b0011 0010 | 2 |
51 | 0b0011 0011 | 3 |
52 | 0b0011 0100 | 4 |
53 | 0b0011 0101 | 5 |
54 | 0b0011 0110 | 6 |
55 | 0b0011 0111 | 7 |
56 | 0b0011 1000 | 8 |
57 | 0b0011 1001 | 9 |
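As a minimal sketch of the masking trick in the table above (the function name `digit_value` is made up for illustration), ANDing the character code with `0x0F` keeps the low four bits and drops the `0011` prefix:

```python
def digit_value(ch: str) -> int:
    """Return the integer value of a single ASCII digit character."""
    if len(ch) != 1 or not ("0" <= ch <= "9"):
        raise ValueError(f"not an ASCII digit: {ch!r}")
    # 0x0F == 0b0000_1111: keep the low four bits, clear the top four.
    return ord(ch) & 0x0F


print([digit_value(c) for c in "0123456789"])
```

The range check matters: the mask alone would happily turn other characters into small integers too.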
ASCII value | Binary | Letter |
---|---|---|
65 | 0b0100 0001 | A |
97 | 0b0110 0001 | a |
66 | 0b0100 0010 | B |
98 | 0b0110 0010 | b |
67 | 0b0100 0011 | C |
99 | 0b0110 0011 | c |
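The letter table pairs each uppercase letter with its lowercase partner, and in every pair the two codes differ only in the bit worth 32 (`0x20`). A small sketch of flipping that bit, assuming the input is a single ASCII letter (the names `CASE_BIT` and `toggle_ascii_case` are illustrative, not from the talk):

```python
CASE_BIT = 0x20  # 0b0010_0000, the only bit that differs between 'A' and 'a'


def toggle_ascii_case(ch: str) -> str:
    """Swap the case of a single ASCII letter by flipping one bit."""
    if len(ch) != 1 or not (ch.isascii() and ch.isalpha()):
        raise ValueError(f"not an ASCII letter: {ch!r}")
    return chr(ord(ch) ^ CASE_BIT)


print(toggle_ascii_case("A"), toggle_ascii_case("b"))
```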
Many information systems would strip the 8th bit from data they believed to be ASCII, which was a problem for other text encodings that used the 8th bit. The USSR developed an encoding, KOI8-R (Код Обмена Информацией, 8 бит, “Code for Information Exchange, 8 bit”), which tended to place each Cyrillic letter 128 positions above a Roman letter with a similar sound, so that the two differ only in the 8th bit. As a result, if the 8th bit is stripped from text encoded in KOI8-R and the result is interpreted as ASCII, the text still might be intelligible to a Russian reader.
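To see the effect, here is a short sketch using Python's built-in `koi8_r` codec; the helper `strip_high_bit` is a hypothetical stand-in for whatever a 7-bit-only system would have done to the bytes:

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear the 8th bit of every byte, as a 7-bit channel would."""
    return bytes(b & 0x7F for b in data)


koi8 = "Привет".encode("koi8_r")          # "Hello" in Russian
mangled = strip_high_bit(koi8).decode("ascii")
print(mangled)  # pRIWET
```

The case comes out inverted (lowercase Cyrillic lands on uppercase Roman letters), but the result still reads as a rough transliteration.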
Unicode. Some discussion of “what is a letter?” How do you convert ‘almost Roman’ letters, like ‘Å’ or ‘Ü’, to ASCII? How do you sort strings that contain such characters? Answer: Alphabetization is simply locale dependent.
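A rough illustration of why this is hard: sorting by raw code point puts ‘Ü’ after ‘Z’, which matches no dictionary order a German reader would expect. The `fold` helper below is an assumption for demonstration only. It decomposes each character and drops the combining marks, which is a lossy approximation and not real locale-aware collation.

```python
import unicodedata


def fold(s: str) -> str:
    """Crude ASCII-ish folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


words = ["Zebra", "Über", "Apfel"]

by_code_point = sorted(words)           # 'Ü' (U+00DC) sorts after 'Z'
by_folded_key = sorted(words, key=fold)  # treats 'Ü' like 'U'

print(by_code_point)  # ['Apfel', 'Zebra', 'Über']
print(by_folded_key)  # ['Apfel', 'Über', 'Zebra']
```

Folding gives a plausible order for German but would be wrong for Swedish, where ‘Å’ is its own letter at the end of the alphabet, which is the point about locale dependence.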
Discussion of Unicode normalization. For example, Unicode has `ç` (U+00E7 Latin Small Letter C with Cedilla), but you can also make the same thing by combining `c` (U+0063 Latin Small Letter C) with `◌̧` (U+0327 Combining Cedilla). The former has a UTF-8 encoding of `0xC3A7`, while the latter has a UTF-8 encoding of `0x63CCA7`. So how do you do string equality when glyphs have multiple representations? Answer: You have to pick a “normalization”.
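A brief sketch of the cedilla example using Python's standard `unicodedata` module. The choice of NFC here is an assumption for illustration (NFD would work equally well for comparison, as long as both sides use the same form), and `nfc_equal` is a made-up helper name:

```python
import unicodedata

precomposed = "\u00e7"   # ç as a single code point
combined = "c\u0327"     # c followed by a combining cedilla

print(precomposed.encode("utf-8").hex())  # c3a7
print(combined.encode("utf-8").hex())     # 63cca7
print(precomposed == combined)            # False: different code points


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(nfc_equal(precomposed, combined))   # True
```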