(Actual technical content starts ~10:30.) I am going into this talk expecting it to be roughly equivalent to Joel Spolsky’s classic “There Ain’t No Such Thing As Plain Text”.
In ASCII, you can transform a digit character into its integer value by masking off the top four bits, leaving only the low four.
| ASCII value | Binary | Digit |
|---|---|---|
| 48 | 0b0011 0000 | 0 |
| 49 | 0b0011 0001 | 1 |
| 50 | 0b0011 0010 | 2 |
| 51 | 0b0011 0011 | 3 |
| 52 | 0b0011 0100 | 4 |
| 53 | 0b0011 0101 | 5 |
| 54 | 0b0011 0110 | 6 |
| 55 | 0b0011 0111 | 7 |
| 56 | 0b0011 1000 | 8 |
| 57 | 0b0011 1001 | 9 |
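
The masking trick in the table can be sketched in a few lines of Python. The function name `digit_value` is my own, not something from the talk, and the range check is an addition so that non-digit input fails loudly instead of returning garbage.

```python
def digit_value(ch: str) -> int:
    """Return the integer value of a single ASCII digit character."""
    if len(ch) != 1 or not ("0" <= ch <= "9"):
        raise ValueError(f"not an ASCII digit: {ch!r}")
    # '0'..'9' are 0x30..0x39, so the low four bits are the digit itself.
    return ord(ch) & 0x0F


print([digit_value(c) for c in "0123456789"])
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```
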

| ASCII value | Binary | Character |
|---|---|---|
| 65 | 0b0100 0001 | A |
| 97 | 0b0110 0001 | a |
| 66 | 0b0100 0010 | B |
| 98 | 0b0110 0010 | b |
| 67 | 0b0100 0011 | C |
| 99 | 0b0110 0011 | c |
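
Reading the letter table row by row, each uppercase/lowercase pair differs in exactly one bit (0b0010 0000, i.e. 0x20). Assuming that is the point the table is making, a rough sketch of case conversion by bit manipulation looks like this. `CASE_BIT`, `ascii_upper`, and `ascii_lower` are names I made up, and the code only touches the 26 ASCII letters.

```python
CASE_BIT = 0x20  # the single bit that differs between 'A' (0x41) and 'a' (0x61)


def ascii_upper(text: str) -> str:
    """Clear the case bit on ASCII lowercase letters; leave everything else alone."""
    return "".join(
        chr(ord(c) & ~CASE_BIT) if "a" <= c <= "z" else c for c in text
    )


def ascii_lower(text: str) -> str:
    """Set the case bit on ASCII uppercase letters; leave everything else alone."""
    return "".join(
        chr(ord(c) | CASE_BIT) if "A" <= c <= "Z" else c for c in text
    )


print(ascii_upper("abc"))  # ABC
print(ascii_lower("ABC"))  # abc
```
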
Many information systems would strip the 8th bit from data they believed to be ASCII, which was a problem for other text encodings that used the 8th bit. The USSR developed an encoding, KOI8-R (Код Обмена Информацией, 8 бит), which tended to place each Cyrillic letter at a position whose low seven bits match a Roman letter with a similar sound. As a result, if the 8th bit is stripped from text encoded in KOI8-R and the result is interpreted as ASCII, the text might still be intelligible to a Russian reader.
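
To see the KOI8-R trick in action, here is a small sketch using Python's built-in `koi8_r` codec. The helper name `strip_high_bit` is mine. Note that the letter case comes out flipped: lowercase Cyrillic lands on uppercase Roman letters.

```python
def strip_high_bit(text: str) -> str:
    """Encode text as KOI8-R, clear bit 7 of every byte, and read it back as ASCII."""
    koi8_bytes = text.encode("koi8_r")
    stripped = bytes(b & 0x7F for b in koi8_bytes)
    return stripped.decode("ascii")


print(strip_high_bit("мир"))  # MIR
```
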
Unicode. Some discussion of “what is a letter?” How do you convert ‘almost Roman’ letters, like ‘Å’ or ‘Ü’, to ASCII? How do you sort strings that contain such characters? Answer: Alphabetization is simply locale-dependent.
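
A toy illustration of why sort order depends on locale. This is not how a real collation library works; the two key functions below are hand-rolled assumptions of mine: `swedish_key` treats å, ä, ö as separate letters after z, while `fold_key` ignores the diacritics entirely, roughly in the spirit of German dictionary ordering. The same three words come out in three different orders.

```python
import unicodedata

# Assumption: a simplified Swedish alphabet, with å, ä, ö after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
# Assumption: a simplified "ignore the diacritics" rule.
FOLD_TABLE = str.maketrans("åäöü", "aaou")


def swedish_key(word: str) -> list[int]:
    """Rank each letter by its position in the Swedish alphabet.

    Raises ValueError for characters outside that alphabet.
    """
    word = unicodedata.normalize("NFC", word).lower()
    return [SWEDISH_ALPHABET.index(ch) for ch in word]


def fold_key(word: str) -> str:
    """Sort accented letters as if they were their base letters."""
    word = unicodedata.normalize("NFC", word).lower()
    return word.translate(FOLD_TABLE)


words = ["ärm", "ål", "zon"]
print(sorted(words))                   # ['zon', 'ärm', 'ål']  (raw code point order)
print(sorted(words, key=swedish_key))  # ['zon', 'ål', 'ärm']
print(sorted(words, key=fold_key))     # ['ål', 'ärm', 'zon']
```

In real code you would reach for a proper collation implementation rather than tables like these.
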
Discussion of Unicode normalization. For example, Unicode has ç (U+00E7 Latin Small Letter C with Cedilla), but you can also make the same thing by combining c (U+0063 Latin Small Letter C) with ̧ (U+0327 Combining Cedilla). The former has a UTF-8 encoding of 0xC3 0xA7, while the latter has a UTF-8 encoding of 0x63 0xCC 0xA7. So how do you do string equality when glyphs have multiple representations? Answer: You have to pick a “normalization” (such as NFC or NFD) and apply it to both strings before comparing.
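
The ç example can be reproduced with Python's standard `unicodedata` module. The helper `same_text` is a name I made up, and defaulting to NFC is just one reasonable choice; what matters is that both sides get the same form.

```python
import unicodedata

precomposed = "\u00e7"   # ç as a single code point
combining = "c\u0327"    # c followed by a combining cedilla

print(precomposed.encode("utf-8").hex())  # c3a7
print(combining.encode("utf-8").hex())    # 63cca7
print(precomposed == combining)           # False


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_text(precomposed, combining))  # True
```
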