Plain Text (Dylan Beattie)

(Actual technical content starts ~10:30.) I am going into this talk expecting it to be roughly equivalent to Joel Spolsky’s classic “There Ain’t No Such Thing As Plain Text”.
In ASCII, you can transform a character that represents a digit into its integer value by masking off the top four bits, which leaves only the low four.

ASCII value	Binary	Digit
48	`0b0011 0000`	0
49	`0b0011 0001`	1
50	`0b0011 0010`	2
51	`0b0011 0011`	3
52	`0b0011 0100`	4
53	`0b0011 0101`	5
54	`0b0011 0110`	6
55	`0b0011 0111`	7
56	`0b0011 1000`	8
57	`0b0011 1001`	9
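
The table above can be checked with a few lines of Python. This is only a minimal sketch: the function name `ascii_digit_value` is made up for illustration, and the range check is an added assumption, since the mask alone would happily return a number for non-digit characters too.

```python
def ascii_digit_value(ch: str) -> int:
    """Return the integer value of an ASCII digit by masking its high bits."""
    code = ord(ch)
    # ASCII digits occupy 0x30..0x39; the mask is only meaningful in that range.
    if not 0x30 <= code <= 0x39:
        raise ValueError(f"not an ASCII digit: {ch!r}")
    # Clear the top four bits (0b0011 xxxx -> 0b0000 xxxx), keeping the low nibble.
    return code & 0x0F


print([ascii_digit_value(c) for c in "0123456789"])
```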

In ASCII, uppercase and lowercase letters differ only in the 6th bit (value 0x20, counting from the least significant bit), so case-insensitive comparison can be done by simply masking the 6th bit.

ASCII value	Binary	Value
65	`0b0100 0001`	A
97	`0b0110 0001`	a
66	`0b0100 0010`	B
98	`0b0110 0010`	b
67	`0b0100 0011`	C
99	`0b0110 0011`	c
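
A rough sketch of that comparison in Python follows. The helper names are hypothetical. One detail is my own addition rather than something from the talk: the mask is applied only to bytes that are actually letters, because clearing the bit blindly would also make pairs such as `@` (0x40) and `` ` `` (0x60) compare equal.

```python
CASE_BIT = 0x20  # the single bit that separates 'A' (0x41) from 'a' (0x61)


def fold_ascii_letter(code: int) -> int:
    """Clear the case bit, but only if the result is an uppercase ASCII letter."""
    upper = code & ~CASE_BIT
    if 0x41 <= upper <= 0x5A:  # 'A'..'Z'
        return upper
    return code


def ascii_equal_ignore_case(a: bytes, b: bytes) -> bool:
    """Compare two ASCII byte strings, ignoring letter case."""
    if len(a) != len(b):
        return False
    return all(fold_ascii_letter(x) == fold_ascii_letter(y) for x, y in zip(a, b))


print(ascii_equal_ignore_case(b"Plain Text", b"pLAIN tEXT"))
```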

Many information systems would strip the 8th bit from data they believed to be ASCII, which was a problem for other text encodings that used the 8th bit. The USSR developed an encoding, KOI8-R (Код Обмена Информацией, 8 бит: “Code for Information Exchange, 8 bit”), which tended to place each Cyrillic letter at the same position as a Roman letter with a similar sound, differing only in the 8th bit. As a result, if the 8th bit is stripped from text encoded in KOI8-R and the result is interpreted as ASCII, the text might still be intelligible to a Russian reader.
Reference to classic Russian mojibake package story. https://www.tumblr.com/wizardishungry/31884397775/an-image-of-a-post-envelope-with-address-written
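
The 8th-bit stripping behaviour of KOI8-R can be reproduced with Python's built-in `koi8_r` codec. This is a small sketch, with `strip_eighth_bit` being a made-up helper name that stands in for a 7-bit-only mail or transport system.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel by clearing the high bit of every byte."""
    return bytes(b & 0x7F for b in data)


original = "Русский Текст"
encoded = original.encode("koi8_r")
# What a system that assumes ASCII would show after dropping the 8th bit.
mangled = strip_eighth_bit(encoded).decode("ascii")
print(mangled)
```

The output is a Latin transliteration with the letter case swapped, because KOI8-R puts lowercase Cyrillic where the uppercase Latin letters would be once the high bit is cleared, and vice versa.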
Unicode. Some discussion of “what is a letter?” How do you convert ‘almost Roman’ letters, like ‘Å’ or ‘Ü’, to ASCII? How do you sort strings that contain such characters? Answer: Alphabetization is simply locale dependent.
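
To make the locale dependence concrete, here is a toy Python sketch. It does not use real locale data: the two sort keys are hand-written simplifications that I am assuming for illustration (Swedish placing å, ä, ö after z; German dictionary order treating ä, ö, ü as a, o, u), and `swedish_key` only handles the letters listed in its alphabet. Real code would use a proper collation library.

```python
# Hypothetical, simplified collation rules for illustration only.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def swedish_key(word: str) -> list[int]:
    """Rank each letter by its position in the Swedish alphabet."""
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]


def german_key(word: str) -> str:
    """Treat umlauted vowels as their base letters, as in dictionary order."""
    return word.lower().translate(GERMAN_FOLD)


words = ["Zorn", "Ärger", "Apfel"]
print(sorted(words, key=swedish_key))
print(sorted(words, key=german_key))
```

The same three words come out in a different order under each key, which is the whole point: there is no single correct sort order without knowing the locale.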
Discussion of Unicode normalization. For example, Unicode has ç (U+00E7 Latin Small Letter C with Cedilla), but you can also make the same glyph by combining c (U+0063 Latin Small Letter C) with ̧ (U+0327 Combining Cedilla). The former has a UTF-8 encoding of 0xC3A7 (two bytes), while the latter has a UTF-8 encoding of 0x63CCA7 (three bytes). So how do you do string equality when glyphs have multiple representations? Answer: You have to pick a “normalization” and apply it to both strings before comparing.
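
The ç example can be tried directly with Python's standard `unicodedata` module. This is a minimal sketch: `nfc_equal` is a hypothetical helper name, and choosing NFC rather than NFD (or the compatibility forms) is just one reasonable option, not something the talk prescribes.

```python
import unicodedata

precomposed = "\u00e7"   # ç as a single code point
decomposed = "c\u0327"   # c followed by a combining cedilla

# Different code points, and therefore different UTF-8 bytes.
print(precomposed.encode("utf-8").hex(), decomposed.encode("utf-8").hex())
print(precomposed == decomposed)


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(nfc_equal(precomposed, decomposed))
```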
Overall, not much new for me in this talk. I don’t remember Spolsky’s article well enough to say if the talk has anything that is not in the article.

Jon Shea
