Subject: Why are people in the archives talking about Pokйmon
Author: Tomash
Posted on: 2016-07-30 04:07:00 UTC
(and in general, what's going on with the weird characters)
I'm going to start with a computer history lesson before I admit to silly mistakes made years ago. Partly this is because I'm interested in all this stuff generally. Scroll rapidly if you don't mind being slightly confused.
Internally, computers have to represent the letters we post on the Board as numbers in order to do anything with them. This causes a problem: how do we assign numbers to letters? If two computers don't agree on what the numbers they're sending each other represent, you can get weird garbage text.
In the olden days, most of the English-speaking world standardized on ASCII (after some politics), which uses numbers between 0 and 2^7 - 1 (127) to represent the characters available on a standard US keyboard and some other stuff.
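If you want to poke at the mapping yourself, Python will happily show you the numbers (Python is just my pick for illustration throughout this post):

    # Every character on a standard US keyboard gets an ASCII number
    # between 0 and 127.
    print(ord('A'))   # 65
    print(ord('$'))   # 36
    print(chr(97))    # a
    print(max(ord(c) for c in "Hello, Board!"))  # 114 -- all under 128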
This choice posed a problem for people who wrote with letters other than A-Z, or who didn't use $ for their currency, etc. They obviously wanted the computer to be able to use their characters, and they also wanted to talk with English-speaking people (and/or use their programs) without too much pain.
To solve this problem (at least around Europe[1]), people realized that bytes could go up to 255, and ASCII stopped at 127. This meant that they could fit any missing symbols they needed between 128 and 255.
Now this was a great idea, except that you couldn't fit all the missing letters into the 128 available slots. So, people in different parts of the world simply put in the symbols they cared about and ignored everyone else. And a great confusion was thereby loosed upon the Earth. If you took an email written in one of the Russian encodings and interpreted the bytes with a Western European encoding, it would look like garbage (except for the English). In the worst case, you would have to keep reinterpreting a document with every encoding you could think of until it made sense. Worse, you couldn't mix, say, Greek and Dutch in one document. Madness!
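Here's a tiny Python demonstration of that garbage in action (the specific encodings are just my pick for the example):

    # Write some Russian with KOI8-R, one of the Russian 8-bit encodings...
    data = "Привет".encode("koi8_r")
    print(data.decode("koi8_r"))    # Привет -- fine, both sides agree
    # ...then read the same bytes with Latin-1, a Western European encoding.
    print(data.decode("latin-1"))   # something like ðÒÉ×ÅÔ -- garbage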
The solution to multiple competing standards is, of course, another standard. A while back, the Unicode Consortium decided that the problem was this one-byte thing, and published Unicode. The theory behind Unicode is simple: any character that people have used to write text, or any character that has appeared in a character encoding used by computers somewhere[2], should be given a number (or sequence of numbers, in some cases). The committee has given themselves the numbers between 0 and 2^21 - 1 to work with, and they've got plenty of room left.
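In Python terms, for example:

    # Unicode hands every character a number (a "code point"),
    # whatever script it comes from.
    print(ord('A'))    # 65 -- matches ASCII on purpose
    print(ord('é'))    # 233
    print(ord('Ω'))    # 937
    print(ord('😀'))   # 128512, or 0x1F600 -- even emoji get numbers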
Then, there was the question of how to represent this Unicode thing in computers. Just storing a sequence of 4-byte numbers seemed wasteful to English speakers (especially since their text would be three-quarters zero bytes), which would have killed the project. It also wasn't compatible with ASCII. The clever solution that emerged is called UTF-8, which uses the byte values 128-255 to represent all of Unicode while keeping every piece of ASCII text valid UTF-8[3].
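You can watch this happen from Python, for instance:

    # Plain ASCII text encodes to the exact bytes it always had...
    print("Hello".encode("utf-8"))   # b'Hello'
    # ...while anything above 127 becomes a multi-byte sequence built
    # from the bytes 128-255.
    print("é".encode("utf-8"))       # b'\xc3\xa9' -- two bytes
    print("😀".encode("utf-8"))      # b'\xf0\x9f\x98\x80' -- four bytes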
Now on to our problems here on the Board. We (obviously) mostly post in ASCII, which is contained in most legacy European encodings and also in UTF-8. The Board declares that the pages it sends us are in an encoding called "Windows-1252", which adds several symbols and many accented Latin letters on top of ASCII.
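A couple of examples of the extra stuff, again via Python:

    # Windows-1252 is ASCII plus accented letters and symbols in 128-255.
    print(b'\x93hi\x94'.decode('cp1252'))  # “hi” -- the infamous smart quotes
    print(b'\x99'.decode('cp1252'))        # ™
    print(b'\xc0'.decode('cp1252'))        # À -- one of many accented letters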
However, our posts are all over the place. Sometimes, we post our non-English things in UTF-8. Sometimes, we post in Windows-1252. This is all up to our web browsers, and we can't really do much about it as far as I can tell.
So, when I was scraping the Board, I had to deal with all this. T-Board, like a lot of recent software, stores all its text as UTF-8-encoded Unicode, which meant I had to convert the Board posts to UTF-8. However, this would get weird if a post was already UTF-8. I decided to try to interpret every post as Windows-1252 and re-encode it as UTF-8; if that process didn't work, the text was already in UTF-8, and we could move on. Failed applications of this rule are where all the Â and В come from, since some UTF-8 is also valid Windows-1252. If you want to fix them, see the runnable one-liner I created at this link.
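In Python, the rule boils down to something like this sketch (the function name and details here are for illustration, not the actual archive code):

    def to_utf8(raw: bytes) -> bytes:
        try:
            # Try to read the bytes as Windows-1252, re-encode as UTF-8.
            return raw.decode("cp1252").encode("utf-8")
        except UnicodeDecodeError:
            # Windows-1252 leaves a few bytes (0x81, 0x8D, ...) unassigned;
            # if we hit one, assume the post was UTF-8 all along.
            return raw

    # The failure mode: UTF-8 bytes that also happen to be valid
    # Windows-1252 get "converted" a second time, which is where the
    # stray Â characters come from.
    print(to_utf8("Pokémon".encode("utf-8")).decode("utf-8"))  # PokÃ©mon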
This all sounds straightforward, but there was a problem. When I wrote the archive script, I made a one-character typo in the encoding-related code. This resulted in, among other things, our love of "Pokйmon" instead of "Pokémon". I have fixed the results of this mess in my copy of the Board scrape, and will probably be posting re-encoded archives with the 2016-2017 archive season (that is, next summer)[4].
Now, I would like to offer a plate of session cookies to the first person that figures out what typo I made. Hint: look at row E of this chart and also this chart.
tl;dr: Tomash is a huge nerd, and also makes silly mistakes. Programming can be tricky, folks.
[1]: I don't know much about historical Asian, Arabic, etc. encodings, so I don't really know what they came up with.
[2]: This is so that you can go from encoding X to Unicode to X to Unicode ... without losing any information, ever.[2.1]
[2.1]: This is where emoji came from. Japanese cell phone carriers encoded emoji, each in their own way, and the Unicode committee eventually promulgated a standard that all the carriers could agree on. Then, smartphone vendors had to support the Unicode group's decision, the characters became available on Western phones out of sheer laziness, and here we are.
[3]: This meant many Asian scripts went from two bytes per character (in their older national encodings) to three (in UTF-8), but you can't please everyone, sadly.
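A quick Python illustration, with Shift JIS standing in for the legacy two-byte encodings:

    # A character that took two bytes in a legacy encoding takes three in UTF-8.
    print(len("日".encode("shift_jis")))  # 2
    print(len("日".encode("utf-8")))      # 3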
[4]: Unless someone really wants them now. In that case, please ask.