Subject: Why are people in the archives talking about Pokйmon
Author: Tomash
Posted on: 2016-07-30 04:07:00 UTC
(and in general, what's going on with the weird characters)
I'm going to start with a computer history lesson before I admit to silly mistakes made years ago. Partly this is because I'm interested in all this stuff generally. Scroll rapidly if you don't mind being slightly confused.
Internally, computers have to represent the letters we post on the Board as numbers in order to do anything with them. This causes a problem: how do we assign numbers to letters? If two computers don't agree on what the numbers they're sending each other represent, you can get weird garbage text.
In the olden days, most of the English-speaking world standardized on ASCII (after some politics), which uses numbers between 0 and 2^7 - 1 (127) to represent the characters available on a standard US keyboard and some other stuff.
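If you want to poke at the mapping yourself, Python will happily show you the numbers (Python is just my pick for illustration throughout this post):

    # Every character on a standard US keyboard gets an ASCII number
    # between 0 and 127.
    print(ord('A'))   # 65
    print(ord('$'))   # 36
    print(chr(97))    # a
    print(max(ord(c) for c in "Hello, Board!"))  # 114 -- all under 128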
This choice posed a problem for people who wrote with letters other than A-Z, or who didn't use $ for their currency, etc. They obviously wanted the computer to be able to use their characters, and they also wanted to talk with English-speaking people (and/or use their programs) without too much pain.
To solve this problem (at least around Europe[1]), people realized that bytes could go up to 255, and ASCII stopped at 127. This meant that they could fit any missing symbols they needed between 128 and 255.
Now this was a great idea, except that you couldn't fit all the missing letters into the 128 available slots. So, people in different parts of the world simply put in the symbols they cared about and ignored everyone else. And a great confusion was thereby loosed upon the Earth. If you took an email written in one of the Russian encodings and interpreted the bytes with a Western European encoding, it would look like garbage (except for the English). In the worst case, you would have to keep reinterpreting a document with every encoding you could think of until it made sense. Worse, you couldn't mix, say, Greek and Dutch in one document. Madness!
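Here's a tiny Python demonstration of that garbage in action (the specific encodings are just my pick for the example):

    # Write some Russian with KOI8-R, one of the Russian 8-bit encodings...
    data = "Привет".encode("koi8_r")
    print(data.decode("koi8_r"))    # Привет -- fine, both sides agree
    # ...then read the same bytes with Latin-1, a Western European encoding.
    print(data.decode("latin-1"))   # something like ðÒÉ×ÅÔ -- garbage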
The solution to multiple competing standards is, of course, another standard. A while back, the Unicode Consortium decided that the problem was this one-byte thing, and published Unicode. The theory behind Unicode is simple: any character that people have used to write text, or any character that has appeared in a character encoding used by computers somewhere[2], should be given a number (or sequence of numbers, in some cases). The committee has given themselves the numbers between 0 and 2^21 - 1 to work with, and they've got plenty of room left.
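In Python terms, for example:

    # Unicode hands every character a number (a "code point"),
    # whatever script it comes from.
    print(ord('A'))    # 65 -- matches ASCII on purpose
    print(ord('é'))    # 233
    print(ord('Ω'))    # 937
    print(ord('😀'))   # 128512, or 0x1F600 -- even emoji get numbers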
Then, there was the question of how to represent this Unicode thing in computers. Just storing a sequence of 4-byte numbers seemed wasteful to English speakers (especially since their text would be three-quarters zero bytes), which would have killed the project. It also wasn't compatible with ASCII. The clever solution that emerged is called UTF-8, which uses the byte values 128-255 to represent all of Unicode while keeping every piece of ASCII text valid UTF-8[3].
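You can watch this happen from Python, for instance:

    # Plain ASCII text encodes to the exact bytes it always had...
    print("Hello".encode("utf-8"))   # b'Hello'
    # ...while anything above 127 becomes a multi-byte sequence built
    # from the bytes 128-255.
    print("é".encode("utf-8"))       # b'\xc3\xa9' -- two bytes
    print("😀".encode("utf-8"))      # b'\xf0\x9f\x98\x80' -- four bytes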
Now on to our problems here on the Board. We (obviously) mostly post in ASCII, which is contained in most legacy European encodings and also in UTF-8. The Board declares that the pages it sends us are in an encoding called "Windows-1252", which adds several symbols and many accented Latin letters on top of ASCII.
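A couple of examples of the extra stuff, again via Python:

    # Windows-1252 is ASCII plus accented letters and symbols in 128-255.
    print(b'\x93hi\x94'.decode('cp1252'))  # “hi” -- the infamous smart quotes
    print(b'\x99'.decode('cp1252'))        # ™
    print(b'\xc0'.decode('cp1252'))        # À -- one of many accented letters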
However, our posts are all over the place. Sometimes, we post our non-English things in UTF-8. Sometimes, we post in Windows-1252. This is all up to our web browsers, and we can't really do much about it as far as I can tell.
So, when I was scraping the Board, I had to deal with all this. T-Board, like a lot of recent software, stores all its text as UTF-8-encoded Unicode, which meant I had to convert the Board posts to UTF-8. However, this would get weird if a post was already UTF-8. I decided to try to interpret every post as Windows-1252 and re-encode it as UTF-8; if that process didn't work, the text was already in UTF-8, and we could move on. Failed applications of this rule are where all the Â and В come from, since some UTF-8 is also valid Windows-1252. If you want to fix them, see the runnable one-liner I created at this link.
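In Python, the rule boils down to something like this sketch (the function name and details here are for illustration, not the actual archive code):

    def to_utf8(raw: bytes) -> bytes:
        try:
            # Try to read the bytes as Windows-1252, re-encode as UTF-8.
            return raw.decode("cp1252").encode("utf-8")
        except UnicodeDecodeError:
            # Windows-1252 leaves a few bytes (0x81, 0x8D, ...) unassigned;
            # if we hit one, assume the post was UTF-8 all along.
            return raw

    # The failure mode: UTF-8 bytes that also happen to be valid
    # Windows-1252 get "converted" a second time, which is where the
    # stray Â characters come from.
    print(to_utf8("Pokémon".encode("utf-8")).decode("utf-8"))  # PokÃ©mon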
This all sounds straightforward, but there was a problem. When I wrote the archive script, I made a one-character typo in the encoding-related code. This resulted in, among other things, our love of "Pokйmon" instead of "Pokémon". I have fixed the results of this mess in my copy of the Board scrape, and will probably be posting re-encoded archives with the 2016-2017 archive season (that is, next summer)[4].
Now, I would like to offer a plate of session cookies to the first person that figures out what typo I made. Hint: look at row E of this chart and also this chart.
tl;dr: Tomash is a huge nerd, and also makes silly mistakes. Programming can be tricky, folks.
[1]: I don't know much about historical Asian, Arabic, etc. encodings, so I don't really know what they came up with.
[2]: This is so that you can go from encoding X to Unicode to X to Unicode ... without losing any information, ever.[2.1]
[2.1]: This is where emoji came from. Japanese cell phone carriers encoded emoji, each in their own way, and the Unicode committee eventually promulgated a standard that all the carriers could agree on. Then, smartphone vendors had to support the Unicode group's decision, the characters became available on Western phones out of sheer laziness, and here we are.
[3]: This meant many Asian scripts went from two bytes per character (in their older national encodings) to three (in UTF-8), but you can't please everyone, sadly.
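A quick Python illustration, with Shift JIS standing in for the legacy two-byte encodings:

    # A character that took two bytes in a legacy encoding takes three in UTF-8.
    print(len("日".encode("shift_jis")))  # 2
    print(len("日".encode("utf-8")))      # 3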
[4]: Unless someone really wants them now. In that case, please ask.