Regular Expressions are awesome! Or, what I did.

Subject: Regular Expressions are awesome! Or, what I did.
Author: Thoth
Posted on: 2018-07-16 23:55:00 UTC

I was one of the people working on Lost Tales, and I primarily did automatic conversions. I did these conversions using a tool called "regular expressions," which I thought might interest some of the more technically minded among the boarders. Well, the technically minded boarders who don't already know what regular expressions are. I get the feeling my audience will be vanishingly small... we'll see. Anyways, regular expressions can basically be described as text search on crack. And also search and replace on crack. Sound interesting? Read on.

Normal search functions allow you to search for a precise fragment of text. Maybe you can search out all the instances of "Hello sweetie" in a document. Maybe case-insensitive search if you're really fancy. Now look at this:

/ ... /

That, believe it or not, was a regular expression. They're typically surrounded by slashes when we write them out like this, for reasons which would take too long to explain. Anyways, a dot is regular expression for "any character". So this will find any three characters (doesn't matter what they are: punctuation, letters, numbers, whitespace, whatever) surrounded by spaces. Cool, huh? No? It's basically useless? Okay, try this:
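For anyone who wants to poke at this, here is a minimal sketch of the same search in Python, assuming its built-in re module. The language choice and the sample sentence are purely for illustration and have nothing to do with the actual Lost Tales conversions.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "I saw a cat, a dog and 42! birds."

# A space, any three characters, then another space. In Python each
# dot matches any single character except a newline by default.
matches = re.findall(r" ... ", text)

print(matches)  # [' saw ', ' dog ', ' 42! ']
```

Note that " and " is not reported: matches can't overlap, and the space in front of "and" was already used up as the closing space of " dog ".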

/[0-9]{3}-[0-9]{3}-[0-9]{4}/

This will find any standard-written, dash-separated phone number in a document. How does it work? Well, those brackets ("[" and "]") signify a character group, that is, a list of characters where we want any one of them. You can also specify ranges of characters. In this case, I specified the range 0-9. So any digit. The braces say how many we want: {3,5}, for example, specifies that we want between 3 and 5 of whatever came before it. In this case, I gave {3}, and what came before was a character group meaning digits. So we want three digits, then a dash, then three digits, then a dash, then four digits. Simple, no?

Actually, wanting digits is so common there's a neat shorthand: \d. So the previous can be re-written as:

/\d{3}-\d{3}-\d{4}/

Much simpler.
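Both spellings can be tried side by side. The following is only a sketch in Python (an assumed choice of language, using its standard re module), and the phone numbers are made-up placeholders.

```python
import re

# Made-up sample text; the numbers are placeholders.
text = "Call 555-867-5309 or 555-123-4567, but not 55-1234-567."

# The long form with explicit ranges, and the \d shorthand.
long_form = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")
short_form = re.compile(r"\d{3}-\d{3}-\d{4}")

found_long = long_form.findall(text)
found_short = short_form.findall(text)

print(found_long)   # ['555-867-5309', '555-123-4567']
print(found_short)  # ['555-867-5309', '555-123-4567']
```

One caveat about this particular tool: in Python 3, \d on ordinary text also matches digits from other scripts, so the two patterns only agree exactly on plain ASCII input like the sample above.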

"But Thoth!" I hear you cry. "What if I want to match a literal period? Or maybe I want a literal brace, or something?" Fear not, my friends. Just stick a backslash in front of the character you want. So:

/\.\{\*/

Will search for ".{*" exactly. Right, here's a real regex I used (sed ERE, for those of you who already know what's up and are wondering) to lowercase all the HTML tags in Lost Tales. Okay, not the actual one, that one had a really nasty bug. This is a version without that bug:

s/<(\/?)([A-Z]+)([^>]*)>/<\1\L\2\E\3>/g

Okay, that's terrifying. Let's take it apart. That 's' at the front isn't a regex thing, it just tells the tool I'm using that I want to do a search and replace, or a "substitution." The g at the back, likewise, tells the tool that I want to replace every piece of text that matches, not just the first. That forward slash in the middle separates what I'm searching for from what I'm replacing it with. When we take all that junk out, and just look at what I'm searching for, we get this:

/<(\/?)([A-Z]+)([^>]*)>/

Okay, that's terrifying. Let's break it down.

<

This just says to find the literal character <. That's it.

(\/?)

Okay, these parentheses mark out a capture group. They're not literal parentheses, they're just here to group together parts of the expression. Also, every capture group is assigned a number. This is capture group 1. Which will only matter much later.

Anyways, what's going on inside? Well, the backslash is a literal escape. So we're looking for the literal character "/". And then there's a question mark.

The question mark means "maybe". So, we're saying there might be a slash, or there might not be. If there is, put it in group 1. If not, just put emptiness in group one.

([A-Z]+)

Group 2. We have another range of characters. This time, it's all the capital letters. The plus means "one or more". So we're looking for one or more capital letters, which we'll put in group two.

([^>]*)

Group 3. Another character set. But there's a caret at the start! That means that we actually are looking for the reverse. So, this character set matches anything that is not a ">". And the star means "zero or more." So, if there are any characters that aren't ">", stuff them in group three. If not, don't make a big deal about it.

>

This is just a ">". It means that there's a ">" here.

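So if we put it all together, this regular expression first looks for a "<". Then it looks for an optional "/" (which, present or not, makes up group one). Then it wants one or more capital letters (group two). After that, it takes any run of characters that aren't ">" (those characters, or lack thereof, make up group three). Finally, it wants a ">".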
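To see what actually lands in each group, here is a small sketch in Python (an assumed language for illustration; the post's own tool is sed). The HTML snippet is hypothetical. Python doesn't wrap its patterns in slashes, so the "/" needs no backslash here.

```python
import re

# Same search pattern as above, minus the slash delimiters.
tag_pattern = re.compile(r"<(/?)([A-Z]+)([^>]*)>")

# Hypothetical snippet of old-style uppercase HTML.
sample = '<A HREF="page.html">link</A>'

# Collect (group 1, group 2, group 3) for every tag found.
groups = [m.groups() for m in tag_pattern.finditer(sample)]

for slash, name, rest in groups:
    print(repr(slash), repr(name), repr(rest))
# ''  'A' ' HREF="page.html"'
# '/' 'A' ''
```

The opening tag has an empty group one and its attributes in group three; the closing tag has the slash in group one and an empty group three.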

Gee, I hope that makes sense. I'm kinda bad at explaining things. Anyways, here's what we're replacing all that with:

<\1\L\2\E\3>

Let's break this down too.

<

Okay, the first thing we're going to put in is a "<".

\1

So, what this means is "whatever was in group one of what you matched." If you'll recall, that was either a forward slash or nothing.

\L\2\E

\L means "turn anything between me and '\E' into lowercase." \2 is group 2, the uppercase letters.

\3

Just put down group 3. That is, recall, the stuff between the uppercase letters and the ">".

>

Put down a ">".

So, if we put it all together, this:

s/<(\/?)([A-Z]+)([^>]*)>/<\1\L\2\E\3>/g

Means, "Find every piece of text that starts with a '<', which may or may not be followed by a slash (remember if it is), proceeds with a series of capital letters (remember them), and then whatever else (but remember what's there), followed by a '>', and replace it with a '<', and then a slash if there was one originally, and then that series of capital letters, but lowercase, then whatever came after, and then a '>'."

Phew, that's a mouthful. Aren't you glad you can just write that mess of punctuation instead?
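For the curious, the whole substitution can be approximated outside sed as well. The sketch below uses Python, which is an assumption made for illustration and not the tool used on Lost Tales. Python's re module has no \L...\E in replacement text, so a small replacement function is swapped in to do the lowercasing. The function names and the sample HTML are hypothetical.

```python
import re

# Same search pattern as the sed command, minus the slash delimiters.
tag_pattern = re.compile(r"<(/?)([A-Z]+)([^>]*)>")


def lowercase_tag(match):
    """Rebuild one matched tag with its name (group 2) lowercased."""
    slash, name, rest = match.groups()
    return "<" + slash + name.lower() + rest + ">"


def lowercase_tags(html):
    """Apply the replacement to every match, like sed's trailing g."""
    return tag_pattern.sub(lowercase_tag, html)


# Hypothetical input, made up for this illustration.
sample = '<P>Some <B>bold</B> text and a <A HREF="x.html">link</A>.</P>'
result = lowercase_tags(sample)

print(result)
# <p>Some <b>bold</b> text and a <a HREF="x.html">link</a>.</p>
```

As with the sed version, only the tag name is lowercased. Anything in group three, such as the HREF attribute name, is copied through untouched.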

Now, regular expressions (which work more or less as I described; consult your local text editor/tool documentation for variations) don't work everywhere. However, they do work in most programmers' text editors (if you don't use one of those already, then on Windows, I suggest Notepad++, and on Mac I'd endorse Atom). If you work with HTML or other plaintext all day, you owe it to yourself to learn them, and there are a lot of excellent online resources to help you do just that.

Hopefully helpful and/or informatively yours,
Thoth.

Lost Tales CSS overhaul: Mostly Done! by Neshomeh on 2018-07-16 19:32:00 UTC Reply

Hey, everyone! Thanks to the efforts of the Lost Tales team, including Hieronymus Graubart, Tomash, and Thoth, PPC: The Lost Tales has now been almost completely recoded in CSS and has a new appearance!

The only section we haven't touched yet is Anamia's Corner, because it's a big, tangled mess and looking at the current code makes my eyes water.

SIELU, the JAAKSONS, and the PPC Handbook are as close to their original appearances as I could get them. There is necessarily some slight variation. Where possible, all pages have a link to their original version on Wayback Machine or elsewhere, down in the footer. (This should be the case across the site, but may not be yet. There are a lot of links to track down.) Feel free to check it out if you want to compare. If you have any pointers, I'd love to hear them.

Actually, any and all feedback would be greatly appreciated. Please tell me about broken links, things not displaying properly, typos, or anything else you notice. Even four people can't catch everything on a project this big.

~Neshomeh will get to Anamia's Corner this decade, she promises.

Aaaand done! by Neshomeh on 2018-07-19 22:20:00 UTC Reply

Anamia's Corner has been rendered into code that doesn't offend every sensibility I have! And it wasn't even as hard as I thought it would be! {= D

That's not to say there's nothing left that needs doing—again, with something this big, there are bound to be things we've missed—but that's down to the level of minor errors, inconsistencies, and things that just personally bug me. For practical purposes, the task is done. I might even see my way toward adding content again, since it will be so much easier now.

But not right away. Baron Hieronymus needs to pick out his new accommodations first. ^_^

~Neshomeh

...Which means I may have a project. by Thoth on 2018-07-20 02:42:00 UTC Reply

Because I see no reason why the index.html and the actual story files for any new additions to LT can't have some autogenerated aspects. I mean, we can't entirely autogenerate all of them, but extracting stories from HTML files and putting them in the new format should be semi-viable, and automatically creating a skeleton index.html (just a list of stories) and maybe even next/previous links should be reasonably simple. So I might do that.

~~Thoth has been working at a company that is almost actively hostile to scripting and automation and for crying out loud just wants things to make sense again.

Color me intrigued. by Neshomeh on 2018-07-20 16:04:00 UTC Reply

The less manual work I have to do, the more I can get done, so sure! I appreciate things that are both effective and efficient. Harper and Smith team-ups: always a good idea. ^_~

~Neshomeh

See-as-as? Retchulah ickspreshuns? What is this magic? by Hieronymus the hermit on 2018-07-19 21:31:00 UTC Reply

New Tales of Hieronymus

Previously in this series: The Stubborn Knight (revised version).

The Archivist’s New Apprentices

Out of nowhere – or probably: elsewhere – Hieronymus the hermit, part-time knight and archivist’s apprentice, appeared in the courtyard of Castle Archive. He stayed there for a minute, marveling at the beauty of the place. Then he turned to the portal that led to a hallway, a flight of stairs and – beyond the stairs, on the second floor – the corner where he used to store his staff and flask while he was at home. But he was stopped in his tracks, finding the portal locked and a message attached to the heavy wood:

Stairs and corners out of service due to renovation and refurnishing.
Please join us at Los Taelis.
We have a chamber prepared for you.

For once, the Kar'eer Forest stayed out of the way and Hieronymus did not get lost again; he crossed the Turaipod Heights rather smoothly and arrived at the fonts of the Kattekri-tri in no time. Fort Los Taelis had been established through contributions from many knights of olden times, conglomerated in ways that did not help to make navigating the place easy. Headquarters appeared to be the obvious starting point, but Hieronymus had no idea how to proceed from there to find Baron Neshomeh or any hints at his new chamber. He did not need to worry; help was on the way.

Out of nowhere – or probably, Genie space – the shape of a glowing red-gold fox appeared. "Welcome back, Sir Hieronymus," the Djenni said, with the echo of a distant whisper that implied speaking for the Lady. "We are currently rebuilding Los Taelis in a more consistent manner. The new magical signposts at Headquarters, based on a design proposed by Master Huinesoron, will help you find your way around."

The hermit had already wondered why Los Taelis looked so different from what he remembered from his previous visits. It was still mostly made of brick-sized books, with an occasional scroll used as a sill or lintel, but gables and thresholds appeared to be more ornamented in ways that made the buildings, despite their different functions, look more similar. And the whole thing looked unfinished, and somewhat alive, with bricks moving around, scrolls unrolling and rerolling themselves, and ornaments appearing in unlikely places, only to suddenly dash off to where they actually belonged.

While looking around, Hieronymus noticed a desk set amidst the central yard, covered in construction plans, sketches and scribbles. A nondescript figure sat on a stool at said desk, shrouded in magic.

"Did you meet Sir Thoth?" asked the Djenni, the echo gone from her voice. "He is in charge of everything that can be done automagical."

Sir Thoth hunched forward, his fingers moving rapidly over diagrams and formulae, and another series of ornamented bricks were hurled about, some right through the flinching hermit’s chest.

"Don’t worry," the Djenni said, baring her teeth in what probably was meant to be a smile. "Weaving his magic, Protector Tomash set up a multitude of parallel realities, so that you won’t get into each other’s way. In the end, all your achievements will be combined and moved to the true Los Taelis."

Hieronymus was so used to weird stuff happening to him, he did not even bother to check for any holes in his body or his robes. But something else worried him. "Protector Tomash? Sir Thoth? All the advanced magic and automation? Are you going to replace me?"

"Did you not hear us say ‘you’?" The whispering echo returned to the Djenni’s voice as she spoke. "There is still a lot of tedious manual work to do for our oldest apprentice."

DISCLAIMER: The Djenni belongs to Neshomeh and is used here based on non-objection. Baron Neshomeh, Baron Huinesoron the historian (and, in the context of this tale, a master archivist like Neshomeh), Protector Tomash, Sir Thoth and Hieronymus the hermit belong to the respective boarders they represent.

A/N: If Los Taelis had existed at the time, Hieronymus the hermit would have asked for a corner there rather than in Castle Arkive, so it was time to move. But since a Corner (Anamia’s) already exists at Los Taelis, he may deserve an accommodation upgrade.

Next in this series: The Baron on the Rock, Part 1, Part 2, which is actually Part 3 of these New Tales.

HG

I am Baron Neshomeh and I approve this story. by Neshomeh on 2018-07-20 16:18:00 UTC Reply

I love the description of Los Taelis and things magically floating around and dashing off. Very Hogwarts. ^_^

The one thing that confuses me is the magical signposts based on a design proposed by Huinesoron. What does that refer to? And in this context, what's Headquarters? (This is potentially just me being dense.)

Baron Hieronymus definitely needs better accommodations for when he's staying at Los Taelis. A proper chamber, perhaps. With an office. A nice desk, plenty of quills and ink for all that tedious clerking. {= ) In fact... since I've got a couple of wizards now, what would you say to being promoted to Head Cleric? Seems more fitting for your station.

~(Baron) Neshomeh

Is it the sortable table? by Huinesoron on 2018-07-20 16:55:00 UTC Reply

I know I've played around with those on the Wiki, and I have vague memories of maybe suggesting you could do it? Only I can't find any evidence of that, so maybe I also... didn't.

Regardless, Baron Huinesoron is happy to claim credit. It'll help him get over the fact that you finished redecorating before he did. (Look, I'm getting there...!)

hS

That's my best guess, too. by Neshomeh on 2018-07-21 00:12:00 UTC Reply

Though I'm also not sure if you suggested it, or if I got the idea from yours on the wiki, or what. I do know I eventually found a sort of pre-packaged SortTable Javascript out on the interwebs and incorporated it into the Songbook, which I had been wanting to do almost forever, and that's the same thing I have on TLT.

But sure, if Baron hS wants to take the credit for doing it first, that is technically correct. The very best kind of correct. ^_~

~Neshomeh

Yes, it’s the SorTable. by Hieronymus Graubart on 2018-07-21 22:18:00 UTC Reply

Huinesoron mentioned it on the Board here. Apparently my faulty memory told me that hS had answered you when it was actually Zingenmir.

"Headquarters" is meant to be the organizational center (the main page with the sortable index). Since Los Taelis is a military establishment in-universe, I racked my brain for a military-sounding word. In German, the obvious choice would be "Kommandantur". The only translations my dictionary suggests are "headquarters" or "commander’s office". I’m not happy with the former, but I dislike the latter more because it’s two words, I associate "office" with a room rather than a building and it implies the existence of a specific "commander" occupying that room (who would that be?).

Head Cleric? Head Clerk? Nah, too much honor. Although I’m a Baron now, I can’t tell Sir Thoth and Sir Tomash what to do (and in Real Life, I always preferred a staff position over a line position). Just being the oldest apprentice fits the hermit quite well for now.

HG

How about CIC? by Huinesoron on 2018-07-21 23:10:00 UTC Reply

Combat/Command Information Centre, aka 'what Battlestar calls their bridge'.

The 'command' variant would work better; only issue is that it really is intended for a combat situation.

'Operations' is also viable, or just 'Command'. That's valid use in English, where people protest outside Parliament for example.

hS

"Command" appears to be the word IÂ’m looking for. (nm) by Hieronymus Graubart on 2018-07-21 23:35:00 UTC Reply

... We could call it Landing. by Neshomeh on 2018-07-22 15:26:00 UTC Reply

The main page of a site is sometimes referred to as the landing page. Also, this appeals to the Pern fan in me. ^_^

The Head Clerk/Cleric wouldn't be in charge of the wizards. Just other clerks. Basically, you'd have the theoretical authority to delegate the tedious jobs I've delegated to you. Doesn't mean you'd have to use it, though.

Up to you, of course. {= )

~Neshomeh

I’ll go with Landing then. by Hieronymus Graubart on 2018-07-23 14:23:00 UTC Reply

And my inner Slytherin just realized that, if I ever wanted a promotion, I could just claim that Hieronymus the hermit is legitimately a journeyman, because he journeyed to Manyuel to learn from Master Huinesoron the Historian how to do the (insert plortification of mouseover text) and generally maintain a Cyclopedia.

But eternally being a humble apprentice may actually be more fun.

HG

Mayhap Journeyman? (nm) by Scapegrace on 2018-07-21 22:36:00 UTC Reply

That’s the obvious next step. by Hieronymus Graubart on 2018-07-21 23:56:00 UTC Reply

But we’re in a medieval setting, and thus the old tradition of actually journeying journeymen would require him to leave Master Neshomeh and find work with some other master archivists. And since these tales reflect real life and Master Huinesoron just turned down Thoth’s offer...

HG

... Stayingathomeman? (nm) by Scapegrace on 2018-07-23 11:20:00 UTC Reply

I'd be willing to help, if you need any, and I've time... (nm) by Sir Thoth, Wizard For Hire on 2018-07-20 17:09:00 UTC Reply

Your offer is appreciated. by Baron Huinesoron on 2018-07-20 18:54:00 UTC Reply

But I fear the work that is yet to do cannot be passed to others; it is down to filing and the craft of art.

(Also I'm stubborn as rocks and am danged well going to finish this insane project myself. Oh, the trouble I store up for myself...)

hS

Nice! by Huinesoron on 2018-07-18 12:09:00 UTC Reply

I especially approve of the links to the original pages - I should probably do that too, though at my workrate hahaha.

hS

I thought you'd like that. {= ) by Neshomeh on 2018-07-18 15:07:00 UTC Reply

There were a few things I did change because the old way hurt my brain and I thought being true to the intent and the content was more important than being true to crappy old HTML, but you are the voice in the back of my head I have to argue with anytime I make a decision like that. ^_~

So, I give you transparency in addition to making things better, and hopefully everyone is happy.

~Neshomeh

(Speaking of workrate, any chance of getting back to our cowrite anytime soon...?)

Oh, stars, cowrites. by Huinesoron on 2018-07-18 15:38:00 UTC Reply

You realise I still have an unfinished cowrite with Lily Winterwood, right? Come to think of it, the only reason Brown DragonRider is 'finished' is that I stuck a rough endcap on it and posted it. Cowrites are emphatically not my thing.

I'll try and have a poke at it. ;)

hS

That endcap was not 'rough'. It was 'adorable'. :V (nm) by S.M.F. on 2018-07-18 17:29:00 UTC Reply

Well, thank you. :) (nm) by Huinesoron on 2018-07-19 10:09:00 UTC Reply

Regular Expressions are awesome! Or, what I did. by Thoth on 2018-07-16 23:55:00 UTC Reply

Ooh, thank you! =D /is glad to be informed. (nm) by S.M.F. on 2018-07-17 00:24:00 UTC Reply

Separate issue found! by S.M.F. on 2018-07-16 20:09:00 UTC Reply

(Since the other matter seems to just be a loading issue...)

The back button on mission two of the TOS links back to itself, instead of the first mission.

Fixed. Thanks! (nm) by Neshomeh on 2018-07-16 20:25:00 UTC Reply

The background style feels like home. <3 by S.M.F. on 2018-07-16 19:58:00 UTC Reply

That said, it appears to not be working on this page.

Going through the links in order, starting with the TOS, to make sure all the links DO go somewhere. ;) That's the first one I hit upon. 'S good work!

Thank you to Nesh and everyone else who contributed to this! =D

* by S.M.F. on 2018-07-16 20:05:00 UTC Reply

(Well, when I clicked the link posted on the 8th chapter anyway. In all other cases it seems to work.)
