Subject: Regular Expressions are awesome! Or, what I did.
Author:
Posted on: 2018-07-16 23:55:00 UTC

I was one of the people working on Lost Tales, and I primarily did automatic conversions. I did these conversions using a tool called "regular expressions," which I thought might interest some of the more technically minded among the boarders. Well, the technically minded boarders who don't already know what regular expressions are. I get the feeling my audience will be vanishingly small... we'll see. Anyways, regular expressions can basically be described as text search on crack. And also search and replace on crack. Sound interesting? Read on.

Normal search functions allow you to search for a precise fragment of text. Maybe you can search out all the instances of "Hello sweetie" in a document. Maybe case-insensitive search if you're really fancy. Now look at this:

/ ... /

That, believe it or not, was a regular expression. They're typically surrounded by slashes when we write them out like this, for reasons which would take too long to explain. Anyways, a dot is regular-expression-speak for "any character". So this will find any three characters (doesn't matter what they are: punctuation, letters, numbers, whitespace, whatever) with a space before and a space after. Cool, huh? No? It's basically useless? Okay, try this:
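A rough sketch of the same idea in Python, for anyone who wants to try it (Python is just a convenient choice for illustration; the function name and sample strings are made up). Python doesn't write the surrounding slashes, so the pattern is only the part between them, spaces included.

```python
import re

# The part between the slashes: a space, any three characters, a space.
THREE_ANYTHING = re.compile(r" ... ")

def first_match(text):
    """Return the first piece of text the pattern matches, or None."""
    m = THREE_ANYTHING.search(text)
    return m.group(0) if m else None

# Made-up samples: letters, punctuation, and even spaces count as "any character".
for sample in ["the cat sat", "wow !?# wow", "a 1 2 b", "no match here"]:
    print(repr(sample), "->", repr(first_match(sample)))
```

The last sample finds nothing because it never has exactly three characters sitting between two spaces.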

/[0-9]{3}-[0-9]{3}-[0-9]{4}/

This will find any standard-written, dash-separated phone number in a document. How does it work? Well, those brackets ("[" and "]") signify a character group, that is, a list of characters where we want any one of them. You can also specify ranges of characters. In this case, I specified the range 0-9. So any digit. The braces say how many we want: {3,5} specifies we want between 3 and 5 of whatever came before it, and a single number like {3} means exactly that many. In this case, I gave {3}, and what came before was a character group meaning digits. So we want three digits, then a dash, then three digits, then a dash, then four digits. Simple, no?

Actually, wanting digits is so common there's a neat shorthand: \d. So the previous can be re-written as:

/\d{3}-\d{3}-\d{4}/

Much simpler.
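To check that the two spellings really find the same things, here is a small sketch in Python (chosen only for illustration; the sample text and the function name are invented). One assumption worth flagging: in Python 3, \d on ordinary strings also matches digits from other scripts, so the sketch passes re.ASCII to make it mean exactly [0-9].

```python
import re

# Long form with explicit character groups, and the \d shorthand.
LONG_FORM = re.compile(r"[0-9]{3}-[0-9]{3}-[0-9]{4}")
# re.ASCII restricts \d to 0-9 so both patterns accept the same characters.
SHORT_FORM = re.compile(r"\d{3}-\d{3}-\d{4}", re.ASCII)

def find_numbers(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return pattern.findall(text)

text = "Call 555-867-5309 or 555-123-4567, not 55-1234."  # made-up sample

print(find_numbers(LONG_FORM, text))
print(find_numbers(SHORT_FORM, text))
```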

"But Thoth!" I hear you cry. "What if I want to match a literal period? Or maybe I want a literal brace, or something?" Fear not, my friends. Just stick a backslash in front of the character you want. So:

/\.\{\*/

Will search for ".{*" exactly. Right, here's a real regex I used (sed ERE, for those of you who already know what's up and are wondering) to lowercase all the HTML tags in Lost Tales. Okay, not the actual one, that one had a really nasty bug. This is a version without that bug:

s/<(\/?)([A-Z]+)([^>]*)>/<\1\L\2\E\3>/g

Okay, that's terrifying. Let's take it apart. That 's' at the front isn't a regex thing, it just tells the tool I'm using that I want to do a search and replace, or a "substitution." The g at the back, likewise, tells the tool that I want to replace every piece of text that matches, not just the first. The forward slash in the middle (the one without a backslash in front of it) separates what I'm searching for from what I'm replacing it with. When we take all that junk out, and just look at what I'm searching for, we get this:

/<(\/?)([A-Z]+)([^>]*)>/

Okay, that's terrifying. Let's break it down.

<

This just says to find the literal character <. That's it.

(\/?)

Okay, these parentheses mark out a capture group. They're not literal parentheses, they're just here to group together parts of the expression. Also, every capture group is assigned a number. This is capture group 1. Which will only matter much later.

Anyways, what's going on inside? Well, the backslash is an escape, so we're looking for the literal character "/". (Without the backslash, the tool would take that slash for the separator between the search and the replacement.) And then there's a question mark.

The question mark means "maybe". So, we're saying there might be a slash, or there might not be. If there is, put it in group one. If not, just put emptiness in group one.

([A-Z]+)

Group 2. We have another range of characters. This time, it's all the capital letters. The plus means "one or more". So we're looking for one or more capital letters, which we'll put in group two.

([^>]*)

Group 3. Another character set. But there's a caret at the start! That means that we're actually looking for the opposite. So, this character set matches anything that is not a ">". And the star means "zero or more." So, if there are any characters that aren't ">", stuff them in group three. If not, don't make a big deal about it.

>

This is just a ">". It means that there's a ">" here.

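So if we put it all together, this regular expression first looks for a "<". Then it checks for an optional "/" (that slash, or lack thereof, is group one). Then it wants one or more capital letters (group two), followed by zero or more characters that aren't ">" (those characters, or lack thereof, make up group three). Finally, it wants a ">".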
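If seeing the groups filled in helps, here's a minimal sketch in Python (used only for illustration; the function name and sample tags are made up). Python has no surrounding slashes, so the "/" doesn't need a backslash there; otherwise the pattern is the same.

```python
import re

# Same search pattern as above; the slash needs no escaping in Python.
TAG = re.compile(r"<(/?)([A-Z]+)([^>]*)>")

def tag_groups(text):
    """Return (group 1, group 2, group 3) for the first tag found, or None."""
    m = TAG.search(text)
    return m.groups() if m else None

# Made-up sample tags.
print(tag_groups('<A HREF="x.html">'))  # opening tag with an attribute
print(tag_groups("</DIV>"))             # closing tag
print(tag_groups("<p>"))                # already lowercase: no match
```

The first call gives an empty group one, "A" in group two, and the attribute text (leading space included) in group three. The second gives "/", "DIV", and an empty group three.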

Gee, I hope that makes sense. I'm kinda bad at explaining things. Anyways, here's what we're replacing all that with:

<\1\L\2\E\3>

Let's break this down too.

<

Okay, the first thing we're going to put in is a "<".

\1

So, what this means is "whatever was in group one on what you matched." If you'll recall, that was either a forward slash or nothing.

\L\2\E

\L means "turn anything between me and '\E' into lowercase." \2 is group 2, the uppercase letters. (\L and \E are a GNU sed extension, so not every tool understands them.)

\3

Just put down group 3. That is, recall, the stuff between the uppercase letters and the ">".

>

Put down a ">".

So, if we put it all together, this:

s/<(\/?)([A-Z]+)([^>]*)>/<\1\L\2\E\3>/g

Means, "Find every piece of text that starts with a '<', which may or may not be followed by a slash (remember if it is), continues with a series of capital letters (remember them), and then whatever else (but remember what's there), followed by a '>', and replace it with a '<', and then a slash if there was one originally, and then that series of capital letters, but lowercase, then whatever came after, and then a '>'."

Phew, that's a mouthful. Aren't you glad you can just write that mess of punctuation instead?
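For the curious, here is roughly the same substitution sketched in Python. This is an illustration only, not the sed command above. Python's re module has no \L...\E in replacement strings, so the sketch swaps in a small replacement function that lowercases group two. The function names and the sample HTML are made up.

```python
import re

# Same search pattern; in Python the slash needs no backslash.
TAG = re.compile(r"<(/?)([A-Z]+)([^>]*)>")

def lowercase_tags(html):
    """Lowercase the tag name of every uppercase tag, leaving the rest alone."""
    def rebuild(m):
        # "<", optional slash, lowercased name, whatever followed, ">"
        return "<" + m.group(1) + m.group(2).lower() + m.group(3) + ">"
    # re.sub replaces every match, like the trailing g does in sed.
    return TAG.sub(rebuild, html)

sample = '<P CLASS="x">Hi <B>there</B></P>'  # made-up sample
print(lowercase_tags(sample))
```

Note that only the tag name changes; anything in group three, such as the CLASS attribute here, comes back exactly as it went in.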

Now, regular expressions (which work more or less as I described; consult your local text editor/tool documentation for variations) don't work everywhere. However, they do work in most programmers' text editors (if you don't use one of those already, then on Windows, I suggest Notepad++, and on Mac I'd endorse Atom). If you work with HTML or other plaintext all day, you owe it to yourself to learn them, and there are a lot of excellent online resources to help you do just that.

Hopefully helpful and/or informatively yours,
Thoth.
