‘Some people, when confronted with a problem, think “I know, I’ll use XML.” Now they have two problems.’
– stolen from somewhere
- DOM is a standard, language-independent API for hierarchical data such as XML, standardized by the W3C. It is a rich API with a great deal of functionality. It is object-based, in that each node is an object. DOM is good when you want to do more than just read or write: it shines when you need to do a lot of manipulation of nodes in an existing document, such as inserting nodes between others, changing the structure, etc.
- SimpleXML is a PHP-specific API which is also object-based but is intended to be a lot less verbose than the DOM: simple tasks such as finding the value of a node or finding its child elements take a lot less code. Its API is not as rich as DOM’s, but it still includes features such as XPath lookups and a basic ability to work with multiple-namespace documents. And, importantly, it still preserves all features of your document, such as XML CDATA sections and comments, even though it doesn’t include functions to manipulate them.
SimpleXML is very good for read-only use: if all you want to do is read the XML document and convert it to another form, it’ll save you a lot of code. It’s also fairly good when you want to generate a document, or do basic manipulations such as adding or changing child elements or attributes, but it can become complicated (though not impossible) to do a lot of manipulation of existing documents. It’s not easy, for example, to add a child element in between two others; addChild only inserts after other elements. SimpleXML also cannot do XSLT transformations. It doesn’t have things like ‘getElementsByTagName’ or ‘getElementById’, but if you know XPath you can still do that kind of thing with SimpleXML.
The SimpleXMLElement object is somewhat ‘magical’. The properties it exposes to var_dump/print_r/var_export don’t correspond to its complete internal representation, which makes SimpleXML look more simplistic than it really is. It exposes some of its child elements as if they were properties which can be accessed with the -> operator, but it still preserves the full document internally, and you can do things like access a child element whose name is not a valid PHP identifier (one containing a hyphen, for example) using brace syntax, as in $element->{'my-name'}.
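To illustrate how little code simple reads take, here is a minimal sketch; the XML document and element names are invented for illustration:

```php
<?php
// A minimal sketch of reading a document with SimpleXML; the XML and
// element names here are invented for illustration.
$xml = <<<XML
<book isbn="12345">
  <title>An Example</title>
  <chapter><title>First</title></chapter>
  <chapter><title>Second</title></chapter>
</book>
XML;

$book = new SimpleXMLElement($xml);

// Child elements read like properties; attributes read like array entries.
echo $book->title, "\n";    // An Example
echo $book['isbn'], "\n";   // 12345

// Repeated child elements can be iterated directly.
foreach ($book->chapter as $chapter) {
    echo $chapter->title, "\n";
}
```

The equivalent DOM code would need getElementsByTagName calls and explicit text-node handling for each of these lookups.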
You don’t have to fully commit to one or the other, because PHP implements functions to cross between them:
- dom_import_simplexml()
- simplexml_import_dom()
This is helpful if you are using SimpleXML and need to work with code that expects a DOM node or vice versa.
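A short sketch of crossing between the two; the document contents here are made up for illustration:

```php
<?php
// Start with a SimpleXML view of a document...
$sx = new SimpleXMLElement('<root><item>hello</item></root>');

// ...and get a DOMElement wrapping the same underlying node,
// so code expecting DOM can work with it.
$domElement = dom_import_simplexml($sx);
echo $domElement->ownerDocument->saveXML($domElement), "\n";

// Going the other way: from a DOM node back to SimpleXML.
$doc = new DOMDocument();
$doc->loadXML('<root><item>world</item></root>');
$sx2 = simplexml_import_dom($doc->documentElement);
echo $sx2->item, "\n";   // world
```

Note that both functions wrap the same underlying libxml node rather than copying it, so changes made through one view are visible through the other.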
PHP also offers a third XML library:
- XML Parser (an implementation of SAX, a language-independent interface, though not referred to by that name in the manual) is a much lower-level library, which serves quite a different purpose. It doesn’t build objects for you. It basically just makes it easier to write your own XML parser, because it does the job of advancing to the next token and finding out its type (such as what the tag name is and whether it’s an opening or closing tag) for you. You then write callbacks that are run each time a token is encountered. Tasks such as representing the document as objects/arrays in a tree, or manipulating the document, all need to be implemented separately, because all the XML Parser gives you is the makings of a low-level parser.
The XML Parser functions are still quite helpful if you have specific memory or speed requirements. With them, it is possible to write a parser that can parse a very long XML document without holding all of its contents in memory at once. Also, if you’re not interested in all of the data, and don’t need or want it to be put into a tree or set of PHP objects, it can be quicker: for example, if you want to scan through an XHTML document and find all the links, and you don’t care about structure.
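That link-scanning example might be sketched like this, assuming a well-formed XHTML fragment; the markup here is invented:

```php
<?php
// Sketch: using the low-level XML Parser (SAX-style) functions to collect
// the href of every <a> element, without building any tree at all.
$links = [];

$parser = xml_parser_create();
// By default element names are upper-cased ("case folding"); turn that off.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler(
    $parser,
    // Called once per opening tag; we only care about <a href="...">.
    function ($parser, $name, $attrs) use (&$links) {
        if ($name === 'a' && isset($attrs['href'])) {
            $links[] = $attrs['href'];
        }
    },
    // Called once per closing tag; nothing to do for this task.
    function ($parser, $name) {}
);

$xhtml = '<p>See <a href="/one">one</a> and <a href="/two">two</a>.</p>';
xml_parse($parser, $xhtml, true);
xml_parser_free($parser);

echo implode("\n", $links), "\n";
```

For a genuinely huge document you would call xml_parse repeatedly with chunks read from a file, which is exactly the memory advantage described above.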
- Some Creative Commons licenses are ‘free’ in the sense that open source software is free.
- Other Creative Commons licenses are ‘not free’ in the sense that they restrict use of the material in ways that are counter to ‘freedom’ as defined by the Free Software Foundation or the Open Source Initiative, to draw a parallel with software licenses.
In this article I just wanted to clarify the difference for those using a CC license, so that they are not inadvertently preventing others from using their work with an unnecessarily restrictive license.
Thankfully, the creativecommons.org website now has a useful “Approved for Free Cultural Works” icon and colour scheme, to help you tell them apart. For example:
- This Attribution Share-Alike Generic 2.5 license is green and has the icon, so you know it is a ‘free’ license.
- This Attribution Non-Commercial Generic 2.5 license is yellow and does not have the icon, so it is not a ‘free’ (as in freedom) license.
Creativecommons.org has chosen to adopt the meaning of ‘freedom’ as defined by Freedomdefined.org, a definition which is basically equivalent to that used for open source software. It states that for a work to be considered a free cultural work, it must have the following four freedoms:
- The freedom to use the work and enjoy the benefits of using it
- The freedom to study the work and to apply knowledge acquired from it
- The freedom to make and redistribute copies, in whole or in part, of the information or expression
- The freedom to make changes and improvements, and to distribute derivative works
Freedom applies to everyone
For these freedoms to be valid, they must be unconditional and apply to everyone, regardless of who they are or what they intend to use the work for. This means that any license with a non-commercial clause is not free in the sense that any business wanting to use the work commercially would have to make a separate arrangement with the author. One of the basic rules of open source software is that businesses are allowed to use it in order to profit from it; if what they could do with it was restricted, open source software would be avoided by commercial enterprises. Companies wouldn’t be installing Linux on their clients’ systems, for example.
The same applies to non-software cultural works: allowing anyone the freedom to use the work, regardless of whether they intend to profit, enables businesses to assist the proliferation of the work.
Freedom includes freedom to make changes
These freedoms also include the freedom to make changes and improvements. If a license does not allow derivative works, it is another example of restricting users’ ability to do whatever they like with the work. The ability to modify the work is seen as advantageous for the community because it allows the work to be improved by others, without a separate arrangement being made with the original author. To draw a comparison with open source software again, if businesses were not allowed to modify Linux and provide their own version of it, many businesses would not be able to exist, and the behaviour of Linux would be entirely under the control of a single entity. Allowing others to modify your work allows businesses to exist that support the work through improving it.
Some restrictions are still acceptable
Requiring copyright notices to be preserved, or requiring any derivative works to be given the same license (a share-alike clause), are still considered acceptable restrictions by the free software and free cultural works movements. It’s just that any restrictions beyond these, such as preventing commercial use or preventing modifications, are not.
Quick guide to choosing a Creative Commons license
There is nothing wrong with choosing a non-free license for your work: It is the creator’s right not to license their work, or to apply any restrictions they desire. If you are considering releasing something under a Creative Commons license, you should consider which rights you want to retain. One reason for retaining a right would be if you want to make money from it.
So, here’s a quick guide on how to choose between the licenses:
- Including a non-commercial clause allows you to retain the sole right to make money from distributing the work. If allowing others freedom to use the work is more important to you than making money, then don’t include a non-commercial clause.
- Not allowing derivative works allows you to retain the sole right to alter the work, which lets you charge money for alterations or prevent them entirely. If allowing others to use and improve the work is more important to you than making money from or preventing alterations, then make sure you allow derivative works.
- If you do not care about money, or about controlling who is allowed to do what with the work (save requiring a copyright notice on it), but you do care that the work is free for all to use and modify as they see fit, then make sure the Creative Commons license you choose is a green one, with the ‘Approved for Free Cultural Works’ icon. This will give your work the best chance of being re-used and shared by as many people as possible.
There’s plenty of information on the web about storing hierarchical data in SQL using these methods:
- Adjacency list
- Materialised paths
- Nested sets
The method I used in a personal project of mine, however, is different to all of these. Today I found this Evolt article, which pretty much describes the technique I’m using, calling it ancestor tables.
I don’t know if it’s just because I don’t know the right name for it, or if people generally haven’t thought of it, but finding anybody else using this method has been pretty difficult – for whatever reason, nested sets (which I believe have serious flaws) and materialised paths seem to be all the rage instead.
First, I’ll describe each of the alternative methods in brief. More information is available in this article from DBAzine, though you can find an easier to understand description of nested sets in this one from MySQL.
An adjacency list just means that for each node, you also store the ID of its parent node. It’s easy to write a query to find the immediate parent or children of a node using this method, but finding all ancestors or descendants, including non-immediate ones, requires some sort of fancy recursion. That’s a well-acknowledged limitation of this method, and if you look around the web you’ll find a lot of people pointing it out, at the same time singing the praises of nested sets as if they were the only alternative.
A materialised path means that for each node, you store a string which represents the path to that node. For instance, the node with id 13 might have a path of ‘1.2.12’, meaning that it is the immediate child of 12, which is the child of 2, which is the child of 1. This opens up a few more possibilities in terms of efficient queries. For example, you can find all descendants of node 2 using a WHERE path LIKE '1.2.%' OR path = '1.2' type of syntax, or just WHERE path = '1.2' if you only want its immediate children. Efficiently finding ancestors is still a bit fiddly, as is moving a node to elsewhere in the table, but it’s not unmanageable. I actually think it’s a good solution.
Nested sets are more complicated than either of the other methods. For each node, you store two integers, which represent a ‘range’. The ‘root’ node of the tree contains the lowest and highest numbers of the whole tree, and each branch contains the lowest and highest number of that branch. It’s probably easiest to illustrate this with a diagram (which I found in this article). Each number between the lowest and highest is used once and only once in the whole tree. The major benefit is that it makes finding all descendants of a node fairly efficient: just find all nodes WHERE leftvalue > parent.leftvalue AND rightvalue < parent.rightvalue. It’s highly inefficient, however, when you only want immediate children, i.e. a single level of descendants. It also lets you down substantially when making any modification to the tree; any creation, deletion or moving of a node will, on average, require half of the rows in the whole table to be updated. Good if the tree is very small or you never plan to update it; bad otherwise.
Variations of nested sets exist which attempt to solve some of these problems, but they tend to come at the cost of even greater complexity. I was reading earlier about a method that uses ever-decreasing fractions for deeper levels of the tree.
My ancestor tables method can probably be thought of as similar to a materialised path, in that it requires about the same amount of information, except that it doesn’t concatenate it all together into a string to be stored in a single column value, but represents each ancestor in its own row in a separate relation table:
- ancestor_ID (int)
- node_ID (int)
- level (int)
For each node added to the tree, you add rows to this ancestor table describing its ancestry. So for example, if node 13 is the child of 12, which is the child of 2, which is the child of 1, this would be represented in the ancestor table as:
- ancestor_ID 12, node_ID 13, level 1
- ancestor_ID 2, node_ID 13, level 2
- ancestor_ID 1, node_ID 13, level 3
The total number of rows needed in this ancestor table is the number of ancestor-descendant relationships in the whole tree. If your average node is nested only 4 levels below the root, then you only need about 4 times the number of nodes: the table grows at roughly O(n × average depth), which for shallow trees is even less than O(n log n).
(When I do it, I also include a 0th level for each node, where ancestor_ID equals node_ID and level is 0. There was only one edge case where this helped me in my specific project.)
The method allows for all of the following queries to be efficient, requiring no recursive joins or multiple queries.
- Find the parent of a node:
SELECT ancestor_ID FROM ancestors WHERE node_ID=<nodeid> AND level=1
- Find all ancestors of a node, including its parent, and each parent in turn:
SELECT ancestor_ID FROM ancestors WHERE node_ID=<nodeid>
- Find all the immediate children of a node:
SELECT node_ID FROM ancestors WHERE ancestor_ID=<nodeid> AND level=1
- Find all the descendents of a node, including all immediate children and their descendents:
SELECT node_ID FROM ancestors WHERE ancestor_ID=<nodeid>
As you can see, none of these queries need recursive joins, or require the database to inspect more rows than they need to, and none of them even require looking up certain information (such as the path to the requested node, or left and right values) before actually doing the query that returns the rows.
Add a LEFT OUTER JOIN to your main node table, and you can fetch all the necessary data about each node (name, properties, etc) in the one query.
You can even do efficient sorting via the same index used to fetch the rows, as long as you add columns to the ancestor tables for whatever data you want to sort on and use indexes wisely.
It also means that when inserting a new node, or making another edit to the tree, you do not have to modify the majority of the tree – only the entries in the ancestor tables that belong to that node. This is similar to the materialised paths technique, where you only need to update the path for the node you change.
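To sketch the bookkeeping involved, here is a minimal PHP/PDO version using an in-memory SQLite database purely for illustration; the helper function name addNode is my own invention, and the table and column names follow the article:

```php
<?php
// Sketch of maintaining an ancestor table (SQLite via PDO, for illustration).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE ancestors (ancestor_ID INT, node_ID INT, level INT)');

// When a node is added under a parent, its ancestry is just the parent's
// ancestry shifted down one level, plus the parent itself at level 1.
function addNode(PDO $db, int $nodeId, ?int $parentId): void {
    if ($parentId === null) {
        return; // a root node has no ancestor rows
    }
    $db->prepare(
        'INSERT INTO ancestors (ancestor_ID, node_ID, level)
         SELECT ancestor_ID, ?, level + 1 FROM ancestors WHERE node_ID = ?'
    )->execute([$nodeId, $parentId]);
    $db->prepare(
        'INSERT INTO ancestors (ancestor_ID, node_ID, level) VALUES (?, ?, 1)'
    )->execute([$parentId, $nodeId]);
}

// Build the tree from the example above: 1 -> 2 -> 12 -> 13.
addNode($db, 1, null);
addNode($db, 2, 1);
addNode($db, 12, 2);
addNode($db, 13, 12);

// All descendants of node 2, with no recursion at all:
$stmt = $db->prepare(
    'SELECT node_ID FROM ancestors WHERE ancestor_ID = ? ORDER BY level'
);
$stmt->execute([2]);
$rows = $stmt->fetchAll(PDO::FETCH_COLUMN);
echo implode(', ', $rows), "\n";   // node 12, then node 13
```

Deleting or moving a subtree is the inverse of the same idea: delete the subtree's ancestor rows and re-insert them against the new parent, touching only the rows belonging to the moved nodes.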
Users really appear to love being able to give a ‘thumbs up’ or ‘thumbs down’ to any statement they see on a website.
Strongly disagree with a YouTube comment? Give a thumbs-down! You have expressed an opinion in only a single mouse-click!
The ease of expressing pleasure or displeasure at someone else’s opinion in a single click seems to be a highly effective way of getting feedback from your users, because it exploits their desire to have their say while reducing the barrier to entry: typing a reply in words is no longer necessary, and neither is logging in, filling out a form, or even visiting a different page.
Harness the crowd’s wisdom
Simple feedback systems like this can even serve as a no-maintenance extension to your comment moderation: given enough down-votes, your system can be pretty sure, without you even reading it, that a comment is offensive or irrelevant enough to be removed. A YouTube comment with many down-votes appears hidden by default; depending on how many, you may still be able to view it, but it’s highly likely to be offensive or spam. It appears to be pretty effective. Users are willing to do your moderation for you even if they get nothing in return other than the satisfaction of showing their approval or disapproval.
Getting feedback on a blog in the form of comments is very difficult: for every thousand people who read something, only a tiny fraction will go through the effort required to fill in their name and write out a proper response, even if you have a comment form that requires no approval or sign-up. If you are writing something highly controversial or offensive, or taking a side on a ‘hot topic’ (Apple sucks, Microsoft is better), you’ll probably find that fraction rise substantially, but otherwise eight hundred people could read a blog post before anyone comments. So, given that it is so hard to get any feedback by comments, why not allow one-click feedback?
What I think of as the YouTube model is not unique to YouTube: Facebook uses the same sort of thing, so does Digg (of ‘digg it’ fame), and my new favourite StackOverflow does the same sort of thing too (though you need reputation to vote), and many others – sadly, sites such as WordPress.com haven’t followed yet. The basic characteristics of this model are:
- One click ‘vote up’ or ‘vote down’ buttons next to comments.
- Clicking them records your vote instantly without a page refresh (Ajax techniques are used).
- There is usually some way that voting something ‘down’ penalises it; it may cause it to move further down the page, or a certain number of down-votes may ‘hide’ it.
I like it so much that when I find myself reading a user comment and can’t give it a thumbs-up or thumbs-down, it frustrates me; I’ve come to expect to be able to give one-click feedback.
The success of Hot or Not and a whole generation of clones showed the addictive popularity of giving users the ability to give feedback with no more intellectual effort than a single click. Instead of a single up-vote or down-vote, however, the user had to choose a value out of ten, and while it only required a single click, it did result in a page load. Nevertheless, people spent hours and hours on sites following that model. While originally they were rating photos of people based on looks, the concept spread to rating all sorts of other things, like graphic design work, poetry, and jokes.
I believe that the thumbs up/down approach takes this two steps further – by reducing the number of available choices down to two instead of ten, and by accepting the feedback without a page reload (due to Ajax techniques).
Years ago I implemented a rating system on a website of my own, making a conscious decision to reduce the number of possible choices from ten down to only three. My belief at the time was that this was a sweet spot between getting enough useful information from users and being simple enough that as many users as possible would use it, because it was such a no-brainer. Adding the voting option under each piece of content did increase participation and page views per user. In retrospect, I could have reduced it further to a single ‘up-vote’ and ‘down-vote’, and I suspect the participation rate would have been even higher due to the lower mental effort required. The results allowed me to rank items on the site according to popularity; the front page item was always one of the most ‘popular’ in terms of votes.
As I publish this, I just noticed that WordPress.com allows nested comments now – maybe they can allow ratings on comments one day soon!
I’ve been tempted to write why OpenID has been driving me up the wall.
I have not implemented OpenID in any application, so I come at it not as an implementor or programmer but as an end user: a number of sites I’ve used, including Stack Overflow and Sourceforge, have either allowed or insisted upon OpenID authentication.
My first OpenID account was at Verisign Labs (PIP). They’re well established in web security, so I figured it would be a reliable service, and a company that wasn’t likely to disappear on me. Their service, however, left me frustrated for a few reasons.
- For some reason (early onset dementia?), I could never remember my OpenID URL and found myself needing to look it up all the time, which meant starting up my email client. Because it’s not only a username I chose, but also includes the web address of the OpenID provider, I found it easier to forget. I can’t really see ordinary web users finding the URL thing intuitive; for some time now, favourites/bookmarks and search engines have been teaching us that remembering URLs shouldn’t be necessary.
- The Verisign Labs PIP has one of the most user-unfriendly features I have ever experienced. With the aim of preventing phishing attacks, a well-meaning goal, it does not allow you to authenticate yourself at any OpenID-supported site at all unless you have already logged in directly at Verisign’s website during the same browser session. Try typing in your OpenID at your favourite site, and you get a message from Verisign telling you that no, you haven’t logged in to Verisign this session, so you can’t proceed. When I encounter this, I have no choice but to open up a second tab and head over to their site to log in, except that much of the time I can’t, because I don’t have a browser certificate installed on the computer I’m using at the time (I don’t think it’s abnormal to use more than one computer regularly). So in order to authenticate me, it has to send me an email containing a single-use PIN. Thank goodness my email account doesn’t use OpenID authentication and I can get to that fairly easily. I’ve never had to jump through so many hoops just to log in to an application I already have an account at.
- Once I’ve started using an OpenID identity from a certain provider on a site or two, it would appear that I am tied to that OpenID provider for life. It makes it very hard to evaluate OpenID providers when your choice is a permanent one. Yes, I realise that it is possible to use delegation, or even to install your own OpenID server, but if we’re going to be talking about end users, neither of these two are really practical, and both of them are likely to result in decreased security.
My second OpenID provider, MyOpenID, appears to be a fair bit easier to get along with, and doesn’t suffer from many of the problems I’d previously encountered.
Simply by opening another OpenID account, however, everything has become exponentially more complicated: if you switch providers, there’s no easy way that I can see to merge all site accounts based on an identity at my previous provider across to the new one. It seems like changing providers may mean ditching a bunch of old accounts and signing up for all new ones. I was impressed at the way Stack Overflow’s implementation allowed switching the OpenID identity associated with my account there. Unfortunately, this flexibility is a result only of Stack Overflow’s thoughtful design, and such a feature is not part of a typical OpenID implementation.
MyOpenID, thankfully, allows me to authenticate myself without having to twiddle around with going to the OpenID provider’s site in a separate browser window or getting a single-use PIN. I suppose it is similar to what the OpenID experience should have been like from the start. Maybe my Verisign Labs PIP account just had too many optional features turned on.
I still find, however, that some things about OpenID underwhelm me:
- Signing up for a new account at an OpenID-enabled site appears no easier when using OpenID. After authenticating with my OpenID URL and whatever authentication I need to do at the OpenID provider’s end, when I return to the client site I still have to fill out a form, and most of the time I still have to confirm my email address. Some fields have been pre-filled by my OpenID account, but I still need to choose a username that is unique to that application, and likely even fill in a Captcha.
- Users are well experienced already with simple username/password combinations. They know, for example, that the password should be kept secret, and it’s that secret that provides their security. Even though they might have several username/password combinations at different sites, this doesn’t make things any more complicated, because the same concept is just repeated. With an OpenID account, however, not only do they now have a username and password at their OpenID provider, but they also have this OpenID URL, and maybe even a browser certificate. That is three or four pieces of information. Furthermore, how will they understand that authenticating with an OpenID URL alone can provide any security, when the OpenID URL is not a secret, and there is no password? I wouldn’t be surprised if users thought that OpenID was grossly insecure, because they don’t understand that all the real security is hidden from them.
- I also wouldn’t be surprised if the idea that their identity is passed between sites made users a bit worried. For instance, how can an OpenID implementor reassure the user that even if they use their OpenID URL to log in and register, that doesn’t mean the implementor now has the password to the user’s OpenID account? All the beneficial security concepts are a black box to the users, who may just assume that the OpenID account is a way for their password and identity to be freely passed around between sites. Far from using it only when high security is needed, we may find that users, unaware of the security benefits to OpenID, only trust OpenID with information they don’t mind losing.
So far I haven’t been convinced that using OpenID is significantly safer – even when comparing it to re-using the same username and password at a whole bunch of different sites, which is itself a dubious security practice. With OpenID, I still have all my eggs in one basket. If an attacker gains access to my OpenID account, he can still impersonate me at all sites where I rely on that identity.
OpenID is a well-meaning idea, and I am sure I will get used to it with more experience, but being this confusing and headache-inducing even to a web developer is a clear indication that it has some way to go before it can be considered fit for general use. Get this: the Wikipedia page for OpenID displays a prominent warning reading “This page may be too technical for a general audience”, applying to various sections, including the section titled “Logging in”. If it is too hard to describe how to “log in” without alienating a non-technical audience, that is a sign that the process is not very usable, and anyone thinking of implementing OpenID in order to ‘simplify’ things for end users may need to think twice.
While some boast about big companies like Google adopting OpenID, it’s not really all that much to crow about – their support is only as a provider, not as an implementor. I cannot, for example, use an existing OpenID to authenticate myself at Google; I can only use a Google ID to authenticate myself elsewhere. Not allowing OpenID authentication themselves doesn’t contribute to the widespread use of OpenID but further segregates it, which is probably just as much of an injustice to OpenID as its indecipherable Wikipedia page.
Now that Gmail offers proper IMAP access for free, I think there are now few reasons not to use it for all my non-work email.
Gmail’s 7GB (and growing) of space allows it to be a ‘store everything’ type of mailbox, as opposed to ‘store what I haven’t downloaded yet’ (as with POP) or ‘store the last x days’ worth’ (as with a small IMAP box).
My web hosting provider allows POP or IMAP access, but it’s restricted to only 100MB, so it’s not really usable as a ‘store everything’ box, not to mention that I might change hosting providers some day. I really love my host, but the possibility exists that I’ll outgrow them or need some new whiz-bang feature one day.
My current email strategy is:
- Download all mail to my home computer, but have it left on the server for 7 days.
- I can still access at least the last 7 days’ worth of mail when I’m away from home.
- My Gmail account fetches mail from my mailbox via POP every x minutes, so I have another copy of everything on Gmail.
That third point was to be a temporary measure, but I find it just too convenient to be able to search all my mail on Gmail while I am away from home. I might as well forward everything to Gmail.
More points about Gmail:
- Gmail doesn’t force you to use your ‘@gmail.com’ address as your ‘from’ address. You can use an address with your own domain name in it as the default, so Gmail does not suffer from that type of lock-in: if you move to a new provider, you can keep your email address.
- Gmail’s web interface is better than any web interface I have seen an ISP or a web hosting provider provide. It even rivals desktop based email clients.
- Keeping a copy of everything on Google’s server acts as a really easy, free, form of off-site backup. My current off-site backup strategy consists of burning a DVD of my Thunderbird mail box folder every other month if I remember it, and tucking the DVD into a drawer at work.
The only hesitation I have, but one which I feel is pretty important, is that entrusting all of my email to Google would vastly increase the amount of damage done should an attacker – or a Google employee (unlikely) – gain access to my account. Rather than just 7 days’ worth of emails being available, as with another provider, Google would store an entire history of possibly personal and confidential mail. This includes such secrets as password reminder emails for online services. I’d feel better about it if I could encrypt Gmail’s entire contents with my own key, that Google themselves didn’t have access to, and nor did anyone who had gained access to the account. Of course, it’s not really possible with the way Gmail works.
So, is using Gmail worth it as a ‘store everything’ mail box for personal email?