Posts filed under ‘Software development’
The fact that attempting recursion in PHP leaves you at the mercy of segmentation faults isn’t anything new, really (though it is new to me now). Since 1999, bug reports to this effect have been knocked back, stating that this is the intended behaviour.
This is all due to technical limitation inside PHP, which I guess is reasonable. PHP pushes to the stack each time you enter a new function call, which is quick and handy, but limits your recusion depth.
What annoys me is that it’s one more reason that PHP is not very good for creating portable code. There’s no warning or PHP error before you exhaust the recursion limit, PHP just crashes (bringing down an Apache process with it if you’re using mod_php). The “safe” depth of recursion you could do on PHP is hard to predict, and is going to vary between installations and builds. I’ve seen posts online showing people doing experiments to try and figure out PHP’s recursion limit, but there’s too many factors at work – how PHP’s been built, whether it’s 32-bit or 64-bit, the type and quantity of variables you are passing as arguments or return values, and so on. In some code of my own, I’ve seen php-cgi crashing after just 20 to 30 levels of recursion. The function was passing some rather large arrays as return values.
You could argue that the same issue is going to happen in a C program, and that any decent programmer should be ready to code around this, anyway. However, in my mind PHP should be a level of abstraction higher than this sort of detail; a programmer in such a high level scripting language should not need to worry about the underlying low-level implementation. Ideally, PHP ought to put out a PHP error, allowing for the problem to be reported gracefully just as if you’d tried to call a non-existant function, and the PHP interpreter should end the script gracefully without crashing any processes.
The bottom line is that if you’re doing recursion in PHP, you should use your own method to limit the depth to a very conservative level, or perhaps rewrite it to use a big loop instead of recursion. If, like me, you’re writing a recursive descent parser, if you need to make your code portable you may be better off to rewrite it to use a big loop.
I’ve come to learn that when someone refers to software as ‘stable’, there is more than one quite different thing they might mean.
A stable software release
A stable software release is so named because it is unchanging. Its behaviour, functionality, specification or API is considered ‘final’ for that version. Apart from security patches and bug fixes, the software will not change for as long as that version of the software is supported, usually from 1 to many years.
Software that is intended for the public to use is usually “stable”. It is released, and following the release no new features are added apart from the odd bug fix. To get new functionality users eventually need to upgrade to the next version. Any problems with the software (unless they can easily be fixed with a bug fix update) are “known” problems, and the software vendor does not need to keep track of more than one variantion of these problems for any given version.
Examples of releases that are the opposite of stable include development snapshots, beta releases, and rolling releases. A characteristic of all three of these is that they are in a frequent state of change; even their functionality and feature list can change from week to week, or day to day. You cannot depend on them to behave the same way from one week to the next.
Some people like that with non-stable releases such as development snapshots, beta releases or rolling releases, they are always getting the latest features as soon as they are written into the software. In many cases, these releases also fix deficiencies or bugs that would otherwise remain stagnant in the stable release. However, with no stability in the feature list or functionality, this affects the ability for documentation, other software that interfaces with the software, plugins, and more to function: a change in the software can mean these become out of date or fail to work anymore. When you have software which needs to work well with a lot of other software, having a stable release reduces the frequency with which changes in the software will break compatibility with the other software relying on it.
Another meaning of stable
Another meaning of stable exists in common use, where people take it to mean “working reliably” or “solid”. That is, people refer to software that runs consistently without crashing as stable. You can see why they may use the word in this way: in real life, when something can be described as stable, it won’t fall over. If a chair is stable enough, you can sit in it and it won’t topple or collapse.
However, confusion arises when people use this form of the word stable to refer to software that isn’t stable in the earlier sense. For example, it’s why you see comments like “I’ve been using the beta version since February and it is very stable” or “the newer version is more stable”. The point that these comments make is not that the software is final and unchanging, as in a stable software release, but more that the software is stable like a chair might be stable. It seems reliable, and the user hasn’t experience any major problems.
This kind of stability won’t help developers extending the software with other software, or writing plugins or customisations for the software, since the fact that at any given time the software is running well does not make up for the fact that the software is subject to frequent changes.
Commenters or reviewers who describe beta or rolling releases of software as stable, might want to try describing them as “solid” or “reliable” instead, to save confusion with a stable release which is an unchanging release. Or, the fact that the same term is understood in two different and sometimes conflicting ways may indicate that the term is not an ideal one in the first place. It does, however, seem firmly entrenched in the software development world, where the meaning of a stable release is well known.
Distributed version control is nothing to be scared of. It may be a lot simpler than a lot of people or ‘introductory tutorials’ have led you to believe.
- Distributed version control doesn’t force you into a new way of working. If you prefer a centralised approach, you can still do that. You could use it very much like you currently use svn if you want to, and yet you’ll still get the benefit of having a local history of revisions, allowing you to do a lot of things such as revert, diff, or even check in some code locally – without the overhead of network access.
- Distributed version control does not force you to have a mess of different branches of your code on different developers’ machines, with no clear indication as to which one is most ‘up to date’. You would still have a ‘main’ or ‘trunk’ branch on a server somewhere which represents the main focus of development. Developers get the added option of having new branches on their own machine, but this doesn’t mean that they can’t send that code back to some shared trunk when they finish something.
- The benefits to distributed version control do not end at the ‘coding while on a plane’ scenario. While this is indeed a benefit of distributed version control (while coding without network access you still have access to the complete revision history and can do commits), this is not the sole reason behind DVCS. You also get the benefit of reduced network overhead, making typical operations much faster, and you get a lot more flexibility in choosing a suitable workflow, and altering that workflow when it suits you. If you are working on your own pet feature but don’t want to check in your code to the main trunk until it’s done, you can still easily use revision control on your own little copy, and when it does come time to check in and merge all your changes to the trunk all the little revisions you made when you were working on it separately are preserved. This requires no setting up on the central server – any developer can create their own little parallel branch for a while, then merge it back up – it can be as simple as doing a ‘local commit’ instead of a ‘commit’ or ‘check-in’ (for me, that’s just one checkbox). Merging may sound complicated, but it’s not – it simply works. The system is smart enough to figure out which file was which and you can’t really break it – anything is reversible.
- Not all DVCS are git, or particularly like git. Git originated as a special-purpose version control system which Linus Torvalds developed to support his own personal workflow, managing the Linux kernel. While it now has pretty much all the features of a full-featured version control system, it still has its origins as a tool built especially for one person’s own workflow. Undoubtedly it is a good system, but if you try git and don’t like it, you should not assume that other distributed version control systems are the same as it. Git also has problems on Windows (a native Msys version is still being developed as of this blog post).
My version control system of choice at the moment is Bazaar, chosen because I need both Windows and Linux capability and I like that it is easy to use. There is not much separating it from Mercurial except for some small, almost philosophical conventions. Just as one almost insignificant example, Bazaar versions directories, allowing empty directories to be significant. I’d recommend either. Bazaar has the massive support of Canonical (of Ubuntu fame) and big projects such as MySQL. Mercurial has big projects such as Mozilla (of Firefox fame). You can get near instant help for either by going to their respective freenode IRC channels or by asking a question on Stack Overflow.
- If you are interested, try out this “Bazaar in five minutes” tutorial. Longer tutorials often give the impression that distributed version control is more complicated than it really is.
‘Some people, when confronted with a problem, think “I know, I’ll use XML.” Now they have two problems.’
– stolen from somewhere
- DOM is a standard, language-independent API for heirarchical data such as XML which has been standardized by the W3C. It is a rich API with much functionality. It is object based, in that each node is an object.DOM is good when you not only want to read, or write, but you want to do a lot of manipulation of nodes an existing document, such as inserting nodes between others, changing the structure, etc.
- SimpleXML is a PHP-specific API which is also object-based but is intended to be a lot less ‘terse’ than the DOM: simple tasks such as finding the value of a node or finding its child elements take a lot less code. Its API is not as rich than DOM, but it still includes features such as XPath lookups, and a basic ability to work with multiple-namespace documents. And, importantly, it still preserves all features of your document such as XML CDATA sections and comments, even though it doesn’t include functions to manipulate them.
SimpleXML is very good for read-only: if all you want to do is read the XML document and convert it to another form, then it’ll save you a lot of code. It’s also fairly good when you want to generate a document, or do basic manipulations such as adding or changing child elements or attributes, but it can become complicated (but not impossible) to do a lot of manipulation of existing documents. It’s not easy, for example, to add a child element in between two others; addChild only inserts after other elements. SimpleXML also cannot do XSLT transformations. It doesn’t have things like ‘getElementsByTagName’ or getElementById’, but if you know XPath you can still do that kind of thing with SimpleXML.
The SimpleXMLElement object is somewhat ‘magical’. The properties it exposes if you var_dump/print_r/var_export don’t correspond to its complete internal representation, and end up making SimpleXML look more simplistic than it really is. It exposes some of its child elements as if they were properties which can be accessed with the -> operator, but still preserves the full document internally, and you can do things like access a child element whose name is a reserved word with the  operator as if it was an associative array.
You don’t have to fully commit to one or the other, because PHP implements the functions:
This is helpful if you are using SimpleXML and need to work with code that expects a DOM node or vice versa.
PHP also offers a third XML library:
- XML Parser (an implementation of SAX, a language-independent interface, but not referred to by that name in the manual) is a much lower level library, which serves quite a different purpose. It doesn’t build objects for you. It basically just makes it easier to write your own XML parser, because it does the job of advancing to the next token, and finding out the type of token, such as what tag name is and whether it’s an opening or closing tag, for you. Then you have to write callbacks that should be run each time a token is encountered. All tasks such as representing the document as objects/arrays in a tree, manipulating the document, etc will need to be implemented separately, because all you can do with the XML parser is write a low level parser.
The XML Parser functions are still quite helpful if you have specific memory or speed requirements. With it, it is possible to write a parser that can parse a very long XML document without holding all of its contents in memory at once. Also, if you not interested in all of the data, and don’t need or want it to be put into a tree or set of PHP objects, then it can be quicker. For example, if you want to scan through an XHTML document and find all the links, and you don’t care about structure.
There’s plenty of information on the web about storing hierarchical data in SQL using these methods:
- Adjacency list
- Materialised paths
- Nested sets
The method I used in a personal project of mine, however, is different to all of these. Today I found this Evolt article, which pretty much describes the technique I’m using, calling it ancestor tables.
I don’t know if it’s just because I don’t know the right name for it, or if people just generally haven’t thought of it, but finding anybody using this method has been pretty difficult – for whatever reason, nested sets (which I believe has serious flaws) and materialised paths seem to be all the rage instead.
First, I’ll describe each of the alternative methods in brief. More information is available in this article from DBAzine, though you can find an easier to understand description of nested sets in this one from MySQL.
An adjacency list just means that for each node, you also store the ID of its parent node. It’s easy to write a query to find immediate parents or children of an node using this method, but finding a list of ascendents of descendents, including non-immediate ones, requires some sort of fancy recursion. That’s a well acknowledged limitation of this method, and if you look around the web you’ll find a lot of people pointing this out, at the same time singing the praises of nested sets as if they’re the only alternative.
A materialised path means that for each node, you store a string which represents the path to that node. For instance, the node with id 13 might have a path of ‘1.2.12’, meaning that it is the immediate child of 12, which is the child of 2, which is the child of 1. This opens up a few more possibilities in terms of efficient queries that can be made. For example, you can easily find all descendents of an node using a WHERE path LIKE ‘1.2.%’ type of syntax, or just WHERE path=’1.2′ if you only want immediate children. Efficiently finding ancestors is still a bit fiddly, as is moving an node to elsewhere in the table, but it’s not unmanageable. I actually think it’s a good solution.
Nested sets are more complicated than any other method. For each node, you store two integers, which represent a ‘range’. The ‘root’ node of the tree contains the lowest and highest numbers of the whole tree, and each branch contains the lowest and highest number of that branch. It’s probably easiest to illustrate this with a diagram (which I found in this article). Each number between the lowest and highest is used once and only once in the whole tree. The major benefit to this is that it makes finding all descendents of a node fairly efficient. To find children of an node, just find all nodes WHERE leftvalue > parent.leftvalue AND rightvalue < parent.rightvalue. It’s highly inefficient, however, when you only want immediate children, ie only a single level of descendents. It also lets you down substancially when making any modification to any node in the tree; any creation, deletion or moving of an node will always require, on average, half of the rows in the whole table to be updated. Good if the tree is very small or you never plan to update it; bad otherwise.
Variations of nested sets exist which attempt to solve some of its problems, but these tend come at the cost of even greater complexity. I was reading about a method with ever decreasing fractions for increasing levels of the tree earlier.
My ancestor tables method can probably be thought of as similar to a materialised path, in that it requires about the same amount of information, except that it doesn’t concatenate it all together into a string to be stored in a single column value, but represents each ancestor in its own row in a separate relation table:
- ancestor_ID (int)
- node_ID (int)
- level (int)
For each node added to the tree, you add rows to this ancestor table describing its ancestry. So for example, if node 13 is the child of 12, which is the child of 2, which is the child of 1, this would be represented in the ancestor table as:
The total number of rows needed in this ancestor table is related to the number of ancestor-descendent relationships in the whole tree. If your average node is nested only 4 levels away from the root node, then you only need about 4 times the number of nodes. It’s much less even than O(n log n).
(When I do it, I also includes a 0th level for each node, where ancestor_ID equals node_ID and level is 0. There was only one edge case where this helped me for my specific project.)
The method allows for all of the following queries to be efficient, requiring no recursive joins or multiple queries.
- Find the parent of a node:
SELECT ancestor_ID FROM ancestors WHERE node_ID=<nodeid> AND level=1
- Find all ancestors of its node, including its parent, and each parent in turn:
SELECT ancestor_ID FROM ancestors WHERE node_ID=<nodeid>
- Find all the immediate children of a node:
SELECT node_ID FROM ancestors WHERE ancestor_ID=<nodeid> AND level=1
- Find all the descendents of a node, including all immediate children and their descendents:
SELECT node_ID FROM ancestors WHERE ancestor_ID=<nodeid>
As you can see, none of these queries need recursive joins, or require the database to inspect more rows than they need to, and none of them even require looking up certain information (such as the path to the requested node, or left and right values) before actually doing the query that returns the rows.
Add a LEFT OUTER JOIN to your main node table, and you can fetch all the necessary data about each node (name, properties, etc) in the one query.
You can even do efficient sorting via the same index used to fetch the rows, as long as you add columns to the ancestor tables for whatever data you want to sort on and use indexes wisely.
It also means that when inserting a new node, or making another edit to the tree, you do not have to modify the majority of the tree – only the entries in the ancestor tables that belong to that node. This is similar to the materialised paths technique, where you only need to update the path for the node you change.
I have seen the GPL software license described as a cancer before, and this analogy does indeed hold, even though it misses the point. If you want to borrow any code released as “GPL” for use in your own source code, any of your own code that the GPL is mixed with will also need to become GPL if ever you plan on distributing software. However, this rule of the GPL is actually a very good thing, especially for hard working open source developers, because it ensures that their work can never be exploited by companies wishing to make a buck, unless those companies give their contributions to the code back under the same license, effectively becoming part of the same open source development effort.
The LGPL is mostly the same as the GPL except that it releases some of these restrictions. I am not a law expert, but I have come to the following understanding. In particular, it allows for code to be combined in the same project with code from any other license including proprietary licenses, under certain conditions. If the whole thing, including the LGPL code as well as the other code, is distributed together, then it becomes known as a ‘combined work’ under the terminology of the LGPL. The distribution must abide by the following (a complete list exists in the text of the license under ‘Combined Works’):
- There must be some sort of clear separation between the LGPL code and the other code. In particular, it must be possible for the recipient to modify the LGPL code or even completely replace it with other code, such as a modified version or later version of the libary. Therefore, if it is a compiled program, then the LGPL code must either be dynamically linked (like, a separate DLL or shared file) such that it could easily be substituted for a similar library and still be interoperable; or, if it is statically linked, the minimum required source files and/or object files must be provided, to allow recompiling with an alternative library. The non-LGPL portion may not contain any part of the LGPL code except for very simple header files.
- It must be clearly pointed out which part of the code is LGPL covered, including its original copyright notice and the text of the LGPL (including the GPL on which it is based).
- If the combined software displays copyright notices during the course of running, then the copyright notice for the LGPL covered portion must also appear here, along with a link to the LGPL and GPL.
- In some cases, you would need to provide installation information detailing how to use a modified version of the LGPLd code in the combined application.
The above is just a list of stuff that you have to do if you are going to redistribute some LGPL code as part of your non-LGPL application. However, the text of the LGPL doesn’t really explain the benefits, or what opportunities it opens up to you. If you are distributing a combined work which consists of both someone else’s LGPL code, and other code that is under a different license, you benefit from:
- You don’t need to release source code for the rest of your application (ie, the non-LGPL part). The only exception to this being as described above – if it’s all statically linked then you’d need to provide just enough code (and/or object files) to be able re-link it with an alternative or modified version of the LGPL code. If you are linking dynamically and interacting over a normal API, you don’t need to worry about that.
- You don’t need to release the rest of your application under the GPL. You can use any license you want, including more restrictive proprietary licenses, provided that when you do distribute it you follow the rules.
- Unlike GPL version 3, which prohibits using the code if you are implementing copy protection or DRM software, you can use LGPL version 3 licensed code in an application which includes copy protection or DRM.
So, LGPL is a fairly permissive license in comparison to the GPL, which cannot be ‘combined’ with other proprietary licenses in this way. So, what can’t you do with LGPL? Well, apart from relaxed rules such as those already discussed, LGPL still contains all of the other restrictions of the GPL. I won’t go into all of the detail of the GPL here, but here are some of the restrictions that you still retain when using the LGPL, that you may not retain with an even more permissive license (such as a BSD or X11 license):
- Anybody wishing to modify and redistribute LGPL code must also make the source code of the LGPL code, including their modifications, available and licensed under the LGPL. Therefore, the open source community and the original author can still benefit if a commercial enterprise contributes improvements to the LGPL code.
- All existing copyright notices in the LGPL code, as well as the GPL and LGPL licenses themselves, must be prominently displayed in any redistribution. In addition to this, any application using the LGPL code must display such copyright notices and refer to the license whereever they display their own copyright notices.
Something else of note is that whether you are building software that relies on either GPL or LGPL code, if you distribute that software without any of the GPL or LGPL included (ie, the user needs to download and install it themselves), then it’s not considered a ‘combined work’ and you don’t need to worry about complying with any terms. However, if you include even so much as a header file from the GPL or LGPL code in your software, then you’ll be bound by the license terms of the GPL or LGPL.