Simple xml
This article presents the source code for a really simple xml parser, just under 100 lines of C/C++ code.

Many things tend to revolve around xml these days. It looks like pretty much everyone involved with file formats, configuration files, RSS feeds, and countless other uses of markup languages wants to use xml for the job. Of course, there is a lot of trendiness in that, and often a regular comma-separated data file will do just fine. Yet you need an xml parser anytime you are required to read data carried as xml. For that purpose, full-fledged parsers which also happen to include xslt engines and the like are absolute overkill. MSXML, Apache Xerces, and the .NET Xml parser are three examples of that. Relying on a simple xml parser will do just fine.

What's more, nobody in practice uses the entire set of capabilities of xml, such as processing instructions and entities. Most of the time, people use UTF-8 as the encoding since it's the best trade-off between universality and character size, but that's not even a requirement. In some cases, people don't use attributes at all. After all, attributes are a special kind of element children or, conversely, direct element children can be regarded as attributes as well. There is still that ongoing debate between attributers and elementers...

Finally, it was a kind of personal challenge to come up with a parser that wouldn't be larger than, say, 100-200 lines of code. After all, why should xml require anything more if xml is *that* simple?
Features and limitations
The code reads regular xml with the following limitations:
With that in hand, it looks like our parser is pretty dumb. Nevertheless, it will meet most needs for the simple reason that all of the limitations above concern features that are often disregarded anyway. The xml parser produces a document object model (DOM), that is, a hierarchy of tree nodes which client code can navigate.
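Implementation details
The parsing process does nothing more than keep track of reserved symbols such as < and >. Whenever the parser reaches a < symbol followed by a /, it stores the text value, if any, that was declared before that closing tag. The result of the parse is entirely encapsulated by the node hierarchy. Each node, as defined below, can have an arbitrary number of siblings and children, plus a link back to its parent node.

    #include <cstddef>  // NULL
    #include <string>

    struct Node;
    typedef Node* LPNode;  // assumed to be declared ahead of the struct, since Node refers to it

    struct Node
    {
        enum NodeType { elem = 0, attr };

        NodeType type;
        std::string name;
        std::string value;
        LPNode parent, child, sibling, attrib;

        // name and value default to empty strings; assigning NULL to a
        // std::string is undefined behavior, so only the pointers are reset here
        Node(NodeType t)
            : type(t), parent(NULL), child(NULL), sibling(NULL), attrib(NULL)
        {
        }

        // a node owns its children, its following siblings and its attributes;
        // the parent pointer is a back-reference and is not deleted
        ~Node()
        {
            delete child;   child = NULL;
            delete sibling; sibling = NULL;
            delete attrib;  attrib = NULL;
        }
    };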
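To make the idea of tracking < and > more concrete, here is a minimal sketch of such a loop. It is not the article's actual ReadXml: the names SketchNode, AppendChild and ParseSketch are made up for illustration, and the sketch assumes input without attributes, comments, processing instructions or entities. It also assumes that a value is only kept for elements that have no child elements.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical node type using a first-child / next-sibling layout.
struct SketchNode {
    std::string name;
    std::string value;
    SketchNode* parent = nullptr;
    SketchNode* child = nullptr;    // first child
    SketchNode* sibling = nullptr;  // next sibling
    ~SketchNode() { delete child; delete sibling; }
};

// Append n as the last child of parent.
static void AppendChild(SketchNode* parent, SketchNode* n) {
    assert(parent && n);
    n->parent = parent;
    if (!parent->child) { parent->child = n; return; }
    SketchNode* last = parent->child;
    while (last->sibling) last = last->sibling;
    last->sibling = n;
}

// Returns a synthetic root whose children are the top-level elements,
// or nullptr when tags are unbalanced. The caller deletes the result.
SketchNode* ParseSketch(const std::string& xml) {
    SketchNode* root = new SketchNode;
    SketchNode* cur = root;
    std::string text;  // character data seen since the last tag
    std::size_t i = 0;
    while (i < xml.size()) {
        if (xml[i] != '<') { text += xml[i++]; continue; }
        std::size_t close = xml.find('>', i);
        if (close == std::string::npos) { delete root; return nullptr; }
        if (i + 1 < close && xml[i + 1] == '/') {
            // closing tag: it must match the element currently open
            std::string name = xml.substr(i + 2, close - i - 2);
            if (cur == root || cur->name != name) { delete root; return nullptr; }
            if (!cur->child) cur->value = text;  // keep text of leaf elements only
            cur = cur->parent;
        } else {
            // opening tag: create a node and descend into it
            SketchNode* n = new SketchNode;
            n->name = xml.substr(i + 1, close - i - 1);
            AppendChild(cur, n);
            cur = n;
        }
        text.clear();
        i = close + 1;
    }
    if (cur != root) { delete root; return nullptr; }  // some element was never closed
    return root;
}
```

Calling ParseSketch("<a><b>hello</b></a>") yields a root whose first child is a, whose first child in turn is b with the value hello. The synthetic root is a convenience of this sketch, so that several top-level elements can be returned as siblings.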
Using the parser is really straightforward:

    #include "simplexml.h"

    // some xml fragment (an array rather than char*, since a string
    // literal cannot be bound to a non-const char* in modern C++)
    char xml[] = "...";

    // parse the xml fragment
    LPNode dom = ReadXml(xml);
    if (!dom) return FALSE;

    // dump the node tree
    DumpDom(dom, 0);

    // remember to delete the resulting dom
    delete dom;
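The body of DumpDom is not shown here. As a rough idea of what such a traversal could look like, the sketch below walks a child/sibling tree recursively and indents each level. DumpNode, DumpSketch and DumpToString are hypothetical names, the two-space indentation and the "name = value" output format are assumptions, and attributes are left out.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the article's Node, reduced to the fields used here.
struct DumpNode {
    std::string name;
    std::string value;
    DumpNode* child = nullptr;    // first child
    DumpNode* sibling = nullptr;  // next sibling
};

// Write the node and its following siblings, one per line,
// recursing into children with one more level of indentation.
void DumpSketch(const DumpNode* node, int depth, std::ostream& out) {
    assert(depth >= 0);
    for (; node; node = node->sibling) {
        out << std::string(depth * 2, ' ') << node->name;
        if (!node->value.empty()) out << " = " << node->value;
        out << '\n';
        DumpSketch(node->child, depth + 1, out);
    }
}

// Convenience wrapper returning the dump as a string.
std::string DumpToString(const DumpNode* node) {
    std::ostringstream out;
    DumpSketch(node, 0, out);
    return out.str();
}
```

Siblings are handled by the loop and children by the recursion, so the recursion depth follows the nesting depth of the document rather than the number of nodes.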
History
August 10, 2004 - First implementation. The parser does not support attributes and comments.
Stéphane Rodriguez - Oct 12, 2006.