XMLStreamin

Recently I wanted to do some XML processing in Ruby, but a (fairly casual) browse of available packages didn't seem to serve up quite what I wanted. So naturally (:-)), I wrote my own (on top of REXML)… I'm sure the task I had in mind could have been achieved with an existing package, but I'm a bit more used to the philosophy here. In any case, if anyone else is interested, here it is.

XMLStreamin is a small Ruby module that provides a way of reading an XML document as a stream, while letting it be processed according to its tree structure much more handily than the usual 'flat' stream reader does. (See below for a downloadable archive, or you can look at the source itself: xmlstreamin.rb.)

Unlike such readers, which typically call the same unspecialized methods for each start-tag, end-tag, text segment, and so on, and leave it to the application to sort out the hierarchy, XMLStreamin uses a pre-built tree of XMLSpec nodes to model the expected document structure.
(The module is fairly basic: it only handles the main hierarchy of the document. No attention is paid to other constructs such as declarations, as it is not intended as a do-everything parser. Because the XMLStreamListener class is derived from REXML::StreamListener, you could add methods to handle such things if needed.)
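
For comparison, here is a minimal sketch of the 'flat' approach using plain REXML, with no XMLStreamin involved. The FlatListener class, its path stack and the sample document are all made up for illustration; only REXML::StreamListener and REXML::Document.parse_stream are real REXML names.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# A conventional "flat" listener: REXML calls the same three methods for
# every element, so the application has to track the nesting itself.
class FlatListener
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @path = []     # stack of the tag names that are currently open
    @titles = []   # text collected from library/book/title elements only
  end

  def tag_start(name, _attrs)
    @path.push(name)
  end

  def tag_end(_name)
    @path.pop
  end

  def text(data)
    # The hierarchy check is done by hand on every text event.
    return unless @path == %w[library book title]
    @titles << data.strip
  end
end

xml = "<library><book><title>Dune</title></book><title>ignored</title></library>"
listener = FlatListener.new
REXML::Document.parse_stream(xml, listener)
puts listener.titles.inspect
```

Every handler has to consult the stack to find out where it is in the document, which is the bookkeeping that a tree of per-element nodes is meant to take over.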

Each node specifies the actual actions to be taken when an element that it represents is encountered. It can specify what processing needs to be done on the attributes of a start-tag, the handling of included text, and any clean-up actions when the end-tag is read. It also contains a table of the expected sub-elements and their XMLSpec nodes, thus reflecting the document tree.

The central class is 'XMLStreamListener', which extends REXML::StreamListener to provide an interface to a 'tree' of 'XMLSpec' nodes that models the hierarchy of the XML document to be read.

'XMLSpec' is a base class intended to be extended as needed to handle processing for each type of expected element at each level of the XML hierarchy in the document. It has two categories of methods: the setup methods 'specs!', 'default!', and 'spec' (which should not need to be modified), and the handler methods 'start', 'done', 'empty' and 'text', which should be specialized as necessary in derived classes or instances.

XMLSpec nodes are intended to be linked in a tree structure, reflecting the structure of the XML document to be read. Each node has a 'dispatch' (hash) table associating expected tag names with the subordinate XMLSpec nodes that should handle them; the hash table default should reference a node that will handle unexpected tags. If appropriate, you can use a single node to service several different tag names (bearing in mind that the dispatch table will be shared).
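
The dispatch idea can be sketched roughly as follows. This is NOT the module's code: SketchSpec, SketchVoid, SketchCollect and SketchListener are hypothetical stand-ins, and the handler signatures are my guesses for the sake of the example (see xmlstreamin.rb and the RDoc for the real ones). It assumes each node holds a Hash whose default value is the node for unexpected tags, and that the listener keeps a stack of the currently active nodes.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical stand-in for a spec node; not the real XMLSpec class.
class SketchSpec
  attr_reader :children

  def initialize(fallback = nil)
    # Hash default: the node returned for tag names that are not listed.
    @children = Hash.new(fallback)
  end

  # Handler hooks, meant to be overridden per node (signatures assumed).
  def start(name, attrs); end
  def text(data); end
  def done(name); end
end

# Ignores an element and everything beneath it.
class SketchVoid < SketchSpec
  def initialize
    super(nil)
    @children.default = self
  end
end

# Collects the non-blank text of each element it handles.
class SketchCollect < SketchSpec
  attr_reader :seen

  def initialize(fallback)
    super
    @seen = []
  end

  def text(data)
    @seen << data.strip unless data.strip.empty?
  end
end

# Listener that keeps a stack of active nodes and dispatches through them.
class SketchListener
  include REXML::StreamListener

  def initialize(root)
    @stack = [root]
  end

  def tag_start(name, attrs)
    node = @stack.last.children[name]   # look up the child node by tag name
    @stack.push(node)
    node.start(name, attrs)
  end

  def text(data)
    @stack.last.text(data)
  end

  def tag_end(name)
    @stack.pop.done(name)
  end
end

# Build the tree: library > book > title; anything else falls to the void node.
void = SketchVoid.new
title = SketchCollect.new(void)
book = SketchSpec.new(void)
book.children["title"] = title
library = SketchSpec.new(void)
library.children["book"] = book
root = SketchSpec.new(void)
root.children["library"] = library

xml = "<library><book><title>Dune</title><isbn>123</isbn></book><title>stray</title></library>"
REXML::Document.parse_stream(xml, SketchListener.new(root))
puts title.seen.inspect
```

Because the 'title' node is only reachable through the 'book' node's table, the stray title directly under 'library' never reaches it; the position in the tree does the filtering, not the handler.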

There are two predefined global XMLSpecs: '$specXMLVoid', which is a do-nothing basic XMLSpec that can be used to represent elements that you aren't interested in, and '$specXMLFail' which will raise an error if it is invoked.
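
As a rough illustration of how the two behaviours differ when used as a table default, here is a small pure-Ruby sketch. VoidSketch and FailSketch are hypothetical stand-ins, not the real '$specXMLVoid' and '$specXMLFail' objects, and the one-argument 'start' method is an assumption made for brevity.

```ruby
# Hypothetical stand-in for a do-nothing node.
class VoidSketch
  def start(name); end
end

# Hypothetical stand-in for a node that rejects whatever reaches it.
class FailSketch
  def start(name)
    raise ArgumentError, "unexpected element <#{name}>"
  end
end

# Two dispatch tables that differ only in their default node.
lenient = Hash.new(VoidSketch.new)
strict = Hash.new(FailSketch.new)
strict["title"] = VoidSketch.new

lenient["isbn"].start("isbn")   # unknown tag, silently ignored

outcome =
  begin
    strict["isbn"].start("isbn")   # unknown tag, rejected
    :accepted
  rescue ArgumentError => e
    e.message
  end
puts outcome
```

Choosing the default per table lets one part of the document tolerate unknown elements while another part treats them as errors.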

To use this module, xmlstreamin.rb should be either in the local directory or on the Ruby library path. It can then be loaded with 'require "xmlstreamin"'. It loads "rexml/document" and "rexml/streamlistener" itself. The latter should not be needed outside the module, but XMLStreamListener is invoked via a call to "REXML::Document.parse_stream".

See "xmldemo.rb" and the comments therein for a small (and contrived) example of how to use it. RDoc documentation is also provided here and in the archive.

Download (your choice...same content):

GZip'ped Tar archive (xmlstreamin.tgz 14K)
Zip archive (xmlstreamin.zip 38K)
Archive Contents: