module timelinelib.general.xmlparser

A simple, validating, SAX-based XML parser.

Since it is simple, it has some limitations:

  • It cannot parse attributes

  • It cannot parse arbitrarily nested structures

  • It can only parse text in leaf nodes. In other words, mixed content such as this cannot be parsed: <a>some text <b>here</b> and there</a>
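To see why mixed content is awkward for a parser that hands each tag a single text value, the following sketch records the raw SAX events for the snippet above. It uses only the standard library's xml.sax; the TextEvents handler is a hypothetical name introduced for illustration and is not part of this module.

```python
import xml.sax


class TextEvents(xml.sax.ContentHandler):
    """Record SAX events in the order they arrive (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


handler = TextEvents()
xml.sax.parseString(b"<a>some text <b>here</b> and there</a>", handler)
for event in handler.events:
    print(event)
```

Text events for <a> arrive both before the start of <b> and after its end, so the text of <a> is not one contiguous run the way the text of a leaf tag is.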

Here’s an example of how to parse a simple XML document using this module.

First we create a file-like object containing the XML data (any file-like object is fine, but we create a StringIO for the purpose of making a working example):

>>> from io import StringIO
>>> xml_stream = StringIO('''
... <db>
...     <person>
...         <name>Rickard</name>
...     </person>
...     <person>
...         <name>James</name>
...         <age>38</age>
...     </person>
... </db>
... ''')

Then we define two parser functions that we later associate with Tag objects. Parse functions are called when the end tag has been read. The first argument to a parse function is the text that the tag contains. It will be empty for all tags except leaf tags. The second argument is a dictionary that can be used to store temporary variables. This dictionary is passed to all parse functions, providing a way to share information between parse functions.

>>> def parse_name(text, tmp_dict):
...     tmp_dict["tmp_name"] = text
>>> def parse_person(text, tmp_dict):
...     # text is empty here since person is not a leaf tag
...     name = tmp_dict.pop("tmp_name")
...     age = tmp_dict.pop("tmp_age", None)
...     print("Found %s in db." % name)
...     if age is not None:
...         print("%s is %s years old." % (name, age))

Next we define the structure of the XML document that we are going to parse by creating Tag objects. The first argument is the name of the tag, the second specifies how many times it can occur inside its parent (should be one of SINGLE, OPTIONAL, or ANY), the third argument is the parse function to be used for this tag (can be None if no parsing is needed), and the fourth argument is a list of child tags.

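>>> from timelinelib.general.xmlparser import (
...     ANY, OPTIONAL, SINGLE, Tag, parse, parse_fn_store)
>>> root_tag = Tag("db", SINGLE, None, [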
...     Tag("person", ANY, parse_person, [
...         Tag("name", SINGLE, parse_name),
...         Tag("age", OPTIONAL, parse_fn_store("tmp_age")),
...     ]),
... ])

The parse_fn_store function returns a parser function that works exactly like parse_name: it takes the text of the tag and stores it in the dictionary with the given key (tmp_age in this case).
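A factory like this is naturally written as a closure. The sketch below illustrates the behaviour described above; it is not necessarily the module's actual source, and the names parse_fn_store_sketch and store are hypothetical.

```python
def parse_fn_store_sketch(store_key):
    """Return a parse function that stores the tag text under store_key.

    Illustrative stand-in for parse_fn_store, not the library code.
    """
    def store(text, tmp_dict):
        tmp_dict[store_key] = text
    return store


tmp_dict = {}
parse_age = parse_fn_store_sketch("tmp_age")
parse_age("38", tmp_dict)   # same call shape as any other parse function
print(tmp_dict)             # {'tmp_age': '38'}
```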

The last step is to call the parse function with the stream, the tag configuration, and a dictionary. The dictionary can be populated with values before parsing starts if needed.

>>> parse(xml_stream, root_tag, {})
Found Rickard in db.
Found James in db.
James is 38 years old.

The parse function raises a ValidationError if the XML is not valid and a SAXException if the XML is not well-formed.
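The well-formedness half of this can be reproduced with the standard library alone. The sketch below feeds a broken document to xml.sax.parse with a do-nothing handler and catches the resulting SAXException; when calling this module's parse function, a second except clause for ValidationError would presumably sit alongside it.

```python
import io
import xml.sax

# <person> is never closed, so the document is not well-formed.
broken_xml = io.BytesIO(b"<db><person></db>")

try:
    xml.sax.parse(broken_xml, xml.sax.ContentHandler())
except xml.sax.SAXException as error:
    # SAXParseException, the error raised here, is a SAXException subclass.
    message = "not well-formed: %s" % error.getMessage()
    print(message)
```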

timelinelib.general.xmlparser.SINGLE = 1
timelinelib.general.xmlparser.OPTIONAL = 2
timelinelib.general.xmlparser.ANY = 3

exception timelinelib.general.xmlparser.ValidationError

Bases: Exception

Raised when the parsed XML document does not follow the schema.

class timelinelib.general.xmlparser.Tag

Bases: object

Represents a tag in an XML document.

Used to define the structure of an XML document and to define parser functions for individual parts of it.

Parser functions are called when the end tag has been read.

See the SaxHandler class below for how this class is used.

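__init__(name, occurrence_rule, parse_fn, child_tags=[])

name is the name of the tag, occurrence_rule specifies how many times it can occur inside its parent (one of SINGLE, OPTIONAL, or ANY), parse_fn is the parse function to use for this tag (None if no parsing is needed), and child_tags is a list of child Tag objects.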

add_child_tags(tags)
add_child_tag(tag)
read_enough_times()
can_read_more()
handle_start_tag(name, tmp_dict)
handle_end_tag(name, text, tmp_dict)
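The method names read_enough_times and can_read_more suggest that each tag counts its occurrences and checks the count against its occurrence rule. The sketch below shows one way such checks could work. It assumes SINGLE means exactly once, OPTIONAL means at most once, and ANY means any number of times. The functions are standalone and take the count as an argument for illustration; they are not the Tag methods themselves.

```python
SINGLE = 1
OPTIONAL = 2
ANY = 3


def read_enough_times(occurrence_rule, count):
    """True if a tag seen `count` times has met its assumed minimum."""
    if occurrence_rule == SINGLE:
        return count == 1
    return True  # OPTIONAL and ANY are assumed to have no minimum


def can_read_more(occurrence_rule, count):
    """True if one more occurrence of the tag would still be allowed."""
    if occurrence_rule in (SINGLE, OPTIONAL):
        return count == 0
    return True  # ANY is assumed to have no maximum


# A SINGLE tag that was never seen would be a validation failure ...
print(read_enough_times(SINGLE, 0))
# ... and so would a second occurrence of an OPTIONAL tag.
print(can_read_more(OPTIONAL, 1))
```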

class timelinelib.general.xmlparser.SaxHandler

Bases: xml.sax.handler.ContentHandler

__init__(root_tag, tmp_dict)

startElement(name, attrs)

Called when a start tag has been read.

endElement(name)

Called when an end tag (and everything between the start and end tag) has been read.

characters(content)

Receive notification of character data.

The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.
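Because text may arrive in several chunks, a handler typically buffers them and joins the pieces when the end tag is reached. The sketch below shows that pattern with the standard library's xml.sax. LeafTextCollector is a hypothetical handler written for illustration; it is not this module's SaxHandler.

```python
import xml.sax


class LeafTextCollector(xml.sax.ContentHandler):
    """Collect the text of each element, joining split chunks."""

    def __init__(self):
        super().__init__()
        self.texts = []     # (element name, complete text) pairs
        self._chunks = []   # chunks seen since the last start or end tag

    def startElement(self, name, attrs):
        self._chunks = []

    def characters(self, content):
        # May be called once or several times for the same run of text.
        self._chunks.append(content)

    def endElement(self, name):
        self.texts.append((name, "".join(self._chunks).strip()))
        self._chunks = []


collector = LeafTextCollector()
xml.sax.parseString(b"<person><name>Rickard</name></person>", collector)
print(collector.texts)  # [('name', 'Rickard'), ('person', '')]
```

Resetting the buffer at every start and end tag is what leaves the non-leaf person element with empty text in this sketch.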

timelinelib.general.xmlparser.parse(xml, schema, tmp_dict)

xml should be a filename or a file-like object containing XML data.

schema should be a Tag object defining the structure of the XML document.

tmp_dict is used by parser functions in Tag objects to share data. It can be pre-populated with values.

timelinelib.general.xmlparser.parse_fn_store(store_key)

Return a parse function that stores the text of the tag in the dictionary under store_key.

timelinelib.general.xmlparser.parse_fn_store_to_list(store_key)
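parse_fn_store_to_list is not described here. Judging by its name alone, it likely returns a parse function that appends the text of each occurrence to a list stored under store_key, which would suit tags with the ANY occurrence rule. The sketch below implements that assumed behaviour under a hypothetical name; check the module source before relying on it.

```python
def parse_fn_store_to_list_sketch(store_key):
    """Return a parse function that appends the tag text to a list.

    Assumed behaviour, inferred from the function name only.
    """
    def append(text, tmp_dict):
        # Create the list on first use, then append every occurrence.
        tmp_dict.setdefault(store_key, []).append(text)
    return append


names = {}
parse_name = parse_fn_store_to_list_sketch("tmp_names")
parse_name("Rickard", names)
parse_name("James", names)
print(names)  # {'tmp_names': ['Rickard', 'James']}
```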