A Little Sparql


Quite often I'll get distracted by an idea or piece of technology and mull it over for quite some time. Most of the time I play through these ideas in my head and don't do anything with them. It's often just an exercise in learning and understanding. I'll read up on something and then try to think about how it works, what I could use it for and so forth.

Occasionally though I'll feel compelled to actually write some code to test out the idea. It's good to prototype things. It's particularly handy for testing out new tech without the constraints of an existing system.

The latest idea that's been running around my head has been triplestores. I've yet to actually need to use one, but I wanted to get a better understanding of how they work and what they can be used for. Having spent a good number of years in the SQL database world, I figure it's a good idea to try out some alternatives - if only to make sure that I'm not shoe-horning an SQL database into a problem purely because I know how they work.

So rather than do the sane thing and download a triplestore and start playing around with it, I decided to write my own. Obviously I don't have unlimited time, so it had to be simple. As such I decided an in-memory store would still let me learn a fair bit, without being too onerous. Over the past few weeks I've been working on this triplestore and it's now at a point where it's largely functional:

https://github.com/lilspikey/mini-sparql

To be honest, writing this has been as much an exercise - a coding kata - as anything else.

A few key points of the implementation:

  • PyParsing for parsing the SPARQL queries
  • Lots of generator functions to make evaluation lazy
  • Interactive prompt for running SPARQL queries
  • Load triple data from a file or stdin (as triples in a simplified turtle format)

As the data is all stored in memory it's no good as a persistent store, but for quickly inspecting some triple data it works pretty well. Simply point the minisparql.py script at a file containing triple data and away you go:

$ python minisparql.py birds.ttl
sparql> SELECT ?name WHERE { ?id name ?name . ?id color red }
name
'Robin'
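
The birds.ttl file isn't shown here, but a data file in the simplified turtle format might look something like this (the contents below are just an illustrative guess, not the actual file):

robin name 'Robin' .
robin color red .
sparrow name 'Sparrow' .
sparrow color brown .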

PyParsing

PyParsing is a great project for writing recursive descent parsers. It makes use of operator overloading to allow you to (mostly) write readable grammars. You can also specify what objects you want returned from parsing - so it's quite easy to parse some code and get back a list of objects that can then be used to execute it.
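
As a rough illustration of that style (a hypothetical toy grammar, not code from minisparql), a parse action can turn matched text straight into Python values:

from pyparsing import Word, Suppress, alphas, nums

# toy grammar: parse assignments like "answer = 42"
name = Word(alphas)
number = Word(nums).setParseAction(lambda toks: int(toks[0]))
assignment = name('var') + Suppress('=') + number('value')

result = assignment.parseString('answer = 42')
print(result.var, result.value)   # answer 42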

It also includes quite a few extra batteries. One of the most powerful is operatorPrecedence, which makes it a piece of cake to create the classic arithmetic style expressions we all know and love. For example, the operatorPrecedence call in minisparql looks like:

expr = Forward()
# code removed for clarity
expr << operatorPrecedence(baseExpr, [
        (oneOf('!'), 1, opAssoc.RIGHT, _unaryOpAction),
        (oneOf('+ -'), 1, opAssoc.RIGHT, _unaryOpAction),
        (oneOf('* /'), 2, opAssoc.LEFT, _binOpAction),
        (oneOf('+ -'), 2, opAssoc.LEFT, _binOpAction),
        (oneOf('<= >= < >'), 2, opAssoc.LEFT, _binOpAction),
        (oneOf('= !='), 2, opAssoc.LEFT, _binOpAction),
        ('&&', 2, opAssoc.LEFT, _binOpAction),
        ('||', 2, opAssoc.LEFT, _binOpAction),
    ])

This handles grouping the various operators appropriately. So rather than parsing:

5 * 2 + 3 + 2 * 4

And getting back just a flat list of the individual elements, you instead get a properly nested parse tree that is equivalent to:

((5 * 2) + 3) + (2 * 4)

Where each binary operator only operates on two other expressions (just as you would expect under classic C operator precedence).
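
To see that grouping in isolation, here's a stripped-down sketch (bare precedence levels only, none of minisparql's parse actions):

from pyparsing import Word, nums, oneOf, opAssoc, operatorPrecedence

integer = Word(nums)
arith = operatorPrecedence(integer, [
    (oneOf('* /'), 2, opAssoc.LEFT),
    (oneOf('+ -'), 2, opAssoc.LEFT),
])

# multiplication binds tighter than addition, so the * sub-expressions
# get grouped first; prints something like:
# [[['5', '*', '2'], '+', '3', '+', ['2', '*', '4']]]
print(arith.parseString('5 * 2 + 3 + 2 * 4'))

Chains of the same operator come back as one flat group, which a parse action (like _binOpAction above, presumably) can then fold left-to-right into binary operations.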

This meant that implementing FILTER expressions was pretty straightforward.

Generators

Generator functions are one of my favourite features in Python. I was completely blown away by David Beazley's Generator Tricks for Systems Programmers. He starts off with fairly simple examples of generators and shows how they can be combined, much like piped commands in Unix, to create extremely powerful constructs. Therefore it was obvious to me that generators were a good fit for minisparql.
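
As a trivial example of that pipeline style (nothing to do with minisparql, and the log file name is made up), generators chain together much like shell commands piped into one another:

def read_lines(path):
    # lazily yield lines from a file, one at a time
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n')

def grep(pattern, lines):
    # filter a stream of lines, like the Unix grep command
    for line in lines:
        if pattern in line:
            yield line

# roughly equivalent to: cat access.log | grep 404
for line in grep('404', read_lines('access.log')):
    print(line)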

The code for the regular, optional and union joins became pretty straightforward using generators. For example the union join looked like:

def match(self, solution):
    for m in self.pattern1.match(solution):
        yield m
    for m in self.pattern2.match(solution):
        yield m

This simply returns the results of one query, followed by the results of the other. The great thing about this is that if we only need the first few values we might never have to evaluate the second query at all. Better still, you start getting results straight away. A more classical approach would entail building up a list of results:

def match(self, solution):
    matches = []
    for m in self.pattern1.match(solution):
        matches.append(m)
    for m in self.pattern2.match(solution):
        matches.append(m)
    return matches

This code would always execute both queries - even if it wasn't strictly necessary. To my eye the generator code is much clearer as well: yield stands out, in much the same way that return does. It's certainly much clearer than matches.append(m), which doesn't immediately signal that data is being returned/yielded from the function.
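
To make the laziness concrete, here's a self-contained sketch (the queries are stand-ins, not minisparql's actual classes) showing that the second query never runs if the caller stops early:

from itertools import islice

def cheap_query():
    # stands in for self.pattern1.match(solution)
    for i in range(3):
        yield {'name': 'bird-%d' % i}

def expensive_query():
    # stands in for self.pattern2.match(solution)
    print('running the expensive query!')
    yield {'name': 'albatross'}

def union_match():
    # same shape as the union join's match method above
    for m in cheap_query():
        yield m
    for m in expensive_query():
        yield m

# ask for just the first two solutions - the expensive query never runs,
# so 'running the expensive query!' is never printed
print(list(islice(union_match(), 2)))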

Anyway, so here's my silly little bit of exploration. I've learnt a fair bit and got to play with some different ideas.

If I were to take this any further I'd probably want to add some sort of disk-based backend. That way it might be useful for small to medium-sized SPARQL/RDF datasets in simple apps - much like SQLite for SQL databases, or Whoosh for full-text search.