All I want for Christmas is … RDF and XQuery

This post is fundamentally a response to a really interesting one by Dave Kellogg that I read a few weeks ago titled positioning MarkLogic Server, so to get the most out of it you’re going to have to check Dave’s out first. The diagram on the left, a variant on Dave’s original, shows how different types of data and content technologies square up. While I liked Dave’s analysis, I also felt there was something missing that I wanted to tease out here, so I’ve made a few modifications.

Firstly I’ve changed the two axes of the original diagram to Data (or content) complexity and Query complexity, which is what it seemed to me he was really getting at. Then I’ve annotated the diagram with some important query technologies that, for me, he passed over.

Let’s take a look at these more closely.

Squaring data and content query technologies

LDAP databases are an often overlooked category, but nevertheless an important one. LDAP is high performance, but it suffers from a rigid tree-structured data model and an inflexible query architecture.

Z39.50 metasearch technology is really a placeholder on my diagram for all of the Information Retrieval systems and search engines we know and love, from Lucene, through to Autonomy and Google. These IR systems index a relatively wide variety of content, but deliver little in the way of expressive query capability.

More importantly I’ve added the missing RDF technology piece to the chart, in the form of the SPARQL query language. It’s the comparison between XML Databases (such as MarkLogic) using XQuery, and RDF databases using query technologies like SPARQL that I think is quite illuminating.

Warning for technophobes, here: we’re going to have to plunge into the detail just a little!

One weakness in XML databases is that adding new data types means going back and re-writing all your XQuery to cope with new element names and namespaces in the new data types you add. For instance, a global query to list all <title> elements in alphabetical order might need to be modified when a new data type is added that uses a <head-title> element to encode the same information.
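To make the fragility concrete, here is a minimal sketch in Python rather than XQuery, using only the standard library. The sample documents and the `sorted_titles` helper are invented for illustration; the point is simply that a query keyed on element names silently misses the new data type until someone edits it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents: the third is a "new data type" that
# uses <head-title> instead of <title> for the same information.
DOCS = [
    "<article><title>Zebra crossings</title></article>",
    "<article><title>Aardvarks</title></article>",
    "<chapter><head-title>Monsoons</head-title></chapter>",
]


def sorted_titles(docs, title_tags=("title",)):
    """Return the text of every element whose tag is in title_tags, sorted."""
    found = []
    for xml_text in docs:
        root = ET.fromstring(xml_text)
        for element in root.iter():
            if element.tag in title_tags:
                found.append(element.text)
    return sorted(found)


# The original query only knows about <title>, so the chapter is missed.
original = sorted_titles(DOCS)

# The fix is to go back and change the query itself for each new type.
patched = sorted_titles(DOCS, title_tags=("title", "head-title"))
```

Every query that touches titles needs the same patch, which is the maintenance burden described above.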

This type of fragile query problem is completely avoided in RDF databases by using sameAs relationships to assert that differently named properties convey the same information. This problem can also be solved in a more general way by creating class and subclass relationships which allow the different types to be kept separate whenever needed and grouped together whenever needed.
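As a rough illustration of the idea, the sketch below models a triple store as a plain Python set and resolves sameAs assertions at query time. It is a toy, not how any particular RDF database works: the `ex:` names, the `SAME_AS` label and both helper functions are assumptions made for the example. What it shows is that the new property is absorbed by adding one assertion to the data, leaving the query untouched.

```python
# Label used in this toy for "these two names mean the same thing".
SAME_AS = "sameAs"

# Hypothetical triples: (subject, predicate, object).
DATA = {
    ("doc1", "ex:title", "Zebra crossings"),
    ("doc2", "ex:title", "Aardvarks"),
    ("doc3", "ex:headTitle", "Monsoons"),
}

# One extra assertion in the data, instead of a change to every query.
DATA_WITH_MAPPING = DATA | {("ex:headTitle", SAME_AS, "ex:title")}


def equivalents(triples, name):
    """All names linked to `name` by sameAs assertions, in either direction."""
    seen = {name}
    frontier = [name]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != SAME_AS:
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


def values(triples, prop):
    """Sorted objects of `prop` or of any property asserted equivalent to it."""
    props = equivalents(triples, prop)
    return sorted(o for s, p, o in triples if p in props)


before = values(DATA, "ex:title")
after = values(DATA_WITH_MAPPING, "ex:title")
```

The call `values(..., "ex:title")` is identical in both cases; only the data changed.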

This gives RDF databases significantly more power in terms of query complexity than XML databases.

However, RDF databases know nothing of document structure, and more importantly conventional information retrieval techniques for full text search and relevancy ranking are not part of any RDF standard yet. In point of fact these have only recently been added to the formal XQuery standards, but XML database vendors have supported their own proprietary versions for some time.

This gives XML databases significantly more power in terms of text search than RDF databases.

And here lies the challenge, especially for developers like Semantico who have to deliver solutions for publishers needing the attributes of both systems.

All I want for Christmas is the best of both worlds

In the past we needed to glue together an unholy alliance of full text search technologies with relational SQL databases to store metadata. This was in many ways a problem because such systems suffer from the baked-in, brittle nature of the SQL schemas combined with the simplistic query models afforded by the full text search components.

What we really need is a technology that combines the query power of the RDF model with the full text and structural abilities of the XQuery model: the best of both worlds. Until we get that, we will be gluing together RDF databases and XML databases in the same way we did full text and SQL in the past.

So, if you’ll forgive me for ending on a seasonal note, Dear Santa …

2 Responses to All I want for Christmas is … RDF and XQuery

  1. Great post, I like your extension of the original diagram.

    I wanted to point out that MarkLogic Server (I work there) provides fields… so hopefully this is an early Christmas present :) . Fields group different elements together so they can be searched as one. This is similar to the RDF sameAs example. When a new element comes along that should be part of an existing query it can be added to the field definition. No change to the query required.

    In fact, if the XML element indicated it should be part of a specific field the element could be automatically added to the field definition during ingestion – an advanced technique but very powerful.

    All of this goes beyond standard XQuery support. I am not sure what other XML database providers are doing in this area but I would not be surprised if this became relatively standard over the long run.

  2. Hi Jason,

    The fields feature in MarkLogic is definitely a step in the right direction. And it does provide an ability similar to the sameAs OWL predicate in RDF.

    RDF provides a whole array of extra capability in this area, and it is still on my wish list to be able to take both views on content or data at the same time, without duplicate stores and complicated parallel query type approaches.

    Do you have any plans to push the W3C to look at the fields feature in new developments to the XQuery standard? Whilst as a MarkLogic partner we applaud your inclusion of the feature, it would be great to see it making the standards track too.
