Many publishers want to monetise their books and reference content by making these materials available online. A common strategy for driving sales is to combine book and journal content within a single platform, on the expectation that synergies between different types of content will drive discovery and increase usage.
Some publishers choose to adapt an existing off-the-shelf journals platform to meet their needs for consolidation. However, there can be technical problems inherent in this approach.
In a recent post to the CrossTech blog, publishing guru Geoffrey Bilder analysed the issues facing CrossRef members wishing to use the DOI system for non-journal content. Geoff’s analysis holds good for more than just the CrossRef DOI system, so I’ve taken it as a starting point below.
Reference publications and databases introduce fundamental challenges to any existing system designed around the journal article content model. These challenges fall into two areas:
- Structure. Reference works and databases can have complex nested substructures and there is a need for granular identification of these content substructures along with a mechanism for recording the relationship between them (e.g. “part-of” relationships between sub-section, section and chapter divisions, as well as “previous-next” navigational relationships between entries).
- Versioning. Unlike most journals, many reference books and databases change over time. To properly support archiving and perpetual access business models, there is a need to identify and maintain previous versions of reference content.
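As a rough illustration of what these two requirements mean at the data level, the sketch below gives every fragment its own identifier, records "part-of" and "previous-next" links explicitly, and keeps earlier versions of the text rather than overwriting them. The `Fragment` class, its field names and the dotted identifier scheme are all hypothetical, invented for this example; they are not the CrossRef DOI model or any particular platform's design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Fragment:
    """One identifiable piece of a reference work (chapter, section, entry)."""
    frag_id: str                          # hypothetical identifier scheme
    title: str
    part_of: Optional[str] = None         # id of the enclosing fragment
    previous_id: Optional[str] = None     # "previous-next" navigation links
    next_id: Optional[str] = None
    versions: List[str] = field(default_factory=list)  # oldest first

    def revise(self, text: str) -> int:
        """Store a new version and return its 1-based version number."""
        self.versions.append(text)
        return len(self.versions)

    def text_at(self, version: int) -> str:
        """Return the text as it stood at the given 1-based version."""
        return self.versions[version - 1]


def ancestors(store: Dict[str, Fragment], frag_id: str) -> List[str]:
    """Follow "part-of" links from a fragment up to the top of the work."""
    chain = []
    parent = store[frag_id].part_of
    while parent is not None:
        chain.append(parent)
        parent = store[parent].part_of
    return chain


store = {
    "ch1": Fragment("ch1", "Chapter 1"),
    "ch1.s1": Fragment("ch1.s1", "Section 1.1", part_of="ch1",
                       next_id="ch1.s2"),
    "ch1.s2": Fragment("ch1.s2", "Section 1.2", part_of="ch1",
                       previous_id="ch1.s1"),
    "ch1.s1.a": Fragment("ch1.s1.a", "Sub-section 1.1.a", part_of="ch1.s1"),
}
store["ch1.s1.a"].revise("First edition text")
store["ch1.s1.a"].revise("Revised text")
```

Here `ancestors(store, "ch1.s1.a")` walks up through the section to the chapter, and `text_at(1)` still returns the first edition text after the revision. A journal-article model typically has no equivalent of either operation.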
Both of these areas require support from the core architecture of any publishing system; without such support publishers have to resort to ugly work-arounds such as coercing all content into journal-article-shaped chunks. Such work-arounds fundamentally compromise the end-user experience and ultimately risk devaluing the publisher brand.
Furthermore, the technology stack commonly used to build a typical journals system does not necessarily provide a good base on which to build support for the structural and versioning requirements outlined above. This is because the mixture of technologies used tends to match closely the typical structure of a journal article.
In an article, the metadata and content are two separable parts which map nicely to a structured metadata store (typically SQL or RDF) plus a full text retrieval engine (Lucene/Solr, Autonomy, etc). Although the full text may contain semantic annotations such as chemical or gene markup, the homogeneous nature of the content means that a single metadata schema can be devised which will fit the whole collection well.
However, when publishers want to add non-journal content to the system, this simple separation of content and metadata will no longer suffice. Reference and book content is much more demanding in terms of hierarchical structure and navigation. Many hierarchical structures in reference books cannot easily (if at all) be modelled using conventional relational metadata; these structures fit much more naturally and easily into native XML. Trying to create a single relational metadata schema capable of modelling all hierarchies across the full variety of reference and book content that publishers produce is an impossible task.
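To make the point about native XML concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The element names and ids are hypothetical, and real reference works use far richer schemas. The hierarchy and reading order are carried by the nesting of the markup itself, so "part-of" and "previous-next" can be derived without a separate relational schema for each kind of book.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for illustration only.
BOOK = """
<book id="b1">
  <chapter id="c1">
    <section id="s1">
      <entry id="e1">Aardvark</entry>
      <entry id="e2">Abacus</entry>
      <entry id="e3">Abalone</entry>
    </section>
  </chapter>
</book>
"""

root = ET.fromstring(BOOK)

# ElementTree keeps no parent pointers, so build a child -> parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}


def part_of_chain(element):
    """Ids of every enclosing element, innermost first."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element.get("id"))
    return chain


def neighbours(element):
    """(previous id, next id) among siblings; None at either end."""
    siblings = list(parent_of[element])
    i = siblings.index(element)
    prev_id = siblings[i - 1].get("id") if i > 0 else None
    next_id = siblings[i + 1].get("id") if i + 1 < len(siblings) else None
    return prev_id, next_id


entry = root.find(".//entry[@id='e2']")
```

Adding a deeper level of nesting, or a new kind of division, changes the document but not these two functions.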
The structural problems of adding books and reference content to systems designed to separate full text and metadata can be broken down into two distinct areas:
- Fragile schema problem. The problem of updating conventional database schemas as requirements (and the real world) change is often called the fragile schema problem. Because the design of the database must be fixed at the outset, any change to the system, whether to accommodate extra metadata, new content types or changes to business process, becomes costly and risk-laden. This is because assumptions about the structure or schema of a database tend to be hard-wired into all the software written to talk to the database.
- Divorce of metadata from full text. Systems which use RDF or SQL databases for metadata suffer from a fundamental structural weakness, as it is normally impossible to issue queries which examine both the full text and metadata within a single request. RDF suffers particularly in this respect as it has no support for XML mixed content; full text search plays no part in the RDF world view. Furthermore, such systems often require content to be ‘loaded’ into several different internal content stores (e.g. once for the full text index, once to ‘strip’ metadata to the RDF/SQL store). This builds a structural inefficiency into the system, as content must be found, queried and updated in multiple locations.
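The second problem can be sketched in a few lines of Python. The sample records are invented. The metadata store is an in-memory SQLite table, and a toy inverted index stands in for a real full text engine such as Lucene/Solr. The content is loaded twice, and a question that spans both stores has to be answered by two queries whose results are joined in application code.

```python
import sqlite3

# Hypothetical sample content: (id, content type, year, full text).
DOCS = [
    ("a1", "article", 2008, "protein folding in yeast"),
    ("r1", "reference", 2009, "yeast: a single-celled fungus"),
    ("r2", "reference", 2009, "zinc: a chemical element"),
]

# Store 1: metadata "stripped" into a relational table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE meta (id TEXT PRIMARY KEY, type TEXT, year INTEGER)")
db.executemany("INSERT INTO meta VALUES (?, ?, ?)",
               [(doc_id, ctype, year) for doc_id, ctype, year, _ in DOCS])

# Store 2: a toy inverted index standing in for a full text engine.
index = {}
for doc_id, _, _, text in DOCS:
    for word in text.replace(":", " ").split():
        index.setdefault(word, set()).add(doc_id)


def search_split_stores(word, content_type):
    """Answer 'full text contains word AND type = content_type'."""
    text_hits = index.get(word, set())                        # query 1
    rows = db.execute("SELECT id FROM meta WHERE type = ?",   # query 2
                      (content_type,))
    meta_hits = {row[0] for row in rows}
    return sorted(text_hits & meta_hits)                      # join in app code
```

Every update to a document also has to be applied to both stores, which is where the structural inefficiency comes from.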
When we design the architecture for our publishing platforms we like to think hard about these problems. Often we conclude that XML database technology (such as MarkLogic Server) provides the best solution because it allows content to be stored and queried in a single place. Using an XML database removes the artificial split between content and metadata inherent in conventional journals systems and allows us to build search queries across an entire collection of content. This in turn helps to drive discovery and usage by building deeper links between related content and combining and assembling content from an entire collection in new ways.
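As a minimal sketch of the single-store idea, the example below keeps metadata and full text in the same XML documents and tests both in one pass. It uses Python's standard `xml.etree.ElementTree` purely as a stand-in: it is not MarkLogic Server's API, and a real XML database would express the same query in its own query language and answer it from indexes rather than by scanning. The element names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical collection: metadata and full text live in the same document.
COLLECTION = ET.fromstring("""
<collection>
  <doc id="a1" type="article"><year>2008</year>
    <body><p>Protein folding in <organism>yeast</organism>.</p></body></doc>
  <doc id="r1" type="reference"><year>2009</year>
    <body><p><organism>Yeast</organism>: a single-celled fungus.</p></body></doc>
  <doc id="r2" type="reference"><year>2009</year>
    <body><p>Zinc: a chemical element.</p></body></doc>
</collection>
""")


def search_single_store(word, content_type):
    """One pass over one store, testing metadata and full text together."""
    hits = []
    for doc in COLLECTION.iter("doc"):
        # itertext() flattens mixed content, so inline markup such as
        # <organism> does not hide words from the (substring) text match.
        text = "".join(doc.find("body").itertext()).lower()
        if doc.get("type") == content_type and word.lower() in text:
            hits.append(doc.get("id"))
    return hits
```

There is one copy of each document to load, query and update, and the semantic markup inside the text remains available to the same query.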
However, we also realise that different clients have different needs and that it is critical to prioritise meeting business goals ahead of making specific technology choices. The motivation for our overall technical architecture choice in a given project is simple: to learn the lessons of past systems and ensure we choose the best possible technology basis for any new system. Understanding the problems publishers have faced in trying to adapt legacy journals systems to the more challenging world of books and reference helps us to make the most appropriate technology choices for future publishing platforms.
