In our adventures with CLOCK we’ve taken a look at a number of open data endpoints. The prototype is being built on the philosophy of getting as many and as varied sources of data into it as possible, the idea being that we could then take a lateral view of a given record across all of these sources. What does a record for “Watchmen” by “Alan Moore” look like in Lincoln? How similar is this to the record held by Cambridge? What about the British Library? This is the basic goal for our prototype: a tool that can compare and note similarities and differences in bibliographic data.
In theory, that should be relatively straightforward; however, we’re finding that in practice our ideas on how this could be done are possibly somewhat flawed. The initial thought was that we could probably look at currently available SPARQL endpoints, including the Cambridge catalogue and the British Library. While we have something that kinda/sorta/maybe works, we’re finding that the process of querying the SPARQL endpoints leaves something to be desired. They’re not quick; there are dissimilarities between the schemas; it’s not always obvious where the endpoint is.
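To make the schema problem concrete, here is a minimal sketch of what querying two endpoints for the same title might involve. It is not the CLOCK code. The source names, the choice of Dublin Core predicates per source and the function names are all assumptions for illustration; the real Cambridge and British Library endpoints may model titles quite differently. The only parts taken from a standard are the `query` parameter and the SPARQL JSON results layout (`results.bindings`).

```python
import json
import urllib.parse
import urllib.request

# Hypothetical mapping: each source may use a different predicate for "title".
TITLE_PREDICATES = {
    "source_a": "http://purl.org/dc/terms/title",
    "source_b": "http://purl.org/dc/elements/1.1/title",
}


def build_title_query(predicate, title, limit=10):
    """Build a SPARQL SELECT for resources whose title matches exactly."""
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?title WHERE { "
        f"?item <{predicate}> ?title . "
        f'FILTER(STR(?title) = "{escaped}") '
        f"}} LIMIT {int(limit)}"
    )


def flatten_bindings(results):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    rows = results.get("results", {}).get("bindings", [])
    return [{var: cell["value"] for var, cell in row.items()} for row in rows]


def run_query(endpoint_url, query, timeout=30):
    """Send the query over HTTP GET and return flattened rows.

    endpoint_url is whatever address the data provider documents;
    none is hard-coded here.
    """
    params = urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        f"{endpoint_url}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return flatten_bindings(json.load(response))
```

Even in this toy form, every new source needs its own entry in the predicate mapping, and every search costs one slow network round trip per endpoint.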
This has led us to question whether our approach in using SPARQL is right. If we’re aggregating content, is there a better way? I’m starting to see that the initial simple idea is potentially vastly more complex than simply hitting a number of endpoints and building content ad hoc. We’re now talking localised indexes, and I’m going to have to take a look at linked data approaches.
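Once records are held locally, the lateral comparison itself is the easy part. A rough sketch, assuming each source's record has already been reduced to a flat dictionary of field names and values (the field names and the normalisation rule here are assumptions, not the prototype's actual design):

```python
def normalise(value):
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(str(value).lower().split())


def compare_records(records):
    """Compare one work's records, keyed by source name.

    Returns (agreed, differing). A field is "agreed" only when every
    source supplies it and the normalised values are identical.
    Otherwise the per-source values are listed under "differing".
    """
    fields = sorted({field for record in records.values() for field in record})
    agreed, differing = {}, {}
    for field in fields:
        values = {source: record.get(field) for source, record in records.items()}
        distinct = {normalise(v) for v in values.values() if v is not None}
        if len(distinct) == 1 and None not in values.values():
            agreed[field] = next(iter(values.values()))
        else:
            differing[field] = values
    return agreed, differing


# Invented example records, not real catalogue data.
records = {
    "lincoln": {"title": "Watchmen", "creator": "Moore, Alan"},
    "cambridge": {"title": "watchmen ", "creator": "Alan Moore", "extent": "1 v."},
}
agreed, differing = compare_records(records)
```

With these made-up records the titles count as agreeing, while `creator` (name order) and `extent` (missing from one source) are reported as differences. Real catalogue data would need far smarter normalisation than lower-casing.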
[...] A blog post by Andrew Beeken of the JISC CLOCK project reports dissatisfaction with SPARQL endpoints and linked data, and provoked responses from other users of linked data: “What is simple in SQL is complex in SPARQL (or at least what I wanted to do was) … You see an announcement about Linked Data and don’t know whether to expect a SPARQL endpoint, or lots of embedded RDF.” Chris Keene [...]
[...] data to make the application work. It isn’t necessarily the only way of using linked data (anyone for federated SPARQL queries?) but I think that ‘crawl, index, analyse’ is an approach to building applications we [...]
I posted this on Twitter but worth reposting here I think.
There are a number of approaches to consuming linked data described in the Linked Data Book by Tom Heath and Christian Bizer – and of particular relevance is the section on architectures for linked data apps – http://linkeddatabook.com/editions/1.0/#htoc84
The three architectural patterns described are:
The Crawling Pattern
On-the-fly dereferencing
Query federation pattern
The book goes on to reference http://dx.doi.org/10.1007/s13222-010-0021-7 (paywall but think same as http://www2.informatik.hu-berlin.de/~hartig/files/Hartig_QueryingLD_DBSpektrum_Preprint.pdf) for more detail.
I haven’t looked at this second article in detail, but I think the descriptions of the three patterns in Heath & Bizer are worth looking at.
I nodded a lot when reading both the post and the comments so far.
I too played with the Cambridge and BL SPARQL endpoints for the Discovery competition (until moving on to something else) and found: what is simple in SQL is complex in SPARQL (or at least what I wanted to do was); it was slow; the two datasets were so different that trying to use them together was non-trivial.
I’ve also noticed there are two sides of Linked Data / RDF: SPARQL, and processing Linked Data embedded in URIs and webpages (example of the latter: http://graphite.ecs.soton.ac.uk/#dump ). Perhaps it is me, but the two can feel like separate worlds. You see an announcement about Linked Data and don’t know whether to expect a SPARQL endpoint, or lots of embedded RDF.
I was about to say something long-winded but realised Ed said it better: “SPARQL seems most useful for our use context as a tool to describe an entity rather than as a means of discovery.”
So, to use a Library example, RDF might be a good replacement for MARC21, but, like MARC, you would then expect to index it elsewhere (solr and others) for discovery.
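As a rough illustration of "index it elsewhere", here is a toy in-memory inverted index over the literal values of harvested triples. It is only a stand-in for a real search engine such as Solr; the class and function names are made up for this sketch, and the example URIs are placeholders.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class LocalIndex:
    """Maps each token to the set of subjects whose literals contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_triples(self, triples):
        """Index (subject, predicate, literal) triples by the literal's words."""
        for subject, _predicate, literal in triples:
            for token in tokenize(literal):
                self._postings[token].add(subject)

    def search(self, query):
        """Return subjects matching every token in the query (AND search)."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        matches = [self._postings.get(token, set()) for token in tokens]
        return set.intersection(*matches)


index = LocalIndex()
index.add_triples([
    ("http://example.org/a/1", "title", "Watchmen"),
    ("http://example.org/a/1", "creator", "Alan Moore"),
    ("http://example.org/b/7", "title", "V for Vendetta"),
    ("http://example.org/b/7", "creator", "Alan Moore"),
])
hits = index.search("watchmen moore")
```

Discovery then becomes a local lookup, and the RDF source is only needed again to describe an item once it has been found.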
@Chris – totally! This is the direction we’re looking at heading in now, creating indexed records to streamline the search. I will say I’m a bit of an open data virgin, so a lot of the discoveries I’m making on this project are real eye openers!
This is something I tried here: http://www.aurochs.org/lodopac/lodopac.php
SPARQL was fine for retrieving known items/elements but less so for actual searching, the speed being crushingly slow, at least the way I did it.
As we discussed, we might be in danger of slipping into an old pit that libraries have only recently dug themselves out of: federated search. Multiple SPARQL or even JSON REST/API endpoints being queried (a)synchronously begins to look a lot like multiple Z39.50 connections.
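The federated shape being described looks roughly like the sketch below. The `fetchers` argument is hypothetical: a dictionary mapping a source name to any function that takes a query and returns results, whether that wraps SPARQL, a JSON API or Z39.50.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def federated_search(fetchers, query):
    """Send the same query to every source at once.

    Returns (results, errors), each keyed by source name. The call
    does not return until the slowest source has answered or failed.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        futures = {pool.submit(fetch, query): name
                   for name, fetch in fetchers.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # one broken source should not sink the rest
                errors[name] = str(exc)
    return results, errors
```

Running the requests in parallel helps, but the user still waits on the slowest endpoint, and the caller still has to reconcile differently shaped results. That is the same trade-off that pushed libraries away from federated search.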
SPARQL seems most useful for our use context as a tool to describe an entity rather than as a means of discovery.
[...] post on the CLOCK blog from me: Who watches the open bibliography data standards? [...]