This is the end (again) – “final” project blog post

Posted on July 31st, 2012 by Paul Stainthorp

Here is a summary of the work undertaken as part of the CLOCK project, February–July 2012. As in previous projects, I don’t expect that this will really be the end of this work. We intend to carry on developing tools for publishing, consuming, and playing with Open Bibliographic Data at the University of Lincoln (via LNCD or elsewhere), and I expect us to take CLOCK ideas, data and code further over the coming months.

Thanks to all the people who have contributed to the CLOCK project.

1. Outputs: what has the project produced?

Our major tangible output is a library of code for interrogating and working with multiple, distributed sources of open bib data.

  1. The PHP source code for all of the prototypes produced by the CLOCK project is available from a GitHub repository: https://github.com/lncd/clock. This is an open, public, “live”, working version of the code, which may well be developed and improved over time. (A ‘snapshot’ of the code as it existed on 27 July 2012 has been archived here.) The public code is available to re-use under a GNU Affero General Public Licence. All of the software prototypes have been detailed in the following posts (an illustrative sketch of the kind of work they do appears after this list):
  2. A series of modified cognitive interviews was conducted with several university library cataloguers at Cambridge University Library and the University of Lincoln. The interviews have been summarised here and here. We captured narrated screen-capture videos for each of the interviews: these are being held in a private repository and could be mined and investigated further for similar projects.
  3. A number of further blog posts discussing:
  4. While it is not yet publicly accessible, we have taken significant steps toward establishing a permanent data.lincoln.ac.uk service at the University of Lincoln. We expect that by late August 2012 this service will be operational and will provide a gateway to the entirety of the bibliographic data and enhancement/manipulation tools produced by CLOCK and its predecessor project (Jerome).
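
To give a flavour of what the code library does, here is a minimal sketch of the kind of normalisation the prototypes perform: fetch a record from one source and map its native field names onto a common structure, so records from differently-shaped sources can be compared side by side. This is not code from the CLOCK repository – the endpoint URL, field names and helper functions are all invented for illustration, and the real prototypes are considerably more involved.

    <?php
    // Illustrative sketch only – not code from the CLOCK repository.
    // Fetch a record from a (hypothetical) JSON bib data endpoint and
    // map it into a simple common structure.

    function fetch_record($url)
    {
        $json = @file_get_contents($url);
        return ($json === false) ? null : json_decode($json, true);
    }

    // Each source gets its own small mapping from native field names
    // to common keys ('title', 'creator', 'isbn').
    function normalise($raw, $mapping)
    {
        $record = array();
        foreach ($mapping as $common => $native) {
            $record[$common] = isset($raw[$native]) ? $raw[$native] : null;
        }
        return $record;
    }

    // Hypothetical endpoint and field names, for illustration only.
    $raw = fetch_record('http://example.org/bibdata/record/123.json');
    if ($raw !== null) {
        print_r(normalise($raw, array(
            'title'   => 'dc:title',
            'creator' => 'dc:creator',
            'isbn'    => 'bibo:isbn',
        )));
    }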

2. Lessons learned

CLOCK reinforced for us that the constraints of a six-month project (effectively 4½ months of development time) are difficult to reconcile with the need for constant experimentation and development around bib data. We also identified the following points:

  1. The CLOCK model (of presenting different versions of the same bibliographic element for recombination) is feasible and can be modelled in software, albeit with a limited number of sources and in a limited time; it has potential as a real-world tool for use in libraries. A write-up of this approach can be found here: CLOCK – the localized index model
  2. Software developers and librarians need to be aware of, and realistic about, the limitations of particular bib data formats. For example, real-time querying of RDF can be very resource-intensive. Be clear about your needs up front. Not every shiny open data format is right for every shiny open data project!
    • Related: synchronous querying of SPARQL endpoints is not the way forward! (We characterised this blind alley as “federated searching for the 21st century”…) A sketch of the problem appears after this list.
  3. We are a long way from consistency of approach in the exchange of non-MARC bibliographic information. Every data source we approached in CLOCK required us to develop a different tactic, from first principles, to query, retrieve and manipulate it. Our blog posts on the idea of a ‘universal translator’ explore this problem further.
  4. Available data is often poorly and confusingly documented. We were regularly frustrated merely in understanding what a data source contained and how it was structured. I (PS) have argued throughout that without some form of centralised registry/gateway (“data.ac.uk/library”) – whether managed through external curation or by self-submission to a rigorous documentation standard – developers will waste time repeatedly interpreting the same data again and again. We believe that a national bib data portal would be a great help in centrally indexing data sources and cataloguing their respective schemas. We understand that not everyone agrees with this approach (“who pays?”) and we invite discussion of the alternatives!
  5. And a couple of practical ‘meta’ lessons about running agile software development projects in libraries:
    • Working with a number of developers based in different locations is hard. On a project like this it would be ideal to have a central location that could be used as a development hub. Other projects under the LNCD umbrella have found the same.
    • There is a risk of losing ground already gained if the outputs of previous projects (e.g. Jerome) are not properly maintained and curated; significant delays follow when previous work is not available. More rigorous use of development tools, including GitHub and Orbital, is helping to mitigate this in future.
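
To make the SPARQL lesson concrete, here is a minimal sketch of a synchronous SELECT against a remote endpoint. This is not code from the CLOCK prototypes, and the endpoint URL is a placeholder; the point is simply that the calling page blocks for the full round trip, so one slow or dead endpoint stalls the whole request – exactly the behaviour that pushed us towards local indexing of key fields instead.

    <?php
    // Illustrative sketch only. A synchronous SPARQL SELECT over HTTP:
    // the caller is blocked for the full round trip, and a slow or
    // unavailable endpoint stalls the whole page request.

    $endpoint = 'http://example.org/sparql'; // placeholder endpoint
    $query = 'SELECT ?title WHERE { <http://example.org/book/123> '
           . '<http://purl.org/dc/terms/title> ?title } LIMIT 1';

    $ch = curl_init($endpoint . '?query=' . urlencode($query));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10); // a dead endpoint would otherwise hang us indefinitely
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept: application/sparql-results+json'));

    $start = microtime(true);
    $body = curl_exec($ch);
    $elapsed = round(microtime(true) - $start, 2);
    curl_close($ch);

    if ($body === false) {
        echo "Endpoint failed after {$elapsed}s – and the user waited the whole time.\n";
    } else {
        $results = json_decode($body, true);
        echo "One triple retrieved, {$elapsed}s of wall-clock time.\n";
    }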

3. Opportunities and possibilities

We have identified possible continuations and extensions of the CLOCK work; these are detailed further below.

  1. Continued development of the CLOCK software as a tool for (a) faster querying of distributed bib data sources through local indexing of key fields, (b) translating disparate data formats into a common standard, (c) meaningfully presenting distributed bib data to the user, and (d) allowing a ‘cataloguer’ to select and recombine bibliographic elements to create a new record, which (e) is fed back as a new, original data source, making the process iterative, and (f) incorporates social and reputational components in a user’s selection between alternative data elements for the same resource. (A toy sketch of this recombination model appears after this list.)
  2. Discussion of the business case to libraries. What value does an alternative resource description model offer, and what would libraries gain through relinquishing some of the control over institutionally-owned catalogue data? Related: could we quantify and qualify the time spent on particular cataloguing / discovery activities using traditional LMSs and demonstrate possible efficiencies or increased quality in a new, distributed model? What arguments for the incorporation of open bib data in cataloguing will convince library managers to replace current practice?
  3. A more thorough examination of all the application functions that a cataloguer relies upon, through more extensive cognitive interviews and/or functional mapping processes with cataloguers at a range of institutions.
  4. Further investigation of best practice in documenting and describing our own published bib data in JSON/RDF. [Being picked up as part of the data.lincoln.ac.uk work, above]
  5. We also intend to submit for publication articles on a number of topics arising from CLOCK. Suitable journals and conferences have been identified and articles are being prepared on the following broad topics:
    • The approach of the CLOCK project in developing software for working with multiple, distributed sources of open bib data.
    • Expectations of truth and ‘trust’ in bibliographic data (“How has this assertion been derived about a work?”)
    • The potential of new models of resource description to save time and effort for libraries; techniques for analysing the efficiency of search and cataloguing workflows.
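
As a pointer to what continued development might look like, the sketch below is a toy version of the localized index / recombination model described in item 1 above (and in lesson 1). It is not CLOCK code: the sources, field names and data values are invented for illustration.

    <?php
    // Illustrative sketch only – a toy version of the localized index /
    // recombination model. For one work, the index holds alternative
    // versions of each bibliographic element, keyed by source; the
    // 'cataloguer' picks one version per element to compose a new record.

    $alternatives = array(
        'title' => array(
            'source_a' => 'Moby Dick',
            'source_b' => 'Moby-Dick; or, The Whale',
        ),
        'creator' => array(
            'source_a' => 'Melville, Herman',
            'source_b' => 'Herman Melville',
        ),
    );

    // The cataloguer's selections; in the full model these could be
    // informed by social/reputational signals attached to each source.
    $choices = array('title' => 'source_b', 'creator' => 'source_a');

    $new_record = array();
    foreach ($choices as $element => $source) {
        $new_record[$element] = $alternatives[$element][$source];
    }

    // The recombined record can itself be republished as a new data
    // source, making the process iterative.
    print_r($new_record);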

2 Responses to “This is the end (again) – “final” project blog post”

  1. [...] [A commentable version of this blog post can be found on the CLOCK project blog.] [...]

  2. A fascinating project which I’m happy to have been involved with. As well as exploring some interesting areas, CLOCK (and by proxy this post) has exposed and rationalized some of the frustrations and concerns I’ve developed working with open bibliographic data over the past couple of years. The new models for copy cataloguing are exciting and something I hope to see revisited.

    I’d strongly echo the requests for a national-level resource for open bib data, preferably built around an existing aggregation and as permissively licensed as feasible. Registries are useful, but sites such as the DataHub can at a pinch suffice. With regards to standards, we are all operating ahead of the LOC MARC21 transition initiative, but I’m firmly of the opinion that there will not be one preferred model or standard for bib data in the library sphere going forward.

    In terms of disappointments, documentation of data models for existing data sources was a bit of a barrier for the developers, who were from a non-bibliographic background. This is a bit of a shame given that the target audience of the Discovery project is a wider technical group. Cambridge’s COMET data was as guilty of this as anyone. I’d recommend that any future data release builds solid documentation of the data model into its planning.

    A few comments on the opportunities outlined:
    1) I’d like to see development continue somehow, especially around the proposed index model. I think such work would be necessary to drive any follow-on discussion and debate around points 2 and 3. The current development is impressive, but not yet a full realization of the original ideas.

    4) Looking forward to seeing this. In the post-Discovery-project landscape, (too) few institutions are doing open data under their own steam. A publicly realized understanding of the costs and rationale would be interesting.