Here is a summary of the work undertaken as part of the CLOCK project, February-July 2012. As in previous projects, I don’t expect that this will really be the end of this work. We intend to carry on developing tools for publishing, consuming, and playing with Open Bibliographic Data at the University of Lincoln (via LNCD or elsewhere), and I expect us to take CLOCK ideas, data and code further over the coming months.
Thanks to all the people who have contributed to the CLOCK project.
1. Outputs: what has the project produced?
Our major tangible output is a library of code for interrogating and working with multiple, distributed sources of open bib data.
- The PHP source code for all of the prototypes produced by the CLOCK project is available from a GitHub repository – https://github.com/lncd/clock. This is an open, public, “live”, working version of the code which may well be developed and improved over time. (A ‘snapshot’ of the code as it existed on 27 July 2012 has been archived here.) The public code is available to re-use under a GNU Affero General Public Licence. All of the software prototypes have been detailed in the following posts:
- A series of modified cognitive interviews were conducted with several university library cataloguers at Cambridge University Library and the University of Lincoln. The interviews have been summarised here and here. We captured narrated screen-capture videos for each of the interviews: these are being held in a private repository and could be mined or investigated further for similar projects.
- A number of further blog posts discussing:
- While it is not yet publicly accessible, we have taken significant steps toward establishing a permanent data.lincoln.ac.uk service at the University of Lincoln. We expect that by late August 2012 this service will be operational and will provide a gateway to the entirety of the bibliographic data and enhancement/manipulation tools produced by CLOCK and its predecessor project (Jerome).
2. Lessons learned
CLOCK reinforced to us that the constraints of a six-month project (effectively 4½ months of development time) are difficult to reconcile with the needs and possibilities for constant experimentation and development around bib data. We also identified the following points:
- The CLOCK model (of presenting different versions of the same bibliographic element for recombination) is feasible and can be modelled in software—albeit with a limited number of sources—even in a limited time; it has potential as a real-world tool for use in libraries. A writeup of this approach can be found here: CLOCK – the localized index model
- Software developers and librarians need to be aware of, and realistic about, the limitations of particular bib data formats. For example, real-time querying of RDF can be very resource intensive. Be clear about your needs up front. Not every shiny open data format is right for every shiny open data project!
- Related: synchronous querying of SPARQL endpoints is not the way forward! (We characterised this blind alley as “federated searching for the 21st century”…)
- We are a long way away from consistency of approach in the exchange of non-MARC bibliographic information. Every data source we approached in CLOCK required that we develop a different tactic from first principles to query, retrieve and manipulate/execute functions. Our blog posts on the idea of a ‘universal translator’ explore this problem further.
- Available data is often poorly and confusingly documented. We were regularly frustrated merely in understanding what a data source contained and how it was structured. I (PS) have argued throughout that without some form of centralised registry/gateway (“data.ac.uk/library”), whether it be managed through external curation or self-submission to a rigorous documentation standard, developers will waste time interpreting the same data again and again. We believe that a national bib data portal would be a great help in centrally indexing data sources and cataloguing their respective schemas. We understand that not everyone agrees with this approach (“who pays?”) and we invite discussion of the alternatives!
- And a couple of practical ‘meta’ lessons about running agile software development projects in libraries:
- Working with a number of developers based in different locations is hard. On a project like this it would be ideal to have a central location that could be used as a development hub. Other projects under the LNCD umbrella have found the same.
- There is a risk of losing ground already made if the outputs of previous projects (e.g. Jerome) are not correctly maintained and curated. This can lead to significant delays when previous work is not available. More rigorous use of development tools (including GitHub and Orbital) is helping to mitigate this risk in future.
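To make the first lesson above a little more concrete, here is a minimal Python sketch of the general idea: index records locally by a key field, present the different versions of each bibliographic element side by side, and let a cataloguer recombine them into a new record. This is an illustration only and is not taken from the CLOCK code (which is written in PHP); the source names, field names, sample records and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical records for one work, as if harvested from three imaginary sources.
RECORDS = [
    {"source": "source_a", "isbn": "9780000000001",
     "title": "An Example Work", "author": "Smith, J."},
    {"source": "source_b", "isbn": "9780000000001",
     "title": "An example work : a subtitle", "author": "Smith, Jane"},
    {"source": "source_c", "isbn": "9780000000001",
     "title": "An Example Work", "publisher": "Example Press"},
]


def build_index(records, key="isbn"):
    """Group records by a key field so that lookups need no remote query."""
    index = defaultdict(list)
    for record in records:
        index[record[key]].append(record)
    return index


def element_versions(index, key_value):
    """List each distinct version of every element and the sources asserting it."""
    versions = defaultdict(lambda: defaultdict(list))
    for record in index.get(key_value, []):
        for field, value in record.items():
            if field != "source":
                versions[field][value].append(record["source"])
    return {field: dict(values) for field, values in versions.items()}


def recombine(versions, choices):
    """Build a new record from the cataloguer's chosen version of each element."""
    record = {}
    for field, value in choices.items():
        # Only allow values that at least one source actually asserts.
        if value not in versions.get(field, {}):
            raise ValueError(f"no source asserts {field}={value!r}")
        record[field] = value
    return record


index = build_index(RECORDS)
versions = element_versions(index, "9780000000001")
new_record = recombine(
    versions,
    {"title": "An Example Work", "author": "Smith, Jane", "publisher": "Example Press"},
)
```

Here `versions["title"]` maps each competing title to the list of sources that assert it, which is the kind of information a cataloguer (or a reputation-based rule) could use when choosing between alternatives. Everything is answered from the local index, so no endpoint is queried synchronously at lookup time.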
3. Opportunities and possibilities
We have identified possible continuations and extensions of the CLOCK work; these are detailed further below.
- Continuation of development of the CLOCK software as a tool for (a) faster querying of distributed bib data sources through local indexing of key fields, (b) translating disparate data formats into a common translation standard, (c) meaningfully presenting distributed bib data to the user, (d) allowing a ‘cataloguer’ to select and recombine bibliographic elements to create a new record, which (e) feeds back into a new original data source, making the process iterative, and (f) incorporates social and reputational components in a user’s selection between alternative data elements for the same resource.
- Discussion of the business case to libraries. What value does an alternative resource description model offer, and what would libraries gain through relinquishing some of the control over institutionally-owned catalogue data? Related: could we quantify and qualify the time spent on particular cataloguing / discovery activities using traditional LMSs and demonstrate possible efficiencies or increased quality in a new, distributed model? What arguments for the incorporation of open bib data in cataloguing will convince library managers to replace current practice?
- A more thorough examination of all the application functions that a cataloguer relies upon through more extensive cognitive interviews and/or functional mapping processes with cataloguers at a range of institutions.
- Further investigation of best practice in documenting and describing our own published bib data in JSON/RDF. [Being picked up as part of the data.lincoln.ac.uk work, above]
- We also intend to submit for publication articles on a number of topics arising from CLOCK. Suitable journals and conferences have been identified and articles are being prepared on the following broad topics:
- The approach of the CLOCK project in developing software for working with multiple, distributed sources of open bib data.
- Expectations of truth and ‘trust’ in bibliographic data (“How has this assertion been derived about a work?”)
- The potential of new models of resource description to save time and effort for libraries; techniques for analysing the efficiency of search and cataloguing workflows.