In my last post I introduced you to the universal translator. We left off with a few open issues: how to get the data we pulled back into a format that would be workable across multiple sources, and how to handle the actual comparison function itself. Thankfully we’ve come up with a solution, and that is to push all the data into nested arrays. So, within CLOCK, a translated element would look something like this:
- Description
  - Description 1
  - Description 2
  - Description x
- ISBN
  - ISBN 1
  - ISBN x
- Creator
  - Creator 1
  - Creator x
Hopefully you get the idea. This guarantees that we will always find the data in the same format when we come to do our comparison.
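To make the shape concrete, here is a minimal sketch of what one translated element might look like as a nested structure. The field names and values are invented for illustration; they are not CLOCK's actual data or code (the live demo runs on PHP, but the idea carries over directly).

```python
# A translated element: one key per category of data, each holding a list
# (the "nested array"), however many values the source record supplied.
# All values here are made up purely to show the shape.
translated = {
    "Description": ["A history of computing", "Hardback edition"],
    "ISBN": ["9780131103627", "0131103628"],
    "Creator": ["Kernighan, B.", "Ritchie, D."],
}

# Because every translated element uses the same keys, downstream code can
# always find each category of data in the same place.
for field, values in translated.items():
    print(field, values)
```

The point of the structure is uniformity: no matter which source a record came from, the comparison step can iterate the same keys in the same way.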
The real meat of this particular module is combining our translation function across multiple data sources, giving us the ability to compare against a default source and show where the data differs. So, say we use Cambridge as the default source: we’d compare every subset of data from Harvard and Open Library against every subset of data in Cambridge and output any items of data that are not present.
It sounds complex but, with the translated data in a standard format, all we need to do is provide a chunk of data, a list of existing sources, and the data indexes to expect, and away our function goes. You can see the output here: http://clock.lncd.org/comparemydata/compare.php (use the links to change the default institute).
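The comparison described above can be sketched in a few lines. Everything here is an assumption for illustration: the function name `compare_sources`, the field list, and the sample records are all invented, not CLOCK's real API or data.

```python
def compare_sources(default, others, fields):
    """For each non-default source, collect values in the given fields
    that are not present in the default source's nested arrays."""
    differences = {}
    for name, data in others.items():
        for field in fields:
            missing = [v for v in data.get(field, [])
                       if v not in default.get(field, [])]
            if missing:
                differences.setdefault(name, {})[field] = missing
    return differences

# Invented sample records in the nested-array format.
cambridge = {"ISBN": ["9780131103627"], "Creator": ["Kernighan, B."]}
harvard = {"ISBN": ["9780131103627", "0-13-110362-8"],
           "Creator": ["Kernighan, Brian"]}
open_library = {"ISBN": ["9780131103627"], "Creator": ["Kernighan, B."]}

diffs = compare_sources(cambridge,
                        {"Harvard": harvard, "Open Library": open_library},
                        ["ISBN", "Creator"])
print(diffs)
```

Harvard’s hyphenated ISBN and differently formatted author name get flagged, while Open Library matches the default exactly and so produces no entries; this is exactly the kind of non-standardised data the real output surfaces.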
What can we take away from this? Well, the comparison is rather broad: you’ll notice it identifies grammatical issues that might not immediately be apparent, like errant full stops. It also highlights issues where we have non-standardised data, for example authors being stored in three different ways, or ISBNs carrying additional characters that are not needed. Yes, we could filter these out to give a more direct comparison, but that’s not the point. We need to surface them and let cataloguers decide whether they are genuine issues and alter their databases accordingly.
Which brings us neatly to the “Where next?” section of this post: databases! At the moment we’re running this off flat files (comma-separated CSV documents). I want to start looking at database structures that would support the local indexing we’re after, as well as our translation and set-up files. Exciting times all round! There’s certainly more to come from CLOCK yet!
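For readers curious how a flat CSV file maps onto the nested-array format, here is a small sketch. The `field,value` layout and the sample rows are pure assumptions about how such a file might be laid out; CLOCK's actual files may well differ.

```python
import csv
import io

# Assumed flat-file layout: one field/value pair per row. A real file would
# be opened from disk; io.StringIO stands in for it here.
raw = io.StringIO(
    'field,value\n'
    'ISBN,9780131103627\n'
    'Creator,"Kernighan, B."\n'
    'Creator,"Ritchie, D."\n'
)

# Fold the rows into the nested-array shape used throughout this post:
# repeated fields accumulate into one list per key.
record = {}
for row in csv.DictReader(raw):
    record.setdefault(row["field"], []).append(row["value"])

print(record)
```

Moving from flat files to a proper database would replace this folding step with queries, but the nested-array shape handed to the comparison code could stay exactly the same.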