(ADS) space is the place

It’s another typically busy day here at the Archaeology Data Service (a UK digital archive for archaeological and heritage data), and despite it being a Friday there are still plenty of tasks to perform before I can head off for the weekend. I’ve got a myriad of roles and responsibilities here, so I thought I’d keep a running diary of what I think constitutes a typical day.

First, the most important task: making the tea

Morning tea: note ADS themed badge on tea cosy

Next, a quick job: checking the metadata for a few spreadsheets that came in overnight. Metadata helps us understand the content of a file and is particularly useful for tabular data that may rely on codes or have cryptic headings. In the example I’m looking at, a file recording flint, there are some headings:

  • dent
  • cor

Turns out these are

  • dent = denticulate
  • cor = extent of cortex

Within these fields codes are used, so in “dent”:

  • fl=flake
  • fr=frag
  • mb=microburin
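
As a rough illustration of why this metadata matters, the lookups above can be written down as a small codebook and used to expand a coded cell. This is only a minimal Python sketch: the names `HEADINGS`, `DENT_CODES` and `expand` are made up for the example and are not part of any ADS system.

```python
# Hypothetical codebook, built only from the depositor's metadata quoted above.
HEADINGS = {"dent": "denticulate", "cor": "extent of cortex"}
DENT_CODES = {"fl": "flake", "fr": "frag", "mb": "microburin"}

# Which columns have a documented code list (assumption: only "dent" does here).
CODEBOOKS = {"dent": DENT_CODES}


def expand(heading, value, codebooks=CODEBOOKS):
    """Return (full heading, decoded value).

    Values in columns without a code list pass through unchanged;
    unknown codes in a coded column are flagged rather than guessed.
    """
    full_heading = HEADINGS.get(heading, heading)
    if heading not in codebooks:
        return full_heading, value
    key = value.strip().lower()
    decoded = codebooks[heading].get(key, f"UNDOCUMENTED({value})")
    return full_heading, decoded


print(expand("dent", "mb"))
print(expand("dent", "xx"))
```

Flagging an unknown code, rather than silently passing it through, is the point: anything that comes back as undocumented is a question to send to the depositor.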

Most things have been recorded; however, we need to clarify the following headings in a spreadsheet of charcoal:

  • Betula_sp_
  • Salix_Popu
  • Betula__S

Although it’s obvious these relate to tree names, what’s the difference between “Betula_sp” and “Betula__S”?

Next up it’s a meeting to discuss progress on the HERALD project – the redevelopment of the OASIS form. The new form delivers a simpler workflow while also offering an increased range of options for recording event information (human remains, for example). Something I’m also looking at is working with the good people at FISH to incorporate a number of SKOS’d word-lists into the new system, and also creating a small number of our own to cover specific aspects of archaeological investigation captured by OASIS. We’re also looking at incorporating OS map products (via WMS) to increase the accuracy and speed of spatial recording. An agreement via Historic Environment Scotland means we’ve got permission to use these services (for Scotland) via an OSMA licence. Next step: clarifying matters for the rest of the UK.

There’s then the small matter of sending off our monthly reports on progress to Historic England and Historic Environment Scotland, identifying any risks and issues that need to be addressed by the project team, and making sure the project is running on schedule.

The meeting went on for some time, so just before lunch there’s time to catch up on the cricket score (England 6-down, Moeen Ali on 10*) and to finish a small task compiling multi-lingual vocabularies for the ArchAIDE project. This is actually pretty exciting. Learning from the methodology and using the tools developed for the ARIADNE project by the Hypermedia Research Group at the University of South Wales, we’ve looked to create a neutral spine to which partners from across Europe could map terms used in their pottery recording traditions. As well as being linguistic, the mapping is also conceptual and aims to use the power of SKOS hierarchies – explained in a post I wrote here.

We’ve made good progress, and Spanish, Catalan, English, Italian and German (the languages of the main partners) have all been implemented. Out of the blue, I received a lovely email from a ceramicist who volunteered an unpublished glossary of French-English-German terminologies. Having a legendary command of foreign languages**, I’m rationalizing the French terms to our neutral spine. So, for example, there are straightforward translations such as “bougeoir” = candlestick, but also mappings of very granular classifications to broad terms we can all agree on, for example:

  • bassine
  • bol
  • coupe
  • écuelle
  • jatte
  • terrine
  • vasque

All map as bowls (vessels). So let’s hope it turns out tres bien.
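
The many-to-one mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concept labels and the names `FRENCH_TO_SPINE`, `spine_concept` and `terms_for` are invented for the example, and the real work is done with SKOS tools and identifiers that aren’t shown here.

```python
# Hypothetical labels standing in for neutral-spine concepts
# (the real spine uses SKOS concepts, not plain strings).
BOWL = "bowl (vessel)"
CANDLESTICK = "candlestick"

# French terms from the examples above, each mapped to one spine concept.
FRENCH_TO_SPINE = {
    "bougeoir": CANDLESTICK,
    "bassine": BOWL,
    "bol": BOWL,
    "coupe": BOWL,
    "écuelle": BOWL,
    "jatte": BOWL,
    "terrine": BOWL,
    "vasque": BOWL,
}


def spine_concept(term):
    """Look up the spine concept for a French term (None if unmapped)."""
    return FRENCH_TO_SPINE.get(term.strip().lower())


def terms_for(concept):
    """List every French term that has been mapped to a given concept."""
    return [term for term, mapped in FRENCH_TO_SPINE.items() if mapped == concept]


print(spine_concept("Jatte"))
print(terms_for(BOWL))
```

Going from term to concept answers “what is this?”, while going from concept back to terms shows how granular each partner’s own recording tradition is.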

**An “in” joke here; I’m notoriously bad at any language.

Following that, there’s a Skype call with some of the team from Digital Preservation at Oxford and Cambridge. They’re interested in learning more about our experiences of PDF/A, which some of you may have already read about in a previous Day of Archaeology blog. In short, despite the name, PDF/A has to be treated cautiously, and a number of issues about what it does to a file under the hood – compression, substitution of fonts and so on – have to be considered. At the ADS we’re focused on preserving data in the long term, and the choice of format (with its relative pros and cons) is key. It’s good to hear that the question marks over PDF/A are shared by other high-profile organisations outside of archaeology. As well as (hopefully) helping the team, the call was also a good chance to discuss a future collaboration, based on the practical realities of preserving PDFs, and reporting back to the Preservation community as a whole.

Finally, with England subsiding to a rather mediocre score of 353 all out (century for Stokes, Ali out for 16) it’s time to tackle the last task of the day….

One of my jobs is to think about our storage requirements, and an ever-present issue here is space. As we’re lucky enough to be part of the University of York, space in terms of our NFS is never that big an issue. We can simply ask (nicely). However, a significant part of being a trusted digital repository is not relying on copies of the data being held on one site. Put simply, if something truly bad were to happen in York, could someone retrieve the data from another copy? We can proudly say “yes”, and have had a long-term agreement with the UKDA (in Essex) to allow us to run a rolling backup (for more information see our Preservation Policy). Space with external parties comes at a cost, so we have to be sensible here, and think about how much space we really need, and try to model how much we expect our archive to grow.

In recent years we’ve grown significantly in terms of data being submitted (see below), and this has a lot to do with the increased volume of digital imagery.

Size of all accessions (bytes) by year at the ADS, 1996-2016

Our procedure is to convert raster images that are deposited with us to uncompressed TIF for long-term preservation (for background on why, see here), so it got me thinking about the inflation (in terms of bytes) from rasters deposited as JPG. I ran a quick(ish) query on our Object Management System (OMS), which holds technical metadata on all of our 2,000,000+ files, comparing the size of an original JPG with its TIF equivalent. The preliminary results show that over time the size of the JPG-TIF inflation has grown significantly:

  • In 1997 JPG-TIF inflation was on average x2
  • In 2007 JPG-TIF inflation was on average x3
  • In 2017 JPG-TIF inflation was on average x6

The reasons for this are doubtless the increase in quality of JPEG/RAW images coming out of digital cameras – put simply, the conversion to TIF removes the compression in the former, and there’s now more data to uncompress. I’d like to model this further at a future date, looking at the initial size/camera settings of the JPG/RAW, and thinking about how we cope with this increased digital footprint.
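
To make the arithmetic concrete, here’s a back-of-the-envelope Python sketch. It assumes an uncompressed TIF is roughly width × height × samples per pixel × bytes per sample (ignoring headers and embedded metadata). The function names and the example figures are made up for illustration; they are not values from the OMS.

```python
def uncompressed_tif_bytes(width, height, samples_per_pixel=3, bits_per_sample=8):
    """Approximate pixel-data size of an uncompressed TIF, in bytes.

    Assumption: ignores the TIF header, tags and any embedded metadata.
    """
    return width * height * samples_per_pixel * bits_per_sample // 8


def inflation(jpg_bytes, width, height):
    """Ratio of the approximate uncompressed TIF size to the original JPG size."""
    return uncompressed_tif_bytes(width, height) / jpg_bytes


# Hypothetical example: a 6000 x 4000 pixel, 8-bit RGB photo stored as a 12 MB JPG.
print(uncompressed_tif_bytes(6000, 4000))
print(inflation(12_000_000, 6000, 4000))
```

The TIF size depends only on pixel dimensions and bit depth, whereas the JPG size also depends on how hard the camera compressed the image, so the ratio will vary with both.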

However, for now the bell tolls and, with South Africa 23-1 but with a decent-looking batting lineup, it’s time to call it a day.

If you’ve made it this far, thanks for reading.