2b or not 2b – Aiming for PDF/A

It doesn’t seem five minutes since I was working on the Ipswich backlog archive and writing my 2015 Day of Archaeology post and yet, here I am again – a year later, relating the ins and outs of a day in the life of a Digital Archivist at the Archaeology Data Service.

Being in the mood for reminiscing, having just finished reading my colleague Tim Evans’ blog post about his ten-year anniversary working for the ADS, I was about to launch into a review of everything that I’ve done in the year since my last DoA post. However, remembering in the nick of time that this is in fact the ‘DAY of Archaeology’, you have been spared that. And so, to a tale of two tasks.

My first task of the day will be continuing the process of accessioning and archiving the 482 files that have been submitted to the ADS via OASIS.

Along with providing information about archaeological events, archaeologists are encouraged to upload fieldwork reports to the OASIS data capture form; these are then archived with the ADS and added to the Library of Unpublished Fieldwork Reports (or Grey Literature Library) for wider re-use. The transfer process is a fairly complex and involved one, which is described in the blog article ‘Opening up the Grey Literature Library’, so I won’t go into it here. What I can tell you is that much of my day will be spent chasing after the precious, and often elusive, validating PDF/A.

Of the 482 files that form this month’s OASIS transfer batch, 462 have been submitted as PDF files, though the PDF versions and the software and methods of creation vary. For instance, this batch contained:

  • 14 PDF 1.3 files;
  • 125 PDF 1.4 files;
  • 99 PDF 1.5 files;
  • 155 PDF 1.6 files;
  • 47 PDF 1.7 files;
  • 1 PDF/A-1a file and
  • 21 PDF/A-1b files
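For anyone curious how a tally like this might be produced outside of DROID, here is a minimal Python sketch. It assumes that the version declared in a file’s `%PDF-x.y` header is good enough for a rough count (a `/Version` entry in the document catalogue can override it, which this ignores). The function names are purely illustrative and are not part of DROID or of our own tooling; the counts are the ones listed above.

```python
import re
from collections import Counter

# Assumption: the version in the "%PDF-x.y" header is good enough for a rough tally.
HEADER = re.compile(rb"%PDF-(\d\.\d)")


def pdf_version(data: bytes) -> str:
    """Return the version declared in the file header, or 'unknown'."""
    match = HEADER.search(data[:1024])  # the header sits near the start of the file
    return match.group(1).decode("ascii") if match else "unknown"


def tally_versions(blobs) -> Counter:
    """Count how many files declare each PDF version."""
    return Counter(pdf_version(blob) for blob in blobs)


# Counts reported for this month's batch (taken from the list above).
BATCH = {
    "PDF 1.3": 14,
    "PDF 1.4": 125,
    "PDF 1.5": 99,
    "PDF 1.6": 155,
    "PDF 1.7": 47,
    "PDF/A-1a": 1,
    "PDF/A-1b": 21,
}
total = sum(BATCH.values())  # 462, the number of PDFs submitted
```

In practice each `blob` would be the first kilobyte or so read from a file on disk, for example `open(path, "rb").read(1024)`.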

As our preservation format is PDF/A, all files that are not already PDF/A need to be migrated, while any that are submitted as PDF/A can be preserved as they are, saving us time… but are they really PDF/A?

The file profiling tool DROID and many validating tools will tell you they are, because they identify a PDF/A tag in the file’s XMP metadata. In fact, when you open a PDF file purporting to be PDF/A you also get a helpful blue banner at the top which states that the file complies with the PDF/A standard, but this may be deceptive: the file may still not verify in Adobe Acrobat.
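To show why a tag is only a claim, here is a rough Python sketch of what ‘identifying a PDF/A tag’ can amount to. It assumes the XMP packet is stored uncompressed in the file and simply looks for the `pdfaid:part` and `pdfaid:conformance` properties, written either as XML elements or as attributes. This is my own illustration, not a description of how DROID or any particular validator works internally, and `claimed_pdfa` is a made-up name.

```python
import re

# pdfaid:part and pdfaid:conformance may be written as elements
# (<pdfaid:part>1</pdfaid:part>) or as attributes (pdfaid:part="1").
PART = re.compile(rb"""pdfaid:part(?:>|=["'])\s*(\d)""")
CONFORMANCE = re.compile(rb"""pdfaid:conformance(?:>|=["'])\s*([A-Za-z])""")


def claimed_pdfa(data: bytes):
    """Return the flavour a file *claims* (e.g. 'PDF/A-1b'), or None.

    This only reads the label in the metadata. It does not check fonts,
    colour spaces, transparency or anything else the standard requires.
    """
    part = PART.search(data)
    if part is None:
        return None
    conformance = CONFORMANCE.search(data)
    level = conformance.group(1).decode("ascii").lower() if conformance else ""
    return f"PDF/A-{part.group(1).decode('ascii')}{level}"
```

A file can carry this label and still break the rules of the standard, which is why a claim found this way has to be followed by a full conformance check.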


According to Adobe Acrobat, only 15 of the 22 files purporting to be PDF/A were verified as such. As our procedure is to create PDF/A files that are verified by Adobe, the other 7 will need to be migrated along with the 440 ordinary PDFs, making 447 in all. Each migration attempt is followed by a PDF/A validation check, even though our migration software states whether the migration has been a failure or a success. Where we migrate files to PDF/A-2b, for example (if they don’t migrate to PDF/A-1b), we will still have to run Adobe’s ‘Verify Conformance’ feature again:




…not 2b

This is likely to take a while.
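The migrate-then-verify routine described above could be sketched roughly as follows. This is an outline under stated assumptions rather than our actual workflow code: `migrate` and `verify` are hypothetical stand-ins for the migration software and for Adobe’s ‘Verify Conformance’ check, which in reality are run through their own interfaces, and the order of target flavours simply follows the PDF/A-1b then PDF/A-2b fallback mentioned earlier.

```python
from typing import Callable, Optional, Sequence, Tuple

# Workload for this batch, using the figures given above.
pdfs_submitted = 462
verified_as_pdfa = 15
to_migrate = pdfs_submitted - verified_as_pdfa  # 447 files


def migrate_to_pdfa(
    source: str,
    migrate: Callable[[str, str], Optional[str]],  # hypothetical: returns output path or None
    verify: Callable[[str, str], bool],            # hypothetical: independent conformance check
    targets: Sequence[str] = ("PDF/A-1b", "PDF/A-2b"),
) -> Optional[Tuple[str, str]]:
    """Try each target flavour in turn; accept only a verified output."""
    for target in targets:
        output = migrate(source, target)
        if output is None:
            continue  # the migration tool itself reported a failure
        # Never trust the tool's own success message: always re-verify.
        if verify(output, target):
            return output, target
    return None  # needs manual attention
```

A return value of `None` marks the files that are neither 1b nor 2b, and those are the ones that end up taking the time.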

So, my second task, which will be done in and amongst the grey literature archiving, is to release the dozen or so Southampton archives that I have been working on – these form part of ‘Southampton’s Designated Archaeology Collections’.

These archives have gone through several stages of work and processes, from accessioning (which involves checking that the file formats and contents are readable and suitable for archiving and that the documentation provided with the data is sufficient to allow discovery, reuse and curation) to interface creation (which involves working to a template created for the Southampton archives, creating thumbnails and images, adding introductory text and ensuring that the files can be downloaded from our file system). Once the files have been accessioned and migrated, the documentation added to our databases and the interface completed, the work is checked by another Digital Archivist prior to release.

The release stage itself involves a few separate tasks including:

  • a final running of the file profiling tool (DROID) on the collection to ensure that all of the file format information we need is added to our database;
  • adding a ‘release date’ to our Collections Management System;
  • assigning (or minting) a Digital Object Identifier (DOI) (via DataCite) to each collection;
  • updating the ADS ‘Collections History’ page to include the new release;
  • checking the Dublin Core metadata and transferring that metadata to Archsearch;
  • creating links, where relevant, to/from the Grey Literature Library and the Geophysical Survey Database.

By the end of the day, therefore, if all goes to plan, there should be around a dozen more collections added to the 987 already available on our website!