A database specification

Databases are crucial for long-term preservation. In some cases, a specification exists for mapping the database content to an XML schema, producing one or more XML documents that make the content easy to understand in the future. One example is records management systems, which can be mapped to the CITS for Electronic Records Management Systems (CITS ERMS). But this is not always the case. There will always be databases that need to remain databases.

You might think this is an easy task: just save a database dump, restore it ten years later, and continue as before. I can assure you it is not that easy. Databases need to be transformed into a sustainable format to ensure they can be preserved long term. The eArchiving Building Block uses the SIARD standard (Software Independent Archiving of Relational Databases), developed by the Swiss Federal Archives (SFA) and now maintained by SFA and the DILCIS Board. With the help of the available tools, a database is transformed into the SIARD XML format.

But many other files can be hidden in a database in the form of BLOBs and/or CLOBs, which means that the total size of a database can be huge, and the number of files not suitable for transforming to XML enormous. So how do you transform a database whose content ranges from simple values to embedded files into an XML format, with the files extracted and referenced in the XML document? In the transformation, the database content itself becomes XML, and the BLOBs and CLOBs become files referenced in the XML document.
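To make the idea concrete, here is a minimal Python sketch of that kind of transformation: simple values become XML text, and each binary value is written to its own file and referenced from the XML. It is not SIARD and does not use any SIARD tool. The function name `export_table`, the element and attribute names, and the `lobs` folder are all invented for illustration, and SQLite stands in for whatever database is being archived.

```python
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path


def export_table(conn, table, out_dir):
    """Write one table to <out_dir>/<table>.xml, extracting BLOBs to files.

    Illustrative only: the element and attribute names are made up
    and do not follow the SIARD schema.
    """
    out_dir = Path(out_dir)
    lob_dir = out_dir / "lobs"  # hypothetical folder name
    lob_dir.mkdir(parents=True, exist_ok=True)

    # The table name is interpolated into the query, so it must be trusted.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cursor.description]

    root = ET.Element("table", name=table)
    for row_no, row in enumerate(cursor, start=1):
        row_el = ET.SubElement(root, "row")
        for col, value in zip(columns, row):
            cell = ET.SubElement(row_el, "cell", column=col)
            if value is None:
                # Mark NULL explicitly so it differs from an empty string.
                cell.set("null", "true")
            elif isinstance(value, bytes):
                # Binary content goes to its own file; the XML keeps
                # only a path relative to the output folder.
                lob_path = lob_dir / f"{table}_{col}_{row_no}.bin"
                lob_path.write_bytes(value)
                cell.set("file", lob_path.relative_to(out_dir).as_posix())
            else:
                cell.text = str(value)

    xml_path = out_dir / f"{table}.xml"
    ET.ElementTree(root).write(str(xml_path), encoding="utf-8",
                               xml_declaration=True)
    return xml_path
```

Even in this toy version, the XML document grows with the number of rows and the output folder grows with the number of extracted files, which is exactly where the size problem comes from.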

At the same time, we want to transfer this XML document with its files to an archive, which means we want to put it into an information package that includes some surrounding information, such as definitions and explanations of value lists. An information package also adds the possibility of controlling checksums, to confirm that what has been transferred is what has been received. This means that all files are referenced in two XML documents: first in the XML document produced by the transformation, and second in the information package's XML document.
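The checksum control can be sketched in a few lines of Python. This is only an illustration of the principle: a plain dictionary stands in for the information package XML document, SHA-256 is an assumed choice of algorithm, and the function names `sha256_of`, `build_manifest`, and `verify_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(package_dir):
    """Map every file's relative path to its checksum (sender side)."""
    package_dir = Path(package_dir)
    return {
        p.relative_to(package_dir).as_posix(): sha256_of(p)
        for p in sorted(package_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(package_dir, manifest):
    """Compare received files with the manifest (receiver side).

    Returns three sorted lists of relative paths:
    missing, altered, and unexpected files.
    """
    actual = build_manifest(package_dir)
    missing = sorted(set(manifest) - set(actual))
    unexpected = sorted(set(actual) - set(manifest))
    altered = sorted(
        name for name in manifest.keys() & actual.keys()
        if manifest[name] != actual[name]
    )
    return missing, altered, unexpected
```

Every file has to be read in full on both sides of the transfer, so the time spent on checksums grows with the total size of the package.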

The complexity grows. Several questions arise:

Going through the questions, we can see that all XML documents and files can be in one package, but that might give us an XML document that takes 24 hours to validate if the database was filled with files. On top of that, time is needed to check all the checksums that confirm that what was supposed to be transferred arrived without losses or spurious additions.

This means that we need recommendations and rules for splitting the content across more than one information package. This is where we are today: working out the best recommendations on how to split a huge amount of content into different information packages.
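As a thought experiment only, one naive way to split content is to fill a package until a size limit is reached and then start the next one. The Python sketch below is not the recommendation under discussion. The function name `split_into_packages`, the size limit, and the rule of splitting purely by size are assumptions made for illustration.

```python
def split_into_packages(files, max_bytes):
    """Group (name, size) pairs, in order, into packages of at most max_bytes.

    A single file larger than max_bytes still gets a package of its own,
    because this sketch never splits an individual file.
    """
    packages = []
    current, current_size = [], 0
    for name, size in files:
        # Close the current package if this file would push it over the limit.
        if current and current_size + size > max_bytes:
            packages.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        packages.append(current)
    return packages


# Example with made-up file names and sizes in bytes.
files = [("a.bin", 40), ("b.bin", 50), ("c.bin", 30), ("d.bin", 200), ("e.bin", 10)]
print(split_into_packages(files, max_bytes=100))
# → [['a.bin', 'b.bin'], ['c.bin'], ['d.bin'], ['e.bin']]
```

A real rule set has more to consider than size, for instance how each package stays understandable on its own and how the XML document keeps its references to files that end up in other packages.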