Solution

The managed migration of information to more modern formats is a key defence against format obsolescence.

Identification

The first step in an active preservation pathway is to find out what form the information is held in. This must be done automatically, because you cannot rely on users having the time or know-how to do it. The UK National Archives have started providing tools in this area by developing PRONOM, their database of known file formats. PRONOM works together with DROID, a tool that examines the byte patterns of files to identify their format. The next step is to validate the file format, using tools specific to the format identified. Validation is often part of the identification process: several possible formats are put forward, but only one of them validates. The range of validation tools is limited and includes proprietary programs as well as open source tools such as JHOVE from Harvard University.
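As a rough illustration of identification by byte pattern, the sketch below matches the start of a file against a small table of well-known signatures. The table and the function name candidate_formats are assumptions made for this example; this is not how DROID or PRONOM are implemented, and real signature registries are far richer.

```python
# Hypothetical signature table: format name -> leading "magic" bytes.
# A real registry holds many more signatures, including offsets and
# end-of-file patterns; this table exists only for illustration.
SIGNATURES = {
    "PDF": b"%PDF-",
    "PNG": b"\x89PNG\r\n\x1a\n",
    "ZIP container": b"PK\x03\x04",
    "OLE2 container": b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",
}


def candidate_formats(data: bytes) -> list[str]:
    """Return every format whose signature matches the start of the data."""
    return [name for name, magic in SIGNATURES.items() if data.startswith(magic)]


print(candidate_formats(b"%PDF-1.4\n"))  # ['PDF']
print(candidate_formats(b"plain text"))  # []
```

A container match such as "ZIP container" is only a candidate: a format-specific validator would still be needed to decide which concrete format the file really is.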

Technical information

The next step is to extract technical information about what you have identified. This includes file-based characteristics (for example, for Microsoft Word, the number of pages and flags for “password protected”, “contains hidden text” and “contains change history”). Some of these characteristics are specific to a format or family of formats (e.g. “contains VB Macros” for Microsoft Office documents) but others apply across formats (e.g. the colour histogram of an image). Again the tools for this are limited, but they include open source tools such as JHOVE, callable library routines and professional programs.
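To make the idea of a format-specific characteristic concrete, here is a minimal sketch that checks a ZIP-based Office Open XML document for a VBA macro part. It assumes the newer XML-based Office formats, where macro-enabled files carry a part named vbaProject.bin; older binary Word files would need a different tool. The function and helper names are hypothetical.

```python
import io
import zipfile


def ooxml_characteristics(data: bytes) -> dict:
    """Extract a few file-based characteristics from a ZIP-based Office file."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        part_names = archive.namelist()
    return {
        "part_count": len(part_names),
        # Macro-enabled Office Open XML files contain a vbaProject.bin part.
        "contains_vb_macros": any(
            name.endswith("vbaProject.bin") for name in part_names
        ),
    }


def make_sample(part_names: list[str]) -> bytes:
    """Build an in-memory ZIP with empty parts (a stand-in for a real document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in part_names:
            archive.writestr(name, b"")
    return buffer.getvalue()


sample = make_sample(["word/document.xml", "word/vbaProject.bin"])
print(ooxml_characteristics(sample))
```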

Units of Information

The next and more complex task is to assemble these files into the logical units of information that need to be maintained. This is an important process, since files are technology-based but the real units of human-interpretable information that need to be maintained are often multi-file constructions (and indeed the number and structure of files may vary as technology moves on). For example, a GIS map could consist of many files in one technology, but be aggregated into a single file in the next generation of formats. We are not interested per se in preserving one file but rather the information overall. Once these information objects have been identified, their “essential characteristics” are also measured, since these are the properties that we want to ensure are maintained by future preservation actions.
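One simple way to picture this assembly step is to group files that share a location and base name, as the components of a multi-file GIS dataset often do. The sketch below rests on that assumption alone; real systems would need format-aware rules, and the function name group_into_units is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_into_units(paths: list[str]) -> dict[str, list[str]]:
    """Group file paths into candidate logical units by folder and base name."""
    units = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        # Assumption: files sharing a folder and stem belong to one unit.
        units[str(path.with_suffix(""))].append(path.name)
    return {unit: sorted(names) for unit, names in units.items()}


files = ["maps/roads.shp", "maps/roads.shx", "maps/roads.dbf", "maps/readme.txt"]
for unit, members in group_into_units(files).items():
    print(unit, members)
```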

Records at risk

At the end of this process the information is correctly identified and its characteristics are known. The next step is to identify those formats that are at risk. PRONOM provides lists of software that can read and write the formats it knows about, and records whether that software is still supported. A key risk factor is holding information in a format that is no longer supported. Other risk factors can include the file-based technical characteristics that were measured above, such as password protection and non-standard data structures. Using this information together, it is possible to identify those files that should be targeted for preservation actions.
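The risk assessment described here can be sketched as a rule check over each file's format and measured characteristics. Everything in the example is an assumption for illustration: the registry contents, the format identifiers and the characteristic names are invented, and no real PRONOM data or API is used.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    name: str
    format_id: str
    characteristics: dict = field(default_factory=dict)


# Hypothetical registry: format -> software that is still supported.
SUPPORTED_SOFTWARE = {
    "word-2.0": [],
    "pdf-1.4": ["example-reader"],
}

# Hypothetical characteristics that count as risk factors on their own.
RISKY_CHARACTERISTICS = ("password_protected", "non_standard_structure")


def risk_reasons(record: FileRecord) -> list[str]:
    """Return the reasons, if any, why a file should be targeted for action."""
    reasons = []
    if not SUPPORTED_SOFTWARE.get(record.format_id):
        reasons.append("no supported software")
    for flag in RISKY_CHARACTERISTICS:
        if record.characteristics.get(flag):
            reasons.append(flag)
    return reasons


old_letter = FileRecord("letter.doc", "word-2.0", {"password_protected": True})
report = FileRecord("report.pdf", "pdf-1.4")
print(risk_reasons(old_letter))  # ['no supported software', 'password_protected']
print(risk_reasons(report))      # []
```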

Preservation Action

The two key strategies currently recommended are migration and emulation. Migration involves moving the data to formats that are currently supported, for example moving Word 2.0 to Word 2007. Alternatively you may move it to a different format family, for example Word 2.0 to PDF 1.4. Both routes have their challenges. For example, Word to PDF may lose hidden text, so any migration has to be validated and errors identified. This requires extracting the characteristics of the migrated file and comparing them with those of the original to identify changes. This comparison can be simple (is the page count the same?) or more complex (is the image colour histogram the same?).
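The validation step can be thought of as a comparison of characteristics measured before and after migration. The sketch below assumes the characteristics have already been extracted into plain dictionaries; the key names and the function compare_characteristics are hypothetical.

```python
def compare_characteristics(original: dict, migrated: dict, essential: list[str]) -> dict:
    """Return {name: (before, after)} for each essential characteristic that changed."""
    return {
        name: (original.get(name), migrated.get(name))
        for name in essential
        if original.get(name) != migrated.get(name)
    }


before = {"page_count": 12, "contains_hidden_text": True}
after = {"page_count": 12, "contains_hidden_text": False}

changes = compare_characteristics(before, after, ["page_count", "contains_hidden_text"])
print(changes)  # {'contains_hidden_text': (True, False)}
```

An empty result would suggest the migration preserved everything that was measured. A more complex comparison, such as one between colour histograms, would probably need a tolerance rather than strict equality.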

As described above, when migrating information it is important to move beyond the file view and to migrate logical units of information. The best example of this is web pages where migrating image files will result in broken links. It is thus important to follow up migrating files by migrating other files within the same logical components that depend on them, for example changing the links in the HTML to point at the new image file. Also, containers must be recognised and the files within them migrated and replaced, leading to a new copy of the container file. This can lead to a cascade of migrations from one original action.
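The follow-up step for web pages might look like the sketch below, which rewrites src and href attributes after an image has been migrated to a new file. It is a deliberately crude, regular-expression-based illustration with a hypothetical function name; a production tool would more likely use a proper HTML parser and would also handle relative paths and stylesheets.

```python
import re

# Matches src="..." or href='...' and captures attribute, quote and target.
LINK_PATTERN = re.compile(r'\b(src|href)=(["\'])(.*?)\2')


def rewrite_links(html: str, renamed: dict[str, str]) -> str:
    """Point links at migrated files; leave all other links untouched."""
    def swap(match: re.Match) -> str:
        attribute, quote, target = match.group(1), match.group(2), match.group(3)
        return f"{attribute}={quote}{renamed.get(target, target)}{quote}"

    return LINK_PATTERN.sub(swap, html)


page = '<img src="logo.gif"> <a href="about.html">About</a>'
print(rewrite_links(page, {"logo.gif": "logo.png"}))
# <img src="logo.png"> <a href="about.html">About</a>
```

Because the HTML file itself changes as a result, it too becomes a migrated file, which is one way the cascade described above can arise.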

The result is a new “manifestation” of the information being managed. This terminology is important: it is not a new version, because it is intended to convey the same meaning. A manifestation may combine some files that have changed and some that have not, resulting in a many-to-many relationship between information objects, manifestations and files.
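One possible way to model these relationships is sketched below, with an unchanged file shared between two manifestations of the same information object. The class and field names are assumptions chosen for this example, not a description of any particular product's data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredFile:
    path: str
    format_id: str


@dataclass
class Manifestation:
    label: str
    files: list[StoredFile] = field(default_factory=list)


@dataclass
class InformationObject:
    title: str
    manifestations: list[Manifestation] = field(default_factory=list)


photo = StoredFile("site-photo.jpg", "jpeg")
original = Manifestation("original", [StoredFile("report.doc", "word-2.0"), photo])
# Only the Word file is migrated; the photo is reused unchanged.
migrated = Manifestation("migrated", [StoredFile("report.pdf", "pdf-1.4"), photo])

record = InformationObject("Site report", [original, migrated])

shared = set(original.files) & set(migrated.files)
print([f.path for f in shared])  # ['site-photo.jpg']
```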

Extending OAIS

The challenges discussed above extend the OAIS model to include Preservation Action as well as Preservation Planning. It is also important that as much as possible of this process is automated, as in very large data stores it will become too complex to migrate information individually. To that end all the identification, validation, characterisation and migration tools must be deployable at run time within a configurable system that allows actions to be run with minimum human intervention.
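A configurable system of this kind could, for instance, register each tool under a name and run a workflow that is defined purely as data. The following is a minimal sketch of that idea under stated assumptions: the registry, the two toy tools and the workflow list are all hypothetical and stand in for real identification and characterisation tools.

```python
from typing import Callable

# Registry of available tools, filled in as tools are defined or plugged in.
TOOLS: dict[str, Callable[[dict], dict]] = {}


def tool(name: str):
    """Decorator that registers a function as a named workflow step."""
    def register(func):
        TOOLS[name] = func
        return func
    return register


@tool("identify")
def identify(record: dict) -> dict:
    # Toy identification: recognises only the PDF signature.
    fmt = "pdf" if record["data"].startswith(b"%PDF-") else "unknown"
    return {**record, "format": fmt}


@tool("flag_unknown")
def flag_unknown(record: dict) -> dict:
    return {**record, "needs_review": record.get("format") == "unknown"}


def run_workflow(record: dict, steps: list[str]) -> dict:
    """Apply the configured steps in order, with no human intervention."""
    for step in steps:
        record = TOOLS[step](record)
    return record


WORKFLOW = ["identify", "flag_unknown"]  # configuration, not code
result = run_workflow({"name": "a.pdf", "data": b"%PDF-1.4"}, WORKFLOW)
print(result["format"], result["needs_review"])  # pdf False
```

Keeping the workflow as a list of names means that steps can be added, removed or reordered without changing the tools themselves.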

© Copyright 2014 Tessella.