
The huge and constantly changing range of file formats used to store data is one of the biggest challenges to successful archiving.

The information required for archiving is generated, managed, stored and distributed on computer systems. The use of paper during the active life of information is almost at an end, and key data is increasingly digital. However, this data is held in a wide and challenging range of formats, which can be summarised as follows:

  • Document-based information in common formats. Traditional records management was concerned with sets of documents that could be printed, signed, copied and distributed as paper. Electronic systems used files, but the focus on document-based information remained. The two main format families used today are the Microsoft Office formats (Word, Excel, PowerPoint, etc.) and the Adobe formats, PDF and PDF/A. Many people consider these static, meaning that a printout is an adequate representation of the content. This is increasingly not the case: features such as hidden text and change history make the digital copy more useful than its paper equivalent. Digital documents also have important behaviour, such as animations, macros and formulas whose values are critical. In addition, documents may contain embedded files holding extra information that is never printed.
  • Specialist formats. Much information in industry or research is held in specialist formats that are much less common. This includes CAD files containing complex 3D diagrams that cannot be printed and are used interactively on the screen. Others include the outputs from instruments and lab records, all of which are held in digital formats that may be cross-industry but are proprietary and complex.
  • Local data formats. Some formats are particular to each industry, for example standard models and measurements. The use of these may be associated with software that is supported by a single organisation and that may change significantly from version to version.
  • Compound data formats. One complication that will increase as computer systems become more complex and interactive is the interrelationship between files. A CAD design, for example, is built up from multiple files that add layers of complexity, and these files may be designed using different packages. Another example is a multimedia object, which may comprise multiple video and sound streams plus text notes, all built into a single presentation. These files must be used as a single unit but managed separately. This becomes more complex when a low-level object, for example a common component, is used in many different places.
  • Web-based information. A web site is, to some extent, a good example of a series of compound data objects: each page can be made up of HTML plus images, scripts and style sheets, which must be managed as a single unit. Increasingly, websites are just a representation of a database of dynamic information that is continuously changing.
  • Container formats. Often, data is held in containers for convenience: ZIP, TAR, GZIP, etc. When considering future access, the files within these containers are as important as the container itself. Some other formats, such as Microsoft Office, can contain embedded files, making them both content files and containers. In all cases this can be multi-layered, for example a TAR file containing a ZIP file, containing an MS Word file, which itself has an embedded MS Excel spreadsheet.
  • Open databases. A great deal of information is held in databases, and many of these are “open”, that is, the table names, column meanings and behaviour are understood and documented. These databases still need an application to make them useful, and these applications can be highly complex, but the data itself is accessible.
  • Proprietary databases. Much information is stored in closed databases where direct access to the data is not allowed, for example finance and HR applications or workflow systems. Access may be allowed via export systems but these can be complex to use, and the content is always changing.
  • Email. Increasingly the backbone of corporate life, email systems can be considered a specific type of proprietary database. They contain text, attached documents and metadata recording who sent what to whom and when. Often they are integrated with a corporate workflow system, e.g. MS Exchange or Lotus Notes, and also refer to accreditation or certification systems that prove who sent what and when.
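
The multi-layered container case described above (a TAR file holding a ZIP file holding an MS Word file with an embedded spreadsheet) can be illustrated with a short sketch. This is only an illustration using Python's standard library, not a description of any particular archiving product. The names walk_container and build_demo_archive, the depth limit and the demo file contents are all made up for the example. It relies on the fact that modern Office files such as .docx are themselves ZIP packages. Standalone GZIP files that are not compressed TARs are not unpacked here.

```python
import io
import tarfile
import zipfile
from typing import Dict, Iterator, Tuple


def _open_tar(data: bytes):
    """Return an open TarFile (plain or gzip/bzip2/xz-compressed), or None."""
    try:
        return tarfile.open(fileobj=io.BytesIO(data), mode="r:*")
    except tarfile.TarError:
        return None


def walk_container(data: bytes, path: str = "",
                   max_depth: int = 8) -> Iterator[Tuple[str, int]]:
    """Yield (path, size_in_bytes) for each leaf file, opening nested layers.

    max_depth is an illustrative safety limit against deeply nested input.
    """
    # TAR is tried before ZIP so that a TAR wrapping a ZIP is not
    # mistaken for the ZIP it contains.
    tar = _open_tar(data) if max_depth > 0 else None
    if tar is not None:
        with tar:
            for member in tar.getmembers():
                if member.isfile():
                    inner = tar.extractfile(member).read()
                    yield from walk_container(
                        inner, f"{path}/{member.name}", max_depth - 1)
    elif max_depth > 0 and zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield from walk_container(
                        zf.read(info), f"{path}/{info.filename}", max_depth - 1)
    else:
        # Not a recognised container (or depth limit reached): a leaf file.
        yield path, len(data)


def _zip_bytes(files: Dict[str, bytes]) -> bytes:
    """Build a ZIP archive in memory from a name -> content mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()


def build_demo_archive() -> bytes:
    """Made-up demo data: TAR > ZIP > .docx-like ZIP > embedded file."""
    docx = _zip_bytes({
        "word/document.xml": b"<doc/>",
        "word/embeddings/sheet.xlsx": b"not a real workbook",
    })
    bundle = _zip_bytes({"report.docx": docx})
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name="bundle.zip")
        info.size = len(bundle)
        tar.addfile(info, io.BytesIO(bundle))
    return buf.getvalue()


if __name__ == "__main__":
    for leaf_path, size in walk_container(build_demo_archive()):
        print(leaf_path, size)
```

Each path reported by the sketch records every container layer a file passed through, which is the kind of information an archive would need in order to reach the innermost content later.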

When archiving for any significant length of time, it is critical to choose a system that can handle the diverse data sources and formats you wish to retain.

© Copyright 2012 Tessella plc