Archives are repositories of information that is retained for an extended period, and then retrieved and decoded.
All information needs to be coded in order to be stored. The coding may be as simple and unstructured as a free-form printed text (coded as a specific language and character set), or as complex as a website front-end and a linked SQL database. From this, it follows that there must be three processes:
- Encoding the data
- Storing it
- Decoding it
We’ll look at the challenges of each in turn.
At the point at which the data is encoded and stored, there may be only a very small probability that any particular part of the data will ever be looked at – the payback to society may be small and a long time in coming. It may be possible to guess some of the uses to which the data will be put, but many of the most valuable ways in which it may be used may be impossible to foresee. A value judgement must be made as to what is and isn’t worth keeping – but it may be impossible to build a convincing business case for saving anything.
This implies that the cost of encoding per byte of information must be negligible; that is to say, the data probably has to be stored in whatever format it happens already to be coded – special coding for preservation is unlikely to be justifiable. Since we do not know how the data will be used, any attempt to modify the coding is just as likely to be unhelpful as helpful. Furthermore, any re-coding or re-packaging is likely to result in at least some degradation of the data.
Encoding frequently, indeed usually, neglects the metadata. For example, to interpret fully and correctly the data in a database table that has been populated manually, one must know the question that was asked, the response to which is stored in the database. If proper weight is to be put on the reliability and accuracy of the data, we should also record when and of whom it was asked, and under what circumstances. If this metadata is recorded digitally – and it often isn’t – it is likely that over time it will become separated from the data to which it refers.
Even a dataset that has been acquired entirely automatically will have garnered its data from sensors with imperfect characteristics, relying on other parameters remaining constant when in fact they may not have done so.
In summary, information about how the data was gathered is an essential part of interpreting it correctly and avoiding unwarranted assumptions.
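To make this concrete, here is a minimal sketch of how such provenance metadata might be written as a plain-text sidecar file next to the dataset it describes. The field names and values are invented purely for illustration; they are not any recognised standard.

```python
import json

# A hypothetical provenance record for one manually populated survey table.
# All field names and values here are illustrative assumptions.
metadata = {
    "dataset": "household_survey.csv",
    "question_asked": "How many people normally live in this household?",
    "asked_of": "head of household, in person",
    "asked_when": "2009-11-14",
    "circumstances": "door-to-door survey; answers not independently verified",
}

# Write the metadata as a plain-text sidecar file with a matching name,
# so the data and its metadata are less likely to become separated.
with open("household_survey.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

The choice of plain text and a matching filename is deliberate: a future reader needs nothing more than a text editor to understand what the dataset is and how much weight to give it.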
The good news is that encoding of the data happens at a time when the data has the maximum amount of resources and attention devoted to it, and therefore this is the stage at which it is most possible to take extra steps to ensure successful preservation of the data. The bad news is that everyone is focused on the present task and the immediate deliverable, and preservation has a very low priority.
Retention and Preservation
The data may be stored for very long periods (years certainly, decades probably, centuries possibly) and therefore the cost of storage per year must be very low. The preservation must not depend on the longevity of the organisation looking after it, nor must it demand frequent attention.
The traditional bound paper book is a great data-preservation medium, and it’s been around a long time. The encoding (typesetting) may be a complex process that is opaque to most of us, but once printed and bound, the medium can remain in storage for hundreds of years with little or no attention. The environmental requirements are really limited to maintaining reasonable humidity and avoiding bookworm.
Compared with the book, all digital media fare very badly. In comparison with the cost of looking after a book, the cost of storing digital media is many orders of magnitude higher. Just how much higher is difficult to calculate (say in pounds per gigabyte per year), because one byte is not necessarily of equal value to another. A high-resolution photograph may need 10MB and be of little value or interest to anyone, yet Shakespeare’s entire works would fit into 10MB easily. And if we have more space, we will keep more.
Much of the world’s current digital data exists only on server farms, where it is held on disks that typically have to be kept spinning at 10,000rpm, 24 hours a day, 365 days a year, demanding a constant supply of electricity, air conditioning, engineers, software patches and backups.
Although some of these server farms run on very creative revenue models (Facebook, Google Mail) the cost of the operation is real enough, and this must be covered by revenue from some other source – if not the users, then from advertisers, or from companies that will pay for intelligence gathered by mining the stored data or user profiles.
Sadly, offline data is only a little better – every form of digital media has its issues:
- Tapes have to be re-spooled annually to ensure that the oxide doesn’t come off.
- Disconnected hard disk drives must be spun up annually to ensure they still work.
- Recordable CD-ROMs have very questionable longevity – the expectation is that they will last over ten years, but no-one really knows.
During the retention period, any unexpected degradation will only be spotted if regular tests are made to ensure the data is readable. This also requires effort.
All of these digital assets require someone to keep an eye on them – 24/7 for online data, and at least annually for offline.
Someone needs to pay for these resources somehow, and unlike a book, if the payment dries up, the data will be lost.
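A regular readability test need not be elaborate. The sketch below records a checksum for each file at archiving time and recomputes it later, so that silent degradation shows up as a mismatch. The filenames and the sidecar convention are illustrative assumptions, not a prescribed scheme.

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA-256 digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def fixity_check(path: str) -> bool:
    """Compare a file against the digest recorded when it was archived."""
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    return sha256_of(path) == expected

# At archiving time: record the digest alongside the file.
with open("dataset.bin", "wb") as f:
    f.write(b"archived payload")
with open("dataset.bin.sha256", "w") as f:
    f.write(sha256_of("dataset.bin"))

# Years later: an annual check spots any silent degradation.
assert fixity_check("dataset.bin")
```

Running such a check annually over an offline archive turns "keep an eye on it" into a mechanical, scriptable task – though, as noted above, someone still has to run it and pay for it.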
Decoding a book requires no technological equipment – only an understanding of the written language. As an extreme example, the Dead Sea Scrolls were forgotten for at least 2000 years, and are still readable today (admittedly, with difficulty). We may be confident that there will still be books around in a hundred years’ time, and that most of them will still be readable. The metadata for a book is all contained in the flyleaf.
Decoding digital data is not so simple or resistant to the march of progress. By the time decoding is required, it’s a fair bet that technology will have moved on. Assumptions that were made about the availability of certain tools for decoding are unlikely to be borne out.
Much digital data is stored in proprietary formats, which can only be read by software of a certain antiquity written by a certain organisation. Many things can happen in time, any one of which will make the data unreadable:
- The organisation chooses no longer to support this format in its latest software release, and earlier releases will not run under current operating systems on current hardware
- The licence to use the software expires, and a new licence cannot be justified or paid for
- The company ceases to trade, and no-one else supports the format
- The physical media cannot be read on any available hardware (remember 8” floppy disks?)
- The media has degraded with time to the point where it is unreadable
A complete set of the metadata for decoding a digital dataset would have to include:
- All the technical drawings and documentation needed to recreate the hardware on which the media is to be read
- The data coding format to be decoded
- All the technical drawings and documentation needed to recreate the computer platform on which the program to interpret the data is to be run
- Copies of all the original software
This is a tall order, and as far as I know, it has never been achieved in practice.
What can be done?
Here are a few steps to consider:
- Include a very easy-to-read file containing a description of the contents of the dataset, so that a future researcher can tell whether it is likely to be worth the effort to decode it.
- Use open public data standards that many companies support.
- If it exists and copyright permits, include a simple read-only data reader and display program with the dataset
- Publish the data as widely as possible (subject to confidentiality constraints, of course), so that as many people as possible assume responsibility for preserving it.
- As far as you can, build all the intelligence into the encoding process, so that the decoding is as trivial and technology-independent as possible. If possible, make the data self-documenting, so it takes the minimum of understanding to interpret it – even if this results in increased file size.
- Build redundancy into the dataset, so that future users can cross-check the dataset for internal consistency. This will have several benefits:
- Any corruption will be detectable through failed self-consistency checks
- Users will be able to confirm their understanding of the structure through seeing the data match where they expect it to
- Build and test a working virtual shell of the computer system used to encode and read the data. Better still, mothball the actual machine too.
- Document the metadata as carefully and thoroughly as possible, and associate this with the dataset in whatever way is least likely to result in the separation of the two.
- Retain and control responsibility for preserving the data, ideally within a public body that is unlikely to disappear overnight. For example, a university is preferred to a commercial company, and either is preferable to some anonymous free hosted service that could be withdrawn at any time.
- If you can justify it, whenever some technological development happens that could make a preserved dataset unreadable, convert the dataset to the new format.
- If this can’t be justified, then at least create or identify a “Rosetta Stone” showing how the mapping from the old format to the new should be done, if it ever turns out to be necessary. (This might be a format-translation program or similar.) A copy of this program should be preserved with the data. Then anyone wishing to review part of the data in the future can run the translator on the area of interest. If the time period is long, they may have to run several translators in series to render the data usable.
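The idea of running several translators in series can be sketched as follows. The formats and translation rules here are invented purely for illustration; in practice each step would be one preserved “Rosetta Stone” program kept alongside the data.

```python
# Hypothetical translators: each one maps an obsolete format to its
# successor. None of these formats is real; they stand in for whatever
# conversions the archive has accumulated over the years.

def v1_to_v2(text: str) -> str:
    # Imagine v1 separated fields with semicolons; v2 uses commas.
    return text.replace(";", ",")

def v2_to_v3(text: str) -> str:
    # Imagine v3 added a header line identifying the format version.
    return "format=v3\n" + text

def translate(data: str, translators) -> str:
    """Apply each preserved translator in order, oldest first."""
    for step in translators:
        data = step(data)
    return data

# A record archived in the oldest format, brought forward in two steps.
modern = translate("1623;First Folio;London", [v1_to_v2, v2_to_v3])
assert modern == "format=v3\n1623,First Folio,London"
```

The point of preserving the translators rather than re-converting the whole archive is that only the area of interest ever needs to be run through the chain, and only if someone actually wants to read it.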
The urgency of the issue
So far, thanks to the workings of Moore’s Law, our data storage capacity has increased as fast as we have created digital data. We have been able regularly to buy a new hard disk with double the capacity, and copy all the old data onto it. Experts disagree about how long Moore’s Law can hold good, but it is certain that there will come a point in the not too distant future when it will break (it will be pleasantly surprising if it can last another decade), and we will have to look seriously at what digital data is worth keeping and what should be thrown away.
Already today, most of the essential metadata is not being recorded, and it is certainly not being inseparably connected to the datasets to which it refers.
If nothing changes, it is easy to imagine the historian or researcher of a century hence saying, “Society has excellent written records until the end of the Twentieth Century, but then the historical record, at least as far as we can read it, more-or-less dries up.”
Is this the legacy we want to leave to future generations?
Chris Moller, 07Feb2010.