Making data archives that last is an urgent task. That's the message of the European Commission's eArchiving initiative, which has just announced version 2.0 of its architecture and the renewal of its funding for another two years.
Under the tutelage of the commission, the initiative will define processes – using open formats and metadata – that mean organizations won't have to keep old IT equipment hanging around just in case they need it to read old data.
“There are a number of problems when you want to restore very old data,” said Gregor Završnik, a researcher at the University of Slovenia, a consultant in geospatial data archiving and a member of the eArchiving initiative. “For sure, you have to be able to read the storage media and read the file format – but there is worse. When you have finally extracted data from an Excel table, you don't have the context.
“So, you don't know what the numbers you have restored correspond to. How were they collected? With what level of precision? Are they authentic?” he added, when talking to French sister site LeMagIT during a recent IT Press Tour event.
The eArchiving initiative builds on the E-Ark project, a community of developers that has worked since 2014 to create universal and perennial tools to validate, reformat and archive data. The key challenge is to make archives interoperable via common encoding while also conforming to regulatory requirements.
From research project to European initiative
“At the start of E-Ark, we imagined we'd create a universal format for archiving,” said Završnik. “But as we progressed, we realized these archives are mostly kept by those who created the data originally, and that everyone thinks that this data will be commercially valuable even way in the future. So, what we need is to create a standard that allows an enterprise to restore its own archives after several years.”
A key challenge, however, has been that the E-Ark project has struggled to bring together the big players in storage and backup. It is made up of a dozen teams, but these are overwhelmingly from the world of research.
The challenge at the level of the European Commission is that, to transform E-Ark into eArchiving, the technical content of the project needs to become an accepted standard in the market. A key early step is to standardize the universal archive format imagined by E-Ark so that it corresponds to the new revision of ISO 14721, the reference model for an open archival information system.
“If the commission demands that the public sector in the EU adopts our archive format, it can't oblige enterprises to do the same,” said Završnik. “But it can say to them that if they use an open format, they won't be locked in for eternity to a technology that necessitates use of commercial tools. And what's more, it will allow free exchange of data between each other.”
CSIP format allows for specialized metadata
The file format proposed by the initiative is Common Specification for Information Packages (CSIP), which has its own dedicated portal for those wanting to convert data to a perennial archive format or for software houses that want to implement it in products.
“The format is free of any commercial licensing and is documented and structured to be able to be re-read, freely useable in whatever software, allowing for a unique numeric ID for each archive and definition of dependencies to other data,” said Završnik.
LeMagIT understood these to be data dependencies comparable to those of Linux packages, or of software that calls the third-party libraries it needs to function, such as when a land registry archive needs to work with mapping from another archive.
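The idea of a unique ID per archive plus declared dependencies can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `ArchivePackage` class, its field names and the example IDs are hypothetical and are not taken from the CSIP specification. It only shows how declared dependencies could be used to work out the order in which archives would need to be restored.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class ArchivePackage:
    """Hypothetical stand-in for an archive; not the CSIP schema."""
    package_id: str                     # unique ID of this archive
    depends_on: tuple[str, ...] = ()    # IDs of the archives it needs


def restore_order(packages: list[ArchivePackage]) -> list[str]:
    """Return package IDs so that every dependency precedes its dependant."""
    known = {p.package_id for p in packages}
    for p in packages:
        missing = set(p.depends_on) - known
        if missing:
            raise KeyError(f"{p.package_id} depends on unknown archives: {sorted(missing)}")
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {p.package_id: set(p.depends_on) for p in packages}
    return list(TopologicalSorter(graph).static_order())


# Example IDs are invented: a land registry archive that needs a mapping archive.
packages = [
    ArchivePackage("land-registry-2009", depends_on=("mapping-2009",)),
    ArchivePackage("mapping-2009"),
]
print(restore_order(packages))
```

In this sketch the mapping archive comes out ahead of the land registry archive, which mirrors the land registry example given to LeMagIT.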
CSIP is implemented via a management platform known as OAIS (Open Archival Information System). It comprises tools to convert source data using SIP (Submission Information Package), to preserve it after reformatting via AIP (Archival Information Package), and to redistribute it with only the data required for a particular profession or application using DIP (Dissemination Information Package).
Each sub-format has its own particular metadata. For example, DIP has metadata that allows for archive contents to be used in medical (file), commercial (SQL), architectural (3D modelling) or cartographic (vectorised imagery) contexts.
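The flow from SIP to AIP to DIP can be sketched roughly in Python. This is a simplified illustration under stated assumptions, not the initiative's tooling: the class fields, the `ingest` and `disseminate` functions, the per-file checksum and the `PROFILE_EXTENSIONS` mapping are all hypothetical choices made to show the shape of the three package types.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class SIP:
    """Submission package: source files as delivered by the producer."""
    package_id: str
    files: dict[str, bytes]      # file name -> content


@dataclass
class AIP:
    """Archival package: the same files plus preservation metadata."""
    package_id: str
    files: dict[str, bytes]
    checksums: dict[str, str]    # file name -> SHA-256 hex digest (assumed metadata)


@dataclass
class DIP:
    """Dissemination package: only what one profession or application needs."""
    package_id: str
    profile: str
    files: dict[str, bytes]


def ingest(sip: SIP) -> AIP:
    """Turn a SIP into an AIP, recording a checksum per file (illustrative step)."""
    checksums = {name: hashlib.sha256(data).hexdigest() for name, data in sip.files.items()}
    return AIP(sip.package_id, dict(sip.files), checksums)


# Hypothetical mapping from a usage profile to the file extensions it needs.
PROFILE_EXTENSIONS = {"commercial": (".sql",), "cartographic": (".gml", ".svg")}


def disseminate(aip: AIP, profile: str) -> DIP:
    """Build a DIP holding only the files relevant to the requested profile."""
    wanted = PROFILE_EXTENSIONS[profile]
    selected = {name: data for name, data in aip.files.items() if name.endswith(wanted)}
    return DIP(aip.package_id, profile, selected)


sip = SIP("registry-2009", {"parcels.sql": b"CREATE TABLE parcels (id INT);", "map.gml": b"<gml/>"})
aip = ingest(sip)
dip = disseminate(aip, "commercial")
print(sorted(dip.files))
```

Here the commercial DIP carries only the SQL file, while the AIP keeps everything that was submitted along with its checksums.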
The new version, v2.0, refines the detail of the format. Notably, it categorizes metadata into six groups: strategy, business, application, technology, implementation and migration. For each of these there are four settings: passive structure, behaviour, active structure and motivation.
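One way to picture that categorization is as a grid of six groups by four settings. The Python sketch below is only an illustration of that reading: the `metadata_grid` structure and the sample entry are hypothetical and do not come from the version 2.0 documentation.

```python
from itertools import product

GROUPS = ("strategy", "business", "application", "technology", "implementation", "migration")
SETTINGS = ("passive structure", "behaviour", "active structure", "motivation")

# One empty slot per (group, setting) pair; each slot would hold metadata entries.
metadata_grid: dict[tuple[str, str], list[str]] = {
    cell: [] for cell in product(GROUPS, SETTINGS)
}

# Hypothetical entry, for illustration only.
metadata_grid[("technology", "passive structure")].append("storage medium: LTO tape")

print(len(metadata_grid))  # number of (group, setting) cells
```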