ISO 28500:2009 - a new standard for the WARC file format

Paul Boughton

A web page that is here today may not be here tomorrow. However, a new ISO standard, ISO 28500:2009 (Information and documentation – WARC file format), aims to ensure that the vast amount of often valuable information posted on the web is not lost when a page changes or disappears.
 
ISO 28500 provides a file format known as WARC (Web ARChive), which offers a convention for combining multiple data objects into one long file. The format can be used to build applications for harvesting, managing, accessing and exchanging content.
 
Clément Oury, a member of the working group that developed the standard, says: "For a long time, keeping track of the staggering number of web sites and pages posed a difficult challenge for digital curators and archivists, and resulted in countless lost data.
 
"With WARC, ISO 28500 takes Internet archiving to the next level by enabling the effective management, structure and storage of billions of resources collected from the web and elsewhere. Its standardisation offers a guarantee of durability, and will help web archiving become part of the mainstream activities of heritage institutions and other branches by, for example, fostering the development of new tools and ensuring interoperability between collections."
 
The WARC format is an extension of the ARC file format, which the Internet Archive has used since 1996 and which numerous heritage institutions also use to store 'web crawls', that is, captures of web pages together with their links. The motivation to extend ARC arose from the discussions and experiences of these organisations within the International Internet Preservation Consortium (IIPC), whose core mission is to acquire, preserve and make accessible knowledge and information from the Internet for future generations. IIPC members were finding it increasingly difficult to store and manage the growing volume of information coming from the Internet.
 
The WARC format differs from ARC in that it offers new possibilities, notably the recording of HTTP request headers and of arbitrary metadata, the allocation of an identifier to every contained file, the management of duplicates and of migrated records, and the segmentation of records. WARC files are intended to store every type of digital content, whether retrieved by HTTP or by another protocol.
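
To make the idea of "combining multiple data objects into one long file" more concrete, the following Python sketch writes a few simplified WARC-style records, concatenates them, and reads them back. It is an illustration under stated assumptions, not a conforming implementation of the standard: the helper names make_record and read_records, the example URI and the payloads are hypothetical, and only a handful of header fields (WARC-Type, WARC-Record-ID, WARC-Date, Content-Length, WARC-Target-URI) are modelled.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def make_record(record_type, payload, extra_headers=None):
    """Serialize one simplified WARC/1.0-style record (hypothetical helper)."""
    headers = {
        "WARC-Type": record_type,
        # Every record gets its own identifier, written as a URI in angle brackets.
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        "WARC-Date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Content-Length": str(len(payload)),
    }
    headers.update(extra_headers or {})
    head = b"WARC/1.0" + CRLF
    for name, value in headers.items():
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # Blank line ends the header; two CRLFs end the record.
    return head + CRLF + payload + CRLF + CRLF


def read_records(data):
    """Yield (headers, payload) for each record in concatenated record bytes."""
    pos = 0
    while pos < len(data):
        head_end = data.index(CRLF + CRLF, pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        if lines[0] != "WARC/1.0":
            raise ValueError("unexpected version line: " + lines[0])
        headers = dict(line.split(": ", 1) for line in lines[1:])
        start = head_end + 4  # skip the blank line after the header
        length = int(headers["Content-Length"])
        yield headers, data[start:start + length]
        pos = start + length + 4  # skip the two CRLFs that close the record


# Hypothetical capture: the HTTP request, the response, and a curator's note.
uri = {"WARC-Target-URI": "http://example.org/"}
archive = (
    make_record("request", b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n", uri)
    + make_record("response", b"HTTP/1.1 200 OK\r\n\r\nhello", uri)
    + make_record("metadata", b"note: illustrative annotation", uri)
)

for headers, payload in read_records(archive):
    print(headers["WARC-Type"], headers["WARC-Record-ID"], len(payload))
```

Because each record declares its own length, a reader can step from one record to the next without interpreting the payload, which is what allows very different kinds of content to sit side by side in one file. For real archives, an established WARC library would be a better choice than hand-written parsing like this.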
 
Mr Oury adds: "Several applications are already WARC-compliant, such as the Heritrix crawler for harvesting, the WARC tools for data management and exchange, the Wayback Machine, Nutchwax and other search tools for access."
 
ISO 28500:2009 (Information and documentation – WARC file format) was developed by ISO technical committee ISO/TC 46, Information and documentation, subcommittee SC 4, Technical interoperability. The standard is available from ISO national member institutes, directly from the ISO Central Secretariat, or through the ISO Store, at a price of 118 Swiss francs.
 
For more information, visit www.iso.org
