A 2002 study that produced two reports (feasibility study, and related legal issues) pertaining to web archiving.

Web-archiving: a feasibility study for JISC and the Wellcome Trust


End date: 1 February 2004

Funding programme: Digital Preservation and Records Management Programme

JISC theme(s): Information environment, e-Administration

Purpose of study  

In March 2002, the Wellcome Trust and JISC awarded a contract to UKOLN to undertake a feasibility study into web archiving. The aims of this study were to provide the Wellcome Trust and JISC with:

  • an analysis of existing web archiving arrangements to determine the extent to which they address the needs of the UK research and FE/HE communities
  • recommendations on how the Wellcome Trust and the JISC could begin to develop web archiving initiatives to meet the needs of their constituent communities.

Recognizing that the legal implications of web archiving - copyright, data protection, defamation, etc. - are a key concern of any would-be web archivist, UKOLN contracted the Centre of IT and Law at the University of Bristol to undertake a separate study into the legal issues of web archiving.

To validate the findings of both studies, an international Advisory Board was established, comprising representatives from the British Library, National Library of Medicine, Library of Congress, Internet Archive and the National Library of Australia. The reports were circulated to this group in early December 2002 and, without exception, the findings and recommendations were fully endorsed. See the two reports.

Why collect and preserve the web?

  • In the short time since its invention, the World Wide Web has become a vital means of facilitating global communication and an important medium for scientific communication, publishing, e-commerce, and much else. The 'fluid' nature of the web, however, means that pages or entire sites frequently change or disappear, often without leaving any trace.
  • In order to help counter this change and decay, web archiving initiatives are required to help preserve the informational, cultural and evidential value of the World Wide Web (or particular subsets of it).

Why should the Wellcome Library be interested in this?

  • The Wellcome Library has a particular focus on the history and understanding of medicine. The web has had a huge impact on the availability of medical information and has also facilitated new types of communication between patients and practitioners as well as between these and other types of organizations. The medical web, therefore, has potential long-term documentary value for historians of medicine.
  • To date, however, there has been no specific focus on collecting and preserving medical websites. While the Internet Archive has already collected much that would be of interest to future historians of medicine, a preliminary analysis of its current holdings suggests that significant content or functionality may be missing.
  • There is, therefore, an urgent need for a web archiving initiative that would have a specific focus on preserving the medical web. The Wellcome Library is well placed to facilitate this and such an initiative would nicely complement its existing strategy with regard to preserving the record of medicine past and present.

Why should JISC be interested in this?   

JISC has a number of areas where web archiving initiatives would directly support its mission. These include:

  • JISC funds a number of development programmes. It therefore has an interest in ensuring that the web-based outputs of these programmes (e.g. project records, publications) persist and remain available to the community and to JISC. Many of the websites of projects funded by previous JISC programmes have already disappeared.
  • JISC also supports national development of digital collections for HE/FE and the Resource Discovery Network (RDN) services that select and describe high-quality web resources judged to be of relevance to UK further and higher education. A web archiving initiative could underpin this effort by preserving copies of some of these sites, e.g. in case the original sites change or disappear. The expertise and subject knowledge of the RDN could in turn assist development of national and special collections by bodies such as the national libraries or Wellcome Trust. These collections would be of long-term value to HE/FE institutions.
  • JISC also funds the JANET network used by most UK further and higher education institutions and, as its operator, UKERNA has overall responsibility for the ac.uk domain.

Collaboration

  • Collaboration will be the key to any successful attempt to collect and preserve the web.
  • The web is a global phenomenon. Many attempts are being made to collect and preserve it on a national or domain level, e.g. by national libraries and archives. This means that no single initiative (with the exception of the Internet Archive) can hope for total coverage of the web. Close collaboration between different web archiving initiatives will therefore be extremely important, e.g. to avoid unnecessary duplication in coverage or to share in the development of tools, guidelines, etc.
  • More specifically, there is a need for all organizations involved in web archiving initiatives in the UK to work together. In particular there is the opportunity to work closely with the British Library as it develops its proposals for web archiving as part of the national archive of publications. Potentially, many different types of organization have an interest in collecting and preserving aspects of the UK web, while the British Library (BL), the Public Record Office (PRO) and the British Broadcasting Corporation (BBC) have already begun to experiment with web archiving. The Digital Preservation Coalition (DPC) is well placed to provide the general focus of this collaboration, although there may be a need for specific communications channels.

Challenges   

The web poses preservation challenges for a number of reasons:

  • The web's fast growth rate and 'fluid' characteristics mean that it is difficult for humans to keep up with its content well enough to decide what is worth preserving.
  • Web technologies are immature and evolving all the time. Increasingly, web content is delivered from dynamic databases that are extremely difficult to collect and preserve. Some sites use specific software (e.g. browser plug-ins) that may not be widely available or use non-standard features that may not work in all browsers. Other websites may belong to the part of the web that is characterized by the term 'deep web' and will be hard to find using most web search services and maybe even harder to preserve.
  • Unclear responsibilities for preservation - the diverse nature of the web means that a variety of different organization types are interested in its preservation. Archives are interested in websites when they may contain records; libraries are interested when they contain publications or other resources of interest to their target communities. The global nature of the web also means that responsibility for its preservation does not fall neatly into the traditional national categories.
  • Legal issues relating to copyright, the lack of legal deposit mechanisms (at least in the UK), liability issues related to data protection, content liability and defamation. These represent serious problems and are dealt with in a separate report that has been prepared by Andrew Charlesworth of the University of Bristol.

Approaches   

Since the late 1990s, a small number of organizations have begun to develop approaches to the preservation of the web, or more precisely, well-defined subsets of it. Those organizations that have developed initiatives include national libraries and archives, scholarly societies and universities. Perhaps the most ambitious of these initiatives is the Internet Archive. This US-based non-profit organization has been collecting broad snapshots of the web since 1996. In 2001, it began to give public access to its collections through the 'Wayback Machine'. 

Current web archiving initiatives normally take one of three main approaches:

  • Deposit, whereby web-based documents or 'snapshots' of websites are transferred into the custody of a repository body, e.g. national archives or libraries.
  • Automatic harvesting, whereby crawler programs attempt to download parts of the surface web. This is the approach of the Internet Archive (which has a broad collection strategy) and some national libraries, e.g. those of Sweden and Finland.
  • Selection, negotiation and capture, whereby repositories select web resources for preservation, negotiate their inclusion in co-operation with website owners and then capture them using software (e.g. for site replication or mirroring, harvesting, etc.). This is the approach of the National Library of Australia and the British Library's recent pilot project.

These approaches are not mutually exclusive. Several web archiving initiatives (e.g. the Bibliothèque nationale de France and the National Library of New Zealand) plan to use combinations of both the selective and harvesting-based approaches. The selective approach can deal with some level of technical complexity in websites, as the capture of each can be individually planned and associated with migration paths. This may be a more successful approach with some parts of the so-called 'deep web'. However, hardware issues aside, selective collection would appear to be more expensive (per gigabyte archived) than the harvesting approach. Estimates of the relative costs vary, but the selective approach would normally be considerably more expensive in terms of staff time and expertise. This simple assessment, however, ignores factors related to the cost of preservation over time (whole-of-life costs), the potential for automation, and quality issues (i.e. fitness for purpose).
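
As a rough illustration of the automatic harvesting approach described above, the following Python sketch follows links breadth-first from a seed page and keeps only pages whose host falls within a chosen domain. It is a minimal sketch under stated assumptions, not the method of any initiative named here: the names `LinkExtractor`, `in_scope` and `harvest`, the example URLs, and the in-memory `PAGES` dictionary (standing in for real HTTP requests) are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def in_scope(url, domain_suffix):
    """True if the URL's host is the given domain or a subdomain of it."""
    host = urlparse(url).hostname or ""
    return host == domain_suffix or host.endswith("." + domain_suffix)


def harvest(seed, fetch, domain_suffix, limit=100):
    """Breadth-first harvest from seed; fetch(url) returns HTML or None."""
    archive = {}
    queue = deque([seed])
    while queue and len(archive) < limit:
        url = queue.popleft()
        if url in archive or not in_scope(url, domain_suffix):
            continue
        html = fetch(url)
        if html is None:  # page has disappeared or cannot be reached
            continue
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            queue.append(urljoin(url, href))  # resolve relative links
    return archive


# Hypothetical in-memory "web" used instead of real network access.
PAGES = {
    "http://www.example.ac.uk/": (
        '<a href="/project.html">Project</a> '
        '<a href="http://other.example.com/">Other</a>'
    ),
    "http://www.example.ac.uk/project.html": (
        '<a href="/">Home</a> <a href="/gone.html">Gone</a>'
    ),
    "http://other.example.com/": "<p>Out of scope</p>",
}

snapshot = harvest("http://www.example.ac.uk/", PAGES.get, "ac.uk")
```

A real harvester would also need to respect robots exclusion rules, rate limits and the permissions issues raised in the legal study. A crawler of this kind only reaches pages that are linked from other pages, which is one reason 'deep web' content is hard to harvest.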

Recommendations

  1. Both JISC and Wellcome Trust should attempt to foster good institutional practice with regard to the management of websites. For example, they could consider the development of website management guidelines for adoption by their user communities or for inclusion in grant conditions, etc.
  2. Until the exact position is clarified by legislation, a selective approach to web archiving - with appropriate permissions secured - would be the best way to proceed for both the JISC and the Wellcome Trust. Other methods of archiving will need to be approached with caution due to problems with copyright and other legal issues (see also the conclusions and recommendations in the associated legal study by Andrew Charlesworth).
  3. If the Wellcome Trust is to meet its strategic objectives in extending its collecting activities into the digital environment, then it will need to consider undertaking some kind of web archiving activity. To achieve this the following approach is recommended:
    • Establish a pilot medical web archiving project using the selective approach, as pioneered by the National Library of Australia (see also Recommendation 5).
    • This pilot should consider using the NLA's PANDAS software for this archiving activity. This pilot could be run independently or as part of a wider collaborative project with other partners.
    • The high-quality medical websites identified in the RDN gateway OMNI should be considered as the starting point for any medical web archiving initiative.
    • The Wellcome Library will need to develop a web archiving selection policy to help ensure that it can archive a broad, representative sample of medical websites. This policy should allow for the inclusion of 'low-quality' (e.g. medical quackery) sites that may be of interest to future historians.
  4. If JISC is to meet its strategic objectives for the management of JISC electronic records, digital preservation and collection development, then it will also need to consider undertaking some form of web archiving. To achieve this the following approach is recommended:
    • Establish a pilot project to test capture and archiving of JISC records and publications on project websites using the selective approach, as pioneered by the National Library of Australia (see also Recommendation 5).
    • As part of this pilot, the JISC should define selection policies and procedures.
    • This pilot should consider using the NLA's PANDAS software for this archiving activity. This pilot could be run independently or as part of a wider collaborative project with other partners.
    • Work in collaboration with emerging initiatives from the British Library and Wellcome Trust. There are significant synergies with some existing JISC services: websites identified and described by the RDN gateways could be the starting points for any selective subject-based web archiving initiatives in the UK. As of November 2002, the RDN gateways contained descriptions of over 60,500 internet resources available on the web.
  5. Research: the current generation of harvesting technologies has limitations with regard to dealing with 'deep web' sites. This has a particular impact on web archiving approaches based on automatic harvesting. While some research is being carried out on this issue from a web search perspective, there is a need for more collaborative research into this issue from the perspective of web archives.

  6. Collaboration: for both the JISC and the Wellcome Trust there is significant opportunity for partnership on web archiving. For example, there will be opportunities to collaborate on strategic, technical, organizational or content issues.
    For the UK, both should attempt to work closely with the British Library, the other copyright libraries, the Public Record Office, data archives and the e-Science centres that have experience of managing large volumes of data. The focus for this collaborative activity could be within the Digital Preservation Coalition (DPC). On an international level, close co-operation with institutions like the US National Library of Medicine and the Internet Archive will be important.
    As an exemplar of collaboration, it is recommended that JISC and the Wellcome Library should seek to work together and with other partners to create their pilot web archiving services. Not only will this realise economies of scale but, more importantly, it will provide a model demonstrating how collaboration can work in practice.

Last updated on 07/01/09 by Lisa Clifford