End date: 1 February 2004
Funding programme: Digital Preservation and Records Management Programme
JISC theme(s): Information environment, e-Administration
Purpose of study
In March 2002, the Wellcome Trust and JISC awarded a contract to UKOLN to undertake a feasibility study into
web archiving. The aims of this study were to provide the Wellcome Trust
and JISC with:
-
an analysis of existing web archiving arrangements, determining to
what extent they address the needs of the UK research and FE/HE
communities
-
recommendations on how the Wellcome Trust and the JISC could begin to
develop web archiving initiatives to meet the needs of their constituent
communities.
Recognizing that the legal implications of web archiving - copyright, data
protection, defamation, etc. - are a key concern of any would-be web
archivist, UKOLN contracted the Centre for IT and Law at the
University of Bristol to undertake a separate study into the legal issues
of web archiving.
To validate the findings of both studies, an international Advisory Board,
comprising representatives from the British Library, National Library of
Medicine, Library of Congress, Internet Archive and the National Library of
Australia, was established. The reports were circulated to this group in
early December 2002 and, without exception, the findings and recommendations
were fully endorsed. See the two reports.
Why collect and preserve the web?
-
In the short time since its invention, the World Wide Web has become a
vital means of facilitating global communication and an important medium
for scientific communication, publishing, e-commerce, and much else. The
'fluid' nature of the web, however, means that pages or entire
sites frequently change or disappear, often without leaving any trace.
-
In order to help counter this change and decay, web archiving initiatives
are required to help preserve the informational, cultural and evidential
value of the World Wide Web (or particular subsets of it).
Why should the Wellcome Library be interested in this?
-
The Wellcome Library has a particular focus on the history and
understanding of medicine. The web has had a huge impact on the
availability of medical information and has also facilitated new types of
communication between patients and practitioners as well as between these
and other types of organizations. The medical web, therefore, has
potential long-term documentary value for historians of medicine.
-
To date, however, there has been no specific focus on collecting and
preserving medical websites. While the Internet Archive has already collected much
that would be of interest to future historians of medicine, a preliminary
analysis of its current holdings suggests that significant content or
functionality may be missing.
-
There is, therefore, an urgent need for a web archiving initiative that
would have a specific focus on preserving the medical web. The Wellcome
Library is well placed to facilitate this and such an initiative would
nicely complement its existing strategy with regard to preserving the
record of medicine past and present.
Why should JISC be interested in this?
JISC has a number of areas where web archiving initiatives would directly
support its mission. These include:
-
JISC funds a number of development programmes. It, therefore, has an
interest in ensuring that the web-based outputs of these programmes (e.g.
project records, publications) persist and remain available to the
community and to JISC. Many of the websites of projects funded by
previous JISC programmes have already disappeared.
-
JISC also supports national development of digital collections for HE/FE
and the Resource Discovery Network (RDN) services that select and
describe high-quality web resources judged to be of relevance to UK
further and higher education. A web archiving initiative could underpin
this effort by preserving copies of some of these sites, e.g. in case the
original sites change or disappear. The expertise and subject knowledge
of the RDN could in turn assist development of national and special
collections by bodies such as the national libraries or Wellcome Trust.
These collections would be of long-term value to HE/FE institutions.
-
JISC also funds the JANET network used by most UK further and higher
education institutions and, as its operator, UKERNA has overall
responsibility for the ac.uk domain.
Collaboration
-
Collaboration will be the key to any successful attempt to collect and
preserve the web.
-
The web is a global phenomenon. Many attempts are being made to collect
and preserve it on a national or domain level, e.g. by national libraries
and archives. This means that no single initiative (with the
exception of the Internet Archive) can hope for total coverage of the
web. Close collaboration between different web archiving initiatives,
therefore, will be extremely important, e.g. to avoid unnecessary
duplication in coverage or to share in the development of tools,
guidelines, etc.
-
More specifically, there is a need for all organizations involved in web
archiving initiatives in the UK to work together. In particular there is
the opportunity to work closely with the British Library as it develops
its proposals for web archiving as part of the national archive of
publications. Potentially, many different types of organization have an
interest in collecting and preserving aspects of the UK web, while the
British Library (BL), the Public Record Office (PRO) and the British
Broadcasting Corporation (BBC) have already begun to experiment with web
archiving. The Digital Preservation Coalition (DPC) is well placed to
provide the general focus of this collaboration, although there may be a
need for specific communications channels.
Challenges
The web poses preservation challenges for a number of reasons:
-
The web's fast growth rate and 'fluid' characteristics mean
that it is difficult to keep up to date with its content well enough for
humans to decide what is worth preserving.
-
Web technologies are immature and evolving all the time. Increasingly,
web content is delivered from dynamic databases that are extremely
difficult to collect and preserve. Some sites use specific software (e.g.
browser plug-ins) that may not be widely available or use non-standard
features that may not work in all browsers. Other websites may belong to
the part of the web that is characterized by the term 'deep web'
and will be hard to find using most web search services and maybe even
harder to preserve.
-
Unclear responsibilities for preservation - the diverse nature of the web
means that a variety of different organization types are interested in
its preservation. Archives are interested in websites when they may
contain records, libraries when they contain publications or other
resources of interest to their target communities. The global nature of
the web also means that responsibility for its preservation does not fall
neatly into the traditional national categories.
-
Legal issues relating to copyright, the lack of legal deposit mechanisms
(at least in the UK), liability issues related to data protection,
content liability and defamation. These represent serious problems and
are dealt with in a separate report that has been prepared by Andrew
Charlesworth of the University of Bristol.
Approaches
Since the late 1990s, a small number of organizations have begun to develop
approaches to the preservation of the web, or more precisely, well-defined
subsets of it. Those organizations that have developed initiatives include
national libraries and archives, scholarly societies and universities.
Perhaps the most ambitious of these initiatives is the Internet Archive.
This US-based non-profit organization has been collecting broad snapshots
of the web since 1996. In 2001, it began to give public access to its
collections through the 'Wayback Machine'.
Current web archiving initiatives normally take one of three main
approaches:
-
Deposit, whereby web-based documents or 'snapshots' of websites
are transferred into the custody of a repository body, e.g. national
archives or libraries.
-
Automatic harvesting, whereby crawler programs attempt to download parts
of the surface web. This is the approach of the Internet Archive (which
has a broad collection strategy) and some national libraries, e.g.
those of Sweden and Finland.
-
Selection, negotiation and capture, whereby repositories select web
resources for preservation, negotiate their inclusion in co-operation
with website owners and then capture them using software (e.g. for site
replication or mirroring, harvesting, etc.). This is the approach of the
National Library of Australia and the British Library's recent pilot
project.
These approaches are not mutually exclusive. Several web archiving
initiatives (e.g. the Bibliothèque nationale de France and the National
Library of New Zealand) plan to combine the selective and
harvesting-based approaches. The selective approach can deal with some level of
technical complexity in websites, as the capture of each can be
individually planned and associated with migration paths. This may be a
more successful approach with some parts of the so-called 'deep
web'. However, hardware issues aside, selective collection would appear
to be more expensive (per gigabyte archived) than the harvesting approach.
Estimates of the relative costs vary, but the selective approach would
normally be considerably more expensive in terms of staff time and
expertise. This simple assessment, however, ignores factors related to the
cost of preservation over time (whole of life costs), the potential for
automation, and quality issues (i.e., fitness for purpose).
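To make the difference between the harvesting and selective approaches more concrete, the following is a minimal sketch in Python. It is an illustration only and does not describe any software used by the initiatives named above. The in-memory `PAGES` data, the `harvest` function and the `in_ac_uk` scope rule are all hypothetical, and the example URLs are placeholders chosen so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seeds, fetch, in_scope, max_pages=100):
    """Breadth-first harvest starting from the seed URLs.

    fetch(url) returns HTML text, or None if the page is unavailable.
    in_scope(url) decides whether a discovered link may be followed.
    Returns a dict mapping each captured URL to its HTML.
    """
    queue = deque(seeds)
    seen = set(seeds)
    archive = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # the page has moved or disappeared
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment
            link = urldefrag(urljoin(url, href)).url
            if link not in seen and in_scope(link):
                seen.add(link)
                queue.append(link)
    return archive


# Hypothetical in-memory "web" standing in for real HTTP requests.
PAGES = {
    "http://example.ac.uk/":
        '<a href="/project/">Project</a> <a href="http://example.com/">Other</a>',
    "http://example.ac.uk/project/":
        '<a href="report.html">Report</a> <a href="/gone.html">Gone</a>',
    "http://example.ac.uk/project/report.html": "<p>Final report</p>",
    "http://example.com/": "<p>Out of scope</p>",
}


def in_ac_uk(url):
    """Assumed scope rule: follow only links to hosts under .ac.uk."""
    host = urlparse(url).hostname or ""
    return host.endswith(".ac.uk")


captured = harvest(["http://example.ac.uk/"], PAGES.get, in_ac_uk)
print(sorted(captured))
```

Under these assumptions, broad harvesting corresponds to many seeds and a wide scope rule, while a selective approach corresponds to a short seed list of sites whose owners have granted permission and a scope rule that stays within those sites. Content that is only reachable through query forms or databases is never discovered by following links, which is the 'deep web' limitation noted above.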
Recommendations
-
Both JISC and Wellcome Trust should attempt to foster good institutional
practice with regard to the management of websites. For example, they
could consider the development of website management guidelines for
adoption by their user communities or for inclusion in grant conditions,
etc.
-
Until the exact position is clarified by legislation, a selective
approach to web archiving - with appropriate permissions secured - would
be the best way to proceed for both the JISC and the Wellcome Trust.
Other methods of archiving will need to be approached with caution due to
problems with copyright and other legal issues (see also the conclusions
and recommendations in the associated legal study by Andrew
Charlesworth).
-
If the Wellcome Trust is to meet its strategic objectives in extending
its collecting activities into the digital environment, then it will need
to consider undertaking some kind of web archiving activity. To achieve
this the following approach is recommended:
-
Establish a pilot medical web archiving project using the selective
approach, as pioneered by the National Library of Australia (see also
Recommendation 5).
-
This pilot should consider using the NLA's PANDAS software for
this archiving activity. This pilot could be run independently or as
part of a wider collaborative project with other partners.
-
The high-quality medical websites identified in the RDN gateway OMNI
should be considered as the starting point for any medical web
archiving initiative.
-
The Wellcome Library will need to develop a web archiving selection
policy to help ensure that it can archive a broad, representative
sample of medical websites. This policy should allow for the
inclusion of 'low-quality' (e.g. medical quackery) sites that
may be of interest to future historians.
-
If JISC is to meet its strategic objectives for the management of JISC
electronic records, digital preservation and collection development,
then it will also need to consider undertaking some form of web
archiving. To achieve this the following approach is recommended:
-
Establish a pilot project to test capture and archiving of JISC
records and publications on project websites using the selective
approach, as pioneered by the National Library of Australia (see also
Recommendation 5).
-
As part of this pilot, the JISC should define selection policies and
procedures.
-
This pilot should consider using the NLA's PANDAS software for
this archiving activity. This pilot could be run independently or as
part of a wider collaborative project with other partners.
-
Work in collaboration with emerging initiatives from the British
Library and Wellcome Trust. There are significant synergies with some
existing JISC services: websites identified and described by the RDN
gateways could be the starting points for any selective subject-based
web archiving initiatives in the UK. As of November 2002, the RDN
gateways contained descriptions of over 60,500 internet resources
available on the web.
-
Research: the current generation of harvesting technologies has
limitations with regard to dealing with 'deep web' sites. This
has a particular impact on web archiving approaches based on automatic
harvesting. While some research is being carried out on this issue from
a web search perspective, there is a need for more collaborative
research into this issue from the perspective of web archives.
-
Collaboration: for both the JISC and the Wellcome Trust there is
significant opportunity for partnership on web archiving. For example,
there will be opportunities to collaborate on strategic, technical,
organizational or content issues.
For the UK, both should attempt to work closely with the British Library,
the other copyright libraries, the Public Record Office, data archives
and the e-Science centres that have experience of managing large volumes
of data. The focus for this collaborative activity could be within the
Digital Preservation Coalition (DPC). On an international level, close
co-operation with institutions like the US National Library of Medicine
and the Internet Archive will be important.
As an exemplar of collaboration, it is recommended that JISC and the
Wellcome Library should seek to work together and with other partners to
create their pilot web archiving services. Not only will this realize
economies of scale but, more importantly, it will provide a model
demonstrating how collaboration can work in practice.