The aim of the project is to develop a technology to automate the process of depositing large amounts of citation data in an institutional database. The technology will rely on state-of-the-art methods for Information Retrieval and Natural Language Processing to discover citation data of university staff on the Web. Carried out jointly by ILP and LIS of the University of Wolverhampton, the project is expected to stimulate growth of the "Wolverhampton Intellectual Repository and E-theses" and potentially of similar repositories in other organisations.

AIR (Automated Archiving for an Institutional Repository)


Start date: 1 October 2007

End date: 31 March 2009

Funding programme: Repositories and Preservation programme

Project website: http://clg.wlv.ac.uk/projects/AIR/

JISC theme(s): Information environment, e-Resources, e-Research

Repositories Enhancement Project

Manual deposition of citation data in institutional repositories is an extremely time- and resource-intensive process. These costs act as a bottleneck on the fast uptake of large repositories. The challenge has long been recognised, and a number of research projects have attempted to develop technology for unifying disjointed repositories and for managing and re-using them efficiently. Nonetheless, these technologies still fail to address the main problem in the construction and upkeep of bibliographic repositories: the discovery of new citation data in large text collections.

The project will develop an information extraction system allowing for speedy discovery and extraction of bibliographical data on an institutional website. The system will be integrated with the WIRE ("Wolverhampton Intellectual Repository and E-theses") repository, but it will be designed to facilitate easy adoption of the software by other institutions that use different data encoding standards. The project is carried out jointly by the Research Institute of Information and Language Processing and the University's Learning Information Services.

Aims and objectives 

The project will investigate the degree to which the population of institutional repositories can be automated, in order to maximise the speed of human-supervised compilation of the data while maintaining its high quality. Employing state-of-the-art methods for Natural Language Processing and Information Retrieval, the project will design a software architecture that helps a user to:

  • locate relevant documents on the institutional website
  • extract bibliographical entries from them
  • extract information from each entry and tag it with Dublin Core metadata tags such as Author, Title, and Year
  • export the extracted data into Open Repository or DSpace workflow
  • facilitate checking of copyright issues using the SHERPA/RoMEO database
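
As a rough illustration of the tagging step listed above, the sketch below splits one bibliographical entry into Dublin Core style fields. It is not the project's method: a single regular expression stands in for the Natural Language Processing techniques the project plans to use, and it covers only one assumed citation layout. The names `CITATION_RE` and `tag_entry`, the `dc:` keys chosen, and the sample citation are all hypothetical.

```python
import re
from typing import Optional

# Assumed layout (hypothetical): "Authors (Year). Title. Rest of entry."
CITATION_RE = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.\s*(?P<title>[^.]+)\."
)


def tag_entry(entry: str) -> Optional[dict]:
    """Return Dublin Core style fields for one entry, or None if it does not fit."""
    match = CITATION_RE.match(entry.strip())
    if match is None:
        return None
    # Authors are assumed to be separated by "and" or by semicolons.
    authors = re.split(r"\s+and\s+|;\s*", match.group("authors"))
    return {
        "dc:creator": [a.strip() for a in authors if a.strip()],
        "dc:title": match.group("title").strip(),
        "dc:date": match.group("year"),
    }


if __name__ == "__main__":
    # Invented citation, used only to exercise the function.
    sample = (
        "Smith, J. and Jones, A. (2008). Automated archiving for "
        "repositories. Journal of Examples, 1(2)."
    )
    print(tag_entry(sample))
```

Entries that do not match the assumed layout return `None`, which in a human-supervised workflow would correspond to passing the entry on for manual checking.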

Project methodology

The ILP research staff will be responsible for the research and development activities on the project, which will be concerned with methods to locate relevant web documents on an institutional server and to extract and verify bibliographical data from them. These methods will be implemented in three major services of the system: a web crawler, an information extraction component, and the DSpace interfacing component.
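
To make the first of these services more concrete, here is a minimal sketch of one step a crawler restricted to an institutional server might perform: collecting the links on a page and keeping only those on the institution's own host. It uses only the Python standard library and works on an HTML string, so no network access is involved. The class `LinkCollector`, the function `in_scope_links`, and the example host names are assumptions made for illustration; the project's actual crawler design is not described here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def in_scope_links(html: str, page_url: str, allowed_host: str) -> list:
    """Return the links found in `html` that point to `allowed_host`."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    return [u for u in collector.links if urlparse(u).hostname == allowed_host]


if __name__ == "__main__":
    # Hypothetical page content and host names.
    page = (
        '<a href="/staff/pubs.html">Publications</a> '
        '<a href="http://example.org/x">External</a>'
    )
    print(in_scope_links(page, "http://www.example.ac.uk/index.html",
                         "www.example.ac.uk"))
```

A focused crawler would additionally rank or filter the in-scope pages by how likely they are to contain publication lists; that relevance step is left out of this sketch.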

The LIS staff will help specify user needs for customisation, oversee trialling within UW (through to testing automated population of the existing WIRE repository), contribute to the Project Steering Group and liaise with the repository community through SHERPA or the United Kingdom Council of Research Repositories (UKCORR).

The project will liaise with BioMed Central, who supply the hosted DSpace-based Open Repository system on which WIRE runs.

Anticipated outputs and outcomes

The project is expected to bring considerable benefits to the University. Specifically, it will:

  • stimulate significant growth in content in WIRE
  • raise the profile of the University's research by increasing the likelihood of citation
  • provide opportunities for ILP researchers to gain experience in knowledge transfer
  • free up LIS staff time by introducing a mediated deposition process
  • develop the relationship between LIS and ILP, which may lead to further co-operation on advanced information access technologies

Within the wider repository community the project will:

  • produce software that can be used with Open Repository and that can be customised for use with DSpace or EPrints
  • test the concept of partial automation of population and deposition
  • establish the limits of automation
  • establish good practice with regard to automation

The main concrete outcome will be a software architecture integrated with the WIRE repository, but easily customisable for other repositories that use different data encoding standards.

Technology / Standards used

Natural Language Processing, Information Extraction, Information Retrieval, Text Filtering, Focused Web Crawling

Lead Institution

University of Wolverhampton

Project staff

Project Manager
  • Prof. Dr. Ruslan Mitkov, Professor of Computational Linguistics and Language Engineering, and Director of the Research Institute of Information and Language Processing, University of Wolverhampton, Stafford St., Wolverhampton, WV1 1SB; tel. 01902 32 24 71; fax. 01902 32 35 43; email: R.Mitkov@wlv.ac.uk
Project Team
  • Dr. Viktor Pekar; University of Wolverhampton, ILP; tel. 01902 32 22 17; fax. 01902 32 35 43; email: V.Pekar@wlv.ac.uk
  • Natalia Ponomareva; University of Wolverhampton, ILP; tel. 01902 32 22 17; fax. 01902 32 35 43; email: Nata.Ponomareva@wlv.ac.uk
  • Frances Machell; University of Wolverhampton, LIS; tel. 01902 32 19 65; email: F.Machell@wlv.ac.uk
  • Alison Robinson; University of Wolverhampton, LIS; tel. 01902 32 32 29; email: A.Robinson@wlv.ac.uk
  • John Dowd; University of Wolverhampton, LIS; tel. 01902 32 26 08; email: John.Dowd@wlv.ac.uk

Last updated on 09/01/09 by Lisa Clifford