Historical Corpora

ARCHER: A Representative Corpus of Historical English Registers
Period: 1650-1999
Size: 1.8 million words

ARCHER is a multi-genre corpus of British and American English covering the period 1650-1999, first constructed by Douglas Biber and Edward Finegan in the 1990s. It is managed as an ongoing project by a consortium of participants at fourteen universities in seven countries. ARCHER is available upon request.

Chadwyck-Healey Literature Collection
Period: c.1500-c.1950
Size: c. 138 million words

Julia Schlüter has compiled a manual for using these text collections as a linguistic corpus (in German). A full list of the individual components of the collection can be accessed here.

Penn Parsed Corpora of Historical English
Period: 1150-1914
Size: c. 3.9 million words

The Penn Historical Corpora, including the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English (PPCMBE), are syntactically annotated corpora of prose text samples of English from the indicated time periods. Their syntactic annotation (parsing) permits searching not only for words and word sequences, but also for syntactic structure. The corpora are designed for the use of students and scholars of the history of English, especially the historical syntax of the language. All three components of the Penn Parsed Corpora of Historical English are available upon request.
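To illustrate what searching for syntactic structure means in practice, the following minimal Python sketch parses one bracketed tree and finds clauses that directly contain a direct object. The sample sentence is invented and simplified, and the helper names (parse_tree, subtrees, words, find_with_child) are hypothetical; this is not the query tool distributed with the corpora, and the actual file layout may differ.

```python
import re

def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # consume ")"
            return (label, children)
        word = tokens[pos]                # a bare token is a word (leaf)
        pos += 1
        return word

    return read()

def subtrees(node):
    """Yield every labelled node in the tree, top-down."""
    if isinstance(node, tuple):
        yield node
        for child in node[1]:
            yield from subtrees(child)

def words(node):
    """Return the words dominated by a node, in order."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]

def find_with_child(tree, parent_label, child_label):
    """Return nodes labelled parent_label that immediately dominate a child_label node."""
    return [
        node for node in subtrees(tree)
        if node[0] == parent_label
        and any(isinstance(c, tuple) and c[0] == child_label for c in node[1])
    ]

# Invented, simplified example tree (an assumption, not taken from the corpora).
SAMPLE = "(IP-MAT (NP-SBJ (PRO I)) (VBD saw) (NP-OB1 (D the) (N king)))"
tree = parse_tree(SAMPLE)

# Structural query: matrix clauses that directly contain a direct object.
clauses = find_with_child(tree, "IP-MAT", "NP-OB1")
# Word-level view of every direct object in the tree.
objects = [words(node) for node in subtrees(tree) if node[0] == "NP-OB1"]
```

A plain word search could find "king", but only the structural query can ask whether a noun phrase functions as the object of its clause.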

The Corpus of Late Modern English Texts, version 3.0 (CLMET)
Period: 1710-1920
Size: 34 million words

CLMET3.0 is a principled collection of public domain texts drawn from various online archiving projects. In total, the corpus contains some 34 million words of running text. The corpus covers five major genres: narrative fiction, narrative non-fiction, drama, letters and treatises, in addition to a number of unclassified texts. The corpus is free; see the website for details.

ProQuest Historical Newspapers
Period: 1821-1922
Size: c. 19 million articles

The ProQuest Historical Newspapers collection comprises c. 19 million articles from seven newspapers dating back as far as 1821. The collection is freely accessible via DBIS.

A Corpus of English Dialogues (CED)
Period: 1560-1760
Size: 1.2 million words

Released in Spring 2006, A Corpus of English Dialogues 1560-1760 (CED) is a 1.2-million-word computerized corpus of Early Modern English speech-related texts. The CED is part of the research project "Exploring spoken interaction of the Early Modern English period (1560-1760)" (see e.g. Culpeper and Kytö 1997, 2000, and forthcoming), and was compiled by Merja Kytö and Jonathan Culpeper, in collaboration with Terry Walker and Dawn Archer, at Uppsala and Lancaster Universities. The CED is available upon request.

A Linguistic Atlas of Early Middle English (LAEME)
Period: 1150-1325
Size: 650,000 words

Complete texts (or large samples of very long texts) have been diplomatically transcribed from original manuscripts or facsimiles. Each word and each derivational and inflectional morpheme in the text is lexico-grammatically tagged. The present LAEME corpus of tagged texts (CTT) consists of 650,000 words tagged at this unprecedented level of detail, enabling investigations at all linguistic levels. The CTT is searchable on the website under LAEME TASKS: TAGGED TEXTS. A text dictionary is derived from each tagged text; it lists all the linguistic material in that text, arranged by lexico-grammatical tag. The text dictionaries are searchable under LAEME TASKS: TEXT DICTIONARIES. The full tagged texts and text dictionaries are also accessible from the individual entries in the Index of Sources, to be found on the website under Auxiliary Data Sets. Considerable editorial and textual commentary accompanies each tagged text. The corpus has provided the source material for all the related publications listed in the LAEME bibliography (to be found on the website under Auxiliary Data Sets).
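The idea of a text dictionary, that is, the material of a tagged text regrouped by tag, can be sketched in a few lines of Python. The tags and spellings below are invented placeholders and do not reproduce the LAEME tagging scheme or file format; build_text_dictionary is a hypothetical helper.

```python
from collections import defaultdict

def build_text_dictionary(tagged_tokens):
    """Group attested forms by tag and count how often each form occurs."""
    grouped = defaultdict(lambda: defaultdict(int))
    for tag, form in tagged_tokens:
        grouped[tag][form] += 1
    # Sort tags and forms so the result reads like a dictionary listing.
    return {tag: dict(sorted(forms.items())) for tag, forms in sorted(grouped.items())}

# Invented (tag, form) pairs in running-text order; not real LAEME data.
TAGGED = [
    ("king/noun", "king"),
    ("be/verb-pres-3sg", "is"),
    ("king/noun", "kyng"),
    ("king/noun", "king"),
]

text_dictionary = build_text_dictionary(TAGGED)
for tag, forms in text_dictionary.items():
    print(tag, forms)
```

Grouping by tag rather than by spelling is what brings variant forms of the same item together in one entry.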

Oxford Text Archive (OTA)
The Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. Several corpora from the OTA are available upon request:

  • Complete Corpus of Old English
    The Old English electronic corpus is a complete record of surviving Old English except for some variant manuscripts of individual texts. A list of included texts can be found here.
  • Corpus of Biblical Texts in Scots
  • Corpus of Early English Correspondence Sampler
    A manual can be found here.
  • Corpus of Late Modern English Prose
  • Dictionary of Old English Corpus in Electronic Form
  • The English Language of the North-West in the late Modern English period
  • Helsinki Corpus of English Texts
  • Older Scottish Texts (Edinburgh DOST Corpus)
  • The Helsinki Corpus of Older Scots
  • York-Helsinki Parsed Corpus of Old English Poetry
  • York-Toronto-Helsinki Parsed Corpus of Old English Prose