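A corpus (plural corpora) is a large collection of texts, written or transcribed from speech, that is used in computational linguistics to analyze how language is used. Corpora are most often analyzed using a concordancer, a program that lists every occurrence of a word or phrase together with its surrounding context.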
Types of Corpora
A corpus can be one or more of the following:
- general texts
- texts on a specific subject or in a specific genre, e.g. scientific papers, Shakespeare’s plays, or children’s essays
- texts from a specific variety of English, e.g. American English or British English
Analysis of a corpus brings to light patterns of language use within the group of texts it covers. For example, it may show that scientific papers use the passive voice far more often than newspapers do, or that certain words are used only among certain groups of speakers.
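A comparison of this kind can be sketched in a few lines of Python. The two sample texts below are invented for illustration and are far too small to support any real conclusion, and counting the word were is only a crude stand-in for detecting the passive voice. The function name relative_frequency is likewise an assumption made for this sketch.

```python
import re
from collections import Counter

def relative_frequency(text, word, per=1000):
    """Return occurrences of `word` per `per` running words of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return Counter(tokens)[word.lower()] * per / len(tokens)

# Two tiny made-up samples standing in for much larger sub-corpora.
science_sample = ("The samples were heated and the results were recorded. "
                  "The data were analysed.")
news_sample = ("The minister announced new plans. "
               "Critics attacked the plans and demanded changes.")

science_rate = relative_frequency(science_sample, "were")
news_rate = relative_frequency(news_sample, "were")

print(f"science: {science_rate:.1f} per 1000 words")
print(f"news:    {news_rate:.1f} per 1000 words")
```

Normalizing by text length (here, per 1000 words) is what makes counts from corpora of different sizes comparable.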
Methods of Analysis
Corpora are generally searched and analyzed using computers, which can search and compare millions of text strings almost instantly. However, computer analysis has its drawbacks. For example, take these two sentences:
Time flies like an arrow.
Fruit flies like a banana.
Whilst a human can easily distinguish between the two uses of the words flies and like, a computer comparing raw text strings cannot. To get around this, corpora are often tagged or annotated. Typically this involves human operators assigning part-of-speech tags to words before they are processed and compared by the computer, thus:
Time [noun] flies [verb] like [preposition] an [determiner] arrow [noun].
Fruit [noun] flies [noun] like [verb] a [determiner] banana [noun].
This allows, for example, a concordancer to analyze all uses of like as a verb as opposed to like as a preposition.
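As a rough illustration of how tags make such a search possible, the sketch below stores each sentence as a list of (word, tag) pairs and retrieves only the sentences where a word carries a given tag. The tag labels are illustrative rather than taken from any standard tag set; here like in the first sentence is tagged as a preposition and Fruit as a noun. The function name find_tagged is an assumption made for this example.

```python
# Each sentence is a list of (word, tag) pairs, as an annotator might produce.
tagged_corpus = [
    [("Time", "noun"), ("flies", "verb"), ("like", "preposition"),
     ("an", "determiner"), ("arrow", "noun")],
    [("Fruit", "noun"), ("flies", "noun"), ("like", "verb"),
     ("a", "determiner"), ("banana", "noun")],
]

def find_tagged(corpus, word, tag):
    """Return (sentence_index, sentence_text) wherever `word` carries `tag`."""
    hits = []
    for index, sentence in enumerate(corpus):
        if any(w.lower() == word.lower() and t == tag for w, t in sentence):
            hits.append((index, " ".join(w for w, _ in sentence)))
    return hits

verb_hits = find_tagged(tagged_corpus, "like", "verb")
print(verb_hits)  # [(1, 'Fruit flies like a banana')]
```

A plain string search for like would return both sentences; filtering on the tag returns only the one where it is a verb.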
In the Classroom
Students can use corpora in the classroom, for example through a concordancer, under the guidance of a teacher. This allows them to see how language is used by native speakers in everyday situations. A student may ask you, as the teacher, a question like “Do we say the team is or the team are?” If this happens and you have access to the internet, you can have your students find out for themselves and work out which is more appropriate and when.
Incidentally, an online search of the BNC (British National Corpus) shows 109 occurrences of the team is and just 37 occurrences of the team are. Without going into further analysis, this tells your students that the team is is roughly three times as common as the team are in that corpus, although both forms are in use.
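The arithmetic behind that comparison is simple enough to check with students. The sketch below uses only the two counts quoted above; the variable names are chosen for this example.

```python
# Occurrence counts quoted above for the two phrases in the BNC.
counts = {"the team is": 109, "the team are": 37}

total = sum(counts.values())
ratio = counts["the team is"] / counts["the team are"]
shares = {phrase: n / total for phrase, n in counts.items()}

for phrase, share in shares.items():
    print(f"{phrase}: {share:.0%}")   # the team is: 75%, the team are: 25%
print(f"ratio: {ratio:.2f}")          # ratio: 2.95
```

A ratio of about 2.95 describes how often each form appears in this one corpus. It does not by itself say that either form is incorrect.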
The British National Corpus is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late twentieth century from a wide variety of genres with the intention that it be a representative sample of spoken and written British English of that time.
The 10-million-word spoken corpus has two parts: a demographic part, containing transcriptions of spontaneous natural conversations made by members of the public, and a context-governed part, containing transcriptions of recordings made at specific types of meetings and events. All the original recordings transcribed for inclusion in the BNC have been deposited at the National Sound Archive of the British Library.
The corpus is marked up following the recommendations of the Text Encoding Initiative and includes full linguistic annotation and contextual information. The most recent edition, from March 2007, is distributed in XML format along with the XAIRA software. It is freely available under a license and is very widely distributed.
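To give a feel for what working with XML-encoded annotation looks like, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The fragment is hand-written for illustration, and the element and attribute names (s, w, c, hw, pos) and tag values are assumptions loosely modeled on TEI-style word markup; the corpus documentation should be consulted for the actual scheme.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written illustrative fragment, not copied from any corpus file.
sample = """
<s n="1">
  <w pos="SUBST" hw="fruit">Fruit </w>
  <w pos="SUBST" hw="fly">flies </w>
  <w pos="VERB" hw="like">like </w>
  <w pos="ART" hw="a">a </w>
  <w pos="SUBST" hw="banana">banana</w>
  <c>.</c>
</s>
"""

root = ET.fromstring(sample.strip())

# Collect (surface form, headword, part of speech) for every word element.
words = [(w.text.strip(), w.get("hw"), w.get("pos")) for w in root.iter("w")]
pos_counts = Counter(pos for _, _, pos in words)

for form, headword, pos in words:
    print(f"{form:<8}{headword:<8}{pos}")
```

Because the part of speech and headword travel with each word, a search tool can match on them directly instead of on the raw spelling.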
The BNC can be searched online for specific words or phrases.
The American National Corpus is a paid, membership-based collaboration with the aim of creating an electronic text corpus of American English. The collection will include text and transcripts of spoken data produced from 1990 onward, with the goal of a 100-million-word corpus.
ANC Consortium members include publishers, software companies, and academic members. Consortium members have exclusive access throughout the development period and for five years after the first installment of the corpus. The First Release of the American National Corpus (ANC) was made available in mid-fall 2003. It includes approximately 11 million words of American English, covering written and spoken data and a variety of text types, annotated for part of speech and lemma. The corpus is provided in XML format conformant to the XML Corpus Encoding Standard (XCES).