Unit 2: Using Online Corpora

This unit:

  • Introduces four easy-to-use online corpora.
  • Gives practice in using available online text analysis tools, e.g. word searches and concordances.
  • Looks at intertextuality within and beyond literary texts.

The term corpus, the plural of which is corpora, has in recent years come to signify a large collection of digitised texts, literary and non-literary. These texts can be searched automatically and the results displayed in ways that cast light on the use of language in general and in specific contexts.

Scholars of literature may find different points of interest in corpora, for example:

  • Patterns of language found in a literary text can be compared against those patterns found more generally, to determine if certain aspects of the literary language are unusually frequent or infrequent.
  • The interaction between literary and non-literary language might be explored by searching a large general corpus for incidences of allusion or quotation.
  • Specific literary corpora - of fiction, say, or poetry - might be isolated and searched, to determine how a particular word or phrase is used in a text or set of texts.

Here we look at a few simple activities that introduce you to freely-available online corpora, and some of the basic search functions that can give insight into the way language works.

2.1 Four Online Corpora

This unit makes use of four currently-available online corpora:

BYU-BNC is a version of the British National Corpus (BNC). It is a stable corpus of 100 million words of British English, collected in the 1980s and early 1990s.

The Corpus of American English (CAE) is a changing corpus, currently amounting to 360+ million words of American English, collected between the 1990s and the present day. It is regularly updated.

The TIME corpus is a corpus of texts from the American magazine Time, from its founding in 1923 to 2006. It consists of 100 million words of American journalism.

The Scottish Corpus of Texts and Speech (SCOTS) is a corpus of 4 million words, 300 000 of which consist of speech, recorded and transcribed. Though it is relatively small compared to the others, it has wider dialectal variation and includes a number of extracts of prose fiction and poems in Scots as well as English.

2.2 Word searches

The structure of each corpus is different. BNC, CAE and SCOTS are all general corpora, that is, they sample from a variety of literary and non-literary genres. However, even the CAE corpus, with over 360 million words, is a tiny drop in the ocean of language usage in the world today. The SCOTS corpus contains only a fraction of the CAE or BNC word totals, but is richer in some colloquial or geographically specific usages. The TIME corpus is extensive, but it is confined to written American journalistic texts.

To get a sense of how a word or phrase you find in a literary text is related to more general usage, then, you have to choose your corpus - or compare amongst them. To begin, let us consider how expressions from two 19th-century authors, an American and a British writer, are used - if at all - across the four contemporary corpora.

 All kings is mostly rapscallions.
 (Mark Twain, The Adventures of Huckleberry Finn. 1884, Ch. 23)

 One has not great hopes from Birmingham. I always say there is something direful in the sound.
 (Jane Austen, Emma. 1816, Ch. 36)

Searching the BNC

1. Go to http://corpus.byu.edu/bnc
2. In the Display section, choose Chart
3. Enter rapscallion as the Word(s) to search
4. Leave all the other sections as they are, and click Search
5. Note your results. Pay attention to:

  • The types of text that the word appears in (Spoken, Fiction, Newspaper, Academic, Miscellaneous)
  • The number of occurrences in each text type
  • The total number of words in the documents searched in each text type.

6. Repeat the process for direful

Searching the CAE

1. Go to http://www.americancorpus.org/
2. In the Display section, choose Chart
3. Enter rapscallion as the Word(s) to search
4. Leave all the other sections as they are, and click Search
5. Note your results. Pay attention to:

  • The types of text that the word appears in (Spoken, Fiction, Newspaper, Academic, Miscellaneous)
  • The number of occurrences in each text type
  • The total number of words in the documents searched in each text type.

6. Repeat the process for direful

Searching the TIME corpus

1. Go to http://corpus.byu.edu/time
2. In the Display section, choose Chart
3. Enter rapscallion as the Word(s) to search
4. Leave all the other sections as they are, and click Search
5. Note your results. All the Time texts are from American magazine journalism. Pay attention to:

  • The frequency of occurrence per decade from the 1920s on
  • The total number of words in the documents searched in each decade

6. Repeat the process for direful

Searching the SCOTS corpus

1. Go to http://www.scottishcorpus.ac.uk
2. Click on Advanced Search
3. Click General
4. Click Word Search and Word/Phrase concordance
5. Type rapscallion into the Selection box
6. Scroll down the page to see frequency information and the key word displayed in a concordance. The map facility will also show geographical information about the birthplace or residence of the people who use this word in the corpus documents.
7. Repeat the process for direful.


There has to be a fair measure of caution involved in using contemporary online corpora to investigate the literary language of an earlier period. While the corpora are a treasure-trove of linguistic information, the earliest texts in the corpora date from the 1920s. Patterns of language usage might well have changed in the century or so between the novels being published and the data used to build the corpora.

Even so, comparing the usage of 19th-century novels with 20th and 21st century corpus data can give us a sense of what now seems interesting or unusual - or interesting and expected - about their language. But we do have to be careful about projecting these findings back to periods before the corpora were compiled.


Rapscallion occurs in several places in Mark Twain's Huckleberry Finn, in both narrative and dialogue. (Searchable electronic versions of the novel are available online.)

The quotation cited above occurs in an exchange between Huck and his friend Jim, about a couple of actors and con-men they have befriended, who have been touring the country in a 'tragedy' purporting to feature the famous Shakespearean actor Edmund Kean:

Them rapscallions took in four hundred and sixty-five dollars in that three nights. I never see money hauled in by the wagon-load like that before. By and by, when they was asleep and snoring, Jim says:
"Don't it s'prise you de way dem kings carries on, Huck?"
"No," I says, "it don't."
"Why don't it, Huck?"
"Well, it don't, because it's in the breed. I reckon they're all alike,"
"But, Huck, dese kings o' ourn is reglar rapscallions; dat's jist what dey is; dey's reglar rapscallions."
"Well, that's what I'm a-saying; all kings is mostly rapscallions, as fur as I can make out."

The term features several more times in Huckleberry Finn. However, when we look at the large general corpora of British and American English we can see that - at least now - this is a relatively rare word, occurring 3 times in the 100 million words of the BNC, 12 times in the 360 million words of the CAE, 23 times in the 100 million word TIME corpus and only once in the 4 million word SCOTS corpus.
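Because the four corpora differ so much in size, raw counts are easier to compare once they are converted to a rate per million words. The following is a minimal sketch of that arithmetic in Python, using only the counts and approximate corpus sizes quoted above; the names COUNTS, RATES and per_million are illustrative and are not part of any corpus interface.

```python
# Raw hits for "rapscallion" and approximate corpus sizes, as quoted in the text above.
COUNTS = {
    "BNC": (3, 100_000_000),
    "CAE": (12, 360_000_000),
    "TIME": (23, 100_000_000),
    "SCOTS": (1, 4_000_000),
}


def per_million(hits: int, corpus_size: int) -> float:
    """Return the frequency of a word per million words of corpus text."""
    return hits / corpus_size * 1_000_000


# Normalised rate for each corpus, keyed by corpus name.
RATES = {name: per_million(hits, size) for name, (hits, size) in COUNTS.items()}

if __name__ == "__main__":
    # Print the corpora from lowest to highest normalised rate.
    for name, rate in sorted(RATES.items(), key=lambda item: item[1]):
        print(f"{name:6s} {rate:.3f} per million words")
```

On these figures the rates are 0.03 per million for the BNC, about 0.033 for the CAE, 0.23 for TIME and 0.25 for SCOTS. The SCOTS figure rests on a single occurrence in a small corpus, so it should be treated with particular caution.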

Furthermore, we can see that rapscallion tends to occur in literary contexts: 5 of the 12 occurrences in the CAE are in prose fiction; the only occurrence in SCOTS is in a passage of fiction. If the TIME corpus is to be believed, there is a general decline in usage of the word in magazine journalism: it is most frequent in the magazine in the 1920s, decreases in frequency decade by decade until the 1960s, but stages a limited revival, peaking again in the 1990s but at a lower rate than in the 1920s.

The occurrence of the word four times in a brief stretch of Huckleberry Finn - in the narrative and in the dialogue of the two main characters, in quick succession - is one of the markers of the distinctively literary language of the novel and its 'speakers'. Although in many respects the dialogue of Jim and Huck is explicitly marked as colloquial in pronunciation and grammar, in this respect at least they also 'talk like books'.


The word direful occurs only once in Jane Austen's Emma, in Chapter 36, when one of the characters, Mrs Elton, is gossiping about her brother's neighbours, the Tupmans, with another character, Mr Weston:

"A year and a half is the very utmost that they can have lived at West Hall; and how they got their fortune nobody knows. They came from Birmingham, which is not a place to promise much, you know, Mr. Weston. One has not great hopes from Birmingham. I always say there is something direful in the sound: but nothing more is positively known of the Tupmans, though a good many things I assure you are suspected; and yet by their manners they evidently think themselves equal even to my brother, Mr. Suckling, who happens to be one of their nearest neighbours."

Direful is an even rarer word than rapscallion. It occurs neither in the SCOTS corpus nor even in the BNC. It does occur, however, in the CAE and TIME corpora: 5 times in the CAE (4 of which are in fictional prose) and 10 times in the TIME corpus of magazine journalism (all between 1920 and 1950). This, then, is a term that seems to be increasingly restricted to literary genres.

Unlike rapscallion, direful is not used often enough or prominently enough to indicate a character's general style of speech. Yet we might speculate on today's evidence that the word was a relatively unusual one even in Austen's time, and therefore its use by Mrs Elton catches the reader's attention and so accentuates the comedy in her arbitrary and prejudiced characterisation of Birmingham.

Further activities

Look at a poem, novel or play and select some apparently unusual words to do searches on. Questions to ask yourself include:

  • How unusual are the words?
  • Are they specific to certain types of text?
  • Can you tell if the use of the words is increasing, decreasing or staying stable over time?

With that knowledge of general patterns of usage, what can you say about the specific use in the poem, novel or play?

2.3 Limiting searches and using concordances

It is possible to limit searches to that part of the corpus that deals with fictional prose (BNC, CAE and SCOTS) and plays, songs and ballads (SCOTS).

Limiting searches with BNC and CAE

1. Go to the homepage of the BNC or CAE
2. Under Section click on Fiction.

Limiting searches with SCOTS

1. Go to http://www.scottishcorpus.ac.uk
2. Click on Advanced Search
3. Click on Written
4. Click on Text Type
5. Choose either Poem/song/ballad, Prose: fiction or Short story. Note that by clicking on two or three of these, you can include all documents that fall under these categories.

By limiting searches to these texts, you can investigate, for example, clusters of words in literature that express particular concepts, attitudes or values. By performing a wild-card search, using asterisks in place of word-endings (fear*, want*, desir*, etc.), it is possible to explore how writers express emotions and desires.
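If you have a plain-text copy of a literary work on your own computer, a wild-card search of this kind can be approximated with a regular expression. The Python sketch below is an illustration only: it is not how the BNC, CAE or SCOTS search engines are implemented, the function names are hypothetical, and the sample sentence is invented purely to exercise the code.

```python
import re
from collections import Counter


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Turn a search term such as 'desir*' into a whole-word regular expression."""
    # Escape the term, then let each asterisk stand for any run of word characters.
    escaped = re.escape(pattern).replace(r"\*", r"\w*")
    return re.compile(rf"\b{escaped}\b", re.IGNORECASE)


def count_forms(pattern: str, text: str) -> Counter:
    """Count each distinct word form in `text` that matches the wild-card pattern."""
    regex = wildcard_to_regex(pattern)
    return Counter(match.group(0).lower() for match in regex.finditer(text))


# Invented sample sentence, not taken from any corpus.
SAMPLE = "She desired nothing; her desires were few, and desire itself she feared."

if __name__ == "__main__":
    for term in ("desir*", "fear*"):
        print(term, dict(count_forms(term, SAMPLE)))
```

Counting the separate forms (desire, desires, desired and so on) rather than just the total number of hits makes it easier to see which form of a word a writer favours.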



Using concordances

Another way of exploring literary texts is by using concordancing programs. These usually show your search item (your 'key word/phrase') in context. Some concordancers, like that found in the SCOTS Advanced Search suite, allow you to order items found before and after your key word/phrase alphabetically.

1. Go to the SCOTS homepage
2. Click on Advanced Search
3. Click on Written
4. Click on Text Type
5. Choose Prose: fiction and Short story.
6. Click on General, Word search and Word/phrase concordance
7. Type in desir*
8. Scroll down until you see the concordance list. At the top of the list, click on the numbers to the left and right of the node, to order and re-order the elements before and after the key word or phrase.
9. Note down the objects of desire, the agents, and any recurrent language that appears to accompany the act of desiring.
10. Repeat for want*, fear*, wish*
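The idea behind a sortable concordance can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SCOTS concordancer itself: the function names are hypothetical, the text is split into words very crudely, and the sample text is invented.

```python
import re


def concordance(text: str, pattern: str, width: int = 4):
    """Return (left, node, right) word lists for each word matching a wild-card pattern."""
    words = re.findall(r"[\w']+", text)
    regex = re.compile(re.escape(pattern).replace(r"\*", r"\w*"), re.IGNORECASE)
    lines = []
    for i, word in enumerate(words):
        if regex.fullmatch(word):
            left = words[max(0, i - width):i]      # up to `width` words before the node
            right = words[i + 1:i + 1 + width]     # up to `width` words after the node
            lines.append((left, word, right))
    return lines


def sort_by_left(lines, position: int = 1):
    """Order lines alphabetically by the word `position` places to the left of the node."""
    def key(line):
        left = line[0]
        return left[-position].lower() if len(left) >= position else ""
    return sorted(lines, key=key)


def sort_by_right(lines, position: int = 1):
    """Order lines alphabetically by the word `position` places to the right of the node."""
    def key(line):
        right = line[2]
        return right[position - 1].lower() if len(right) >= position else ""
    return sorted(lines, key=key)


# Invented sample text, not taken from the SCOTS corpus.
SAMPLE = "I have no desire to conquer. She desired the light. They feel a strong desire for rest."

if __name__ == "__main__":
    for left, node, right in sort_by_right(concordance(SAMPLE, "desir*", width=3)):
        print(f"{' '.join(left):>25}  [{node}]  {' '.join(right)}")
```

Sorting on the word to the right of the node groups together the things that are desired; sorting on the word to the left groups the modifiers and verbs that lead up to it.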


By limiting the search to Poem/song/ballad it is possible to identify a sub-corpus in SCOTS of 261 documents of just under 200 000 words. By then searching on desir* it is possible to see how the words desire(s), desirous, desiring, desired function in this limited set of texts - and who uses them.

A search for desir* identifies only 8 of the 261 documents in this sub-corpus, and of these 6 are written by Sheena Blackhall, a poet and novelist from north-eastern Scotland. Sheena Blackhall is a major contributor to the SCOTS corpus, supplying 39 of the 261 documents in the category Poem/song/ballad; and so, some caution must be used when relating the use of this term more generally to her work and that of others. (If you want to see the whole text, click on its title.)

Even so, a glance at the 8 contexts in which desir* occurs hints at the ways in which this word is used in poetry. In three of the eight instances, the use of the key word is in the negative:

 I have no desire to conquer
 I…have no desire to move
 Mags had nurtured not the slightest desire…

Otherwise desire can be characterised as derk (dark) or, metaphorically, as a tangle: the loop-to-loop o desire. Desire can be for something elemental (It was the licht he desired) or sexual (I have hopes and desires).

The poems, songs and ballads of the SCOTS corpus can be compared with other, larger corpora.

1. Go to http://corpus.byu.edu/bnc
2. In the Display section, choose List
3. Enter desir* as the Word(s) to search
4. Under Section 1 choose Fiction
5. Leave all the other sections as they are, and click Search
6. Note your results. Pay attention to:

  • The number of occurrences of each form of the word (desire, desires, desired, desiring, etc.)
  • The total number of words in the documents searched in each text type.

Note that desire (as a verb and as a noun) is the most frequently used form of this term. We can explore this yet further by looking (a) at the noun objects that follow the verb desired, (b) at the kinds of verb that precede the noun desire, and (c) at the kinds of adjective that characterise desire in fiction.

1. Go to http://corpus.byu.edu/bnc
2. In the Display section, choose List
3. Enter desired [n*] as the Word(s) to search
4. Under Section 1 choose Fiction
5. Leave all the other sections as they are, and click Search
6. Note your results. Pay attention to:

  • The kinds of object that can be desired

7. Go again to http://corpus.byu.edu/bnc
8. In the Display section, choose List
9. Enter [v*] desire as the Word(s) to search
10. Under Section 1 choose Fiction
11. Leave all the other sections as they are, and click Search
12. Note your results. Pay attention to:

  • The kinds of verbs that collocate with desire

13. Go again to http://corpus.byu.edu/bnc
14. In the Display section, choose List
15. Enter [aj*] desire as the Word(s) to search
16. Under Section 1 choose Fiction
17. Leave all the other sections as they are, and click Search
18. Note your results. Pay attention to:

  • The kinds of adjectives that collocate with desire

In fiction, as opposed to poetry, it seems that what is most desired is an effect, result, behaviour or end rather than a person or aspect of nature. The verbs that collocate with desire tend to be modal auxiliaries (could, would, might, may, can) while the lexical verb most associated with desire is feel, with express some way behind. (Note that an anomaly in the results appears as a result of the frequency of references to A Streetcar Named Desire!) More interesting, perhaps, are the adjective collocations with the noun desire, the 10 most frequent of which are sexual, strong, overwhelming, genuine, great, natural, burning, real, homosexual and urgent, with female at the brink of the top ten. Here we see most clearly the aspects of desire that impact on literary texts: its force, its sexual nature, and a concern for whether or not it is 'real' or 'genuine'.

If we want to look at the words that most frequently collocate with desire in a general corpus and compare them with, say, wish, we can do the following with the 360+ million word general Corpus of American English. We can resist the temptation to limit the searches to fictional texts and look at the language more generally.

1. Go again to http://www.americancorpus.org/
2. In the Display section, choose Compare words. Two search boxes will then appear in the Display section.
3. Enter desire and wish as the Word(s) to search, one in each box
4. Set the Context to 1 and 0 to look only at the words immediately preceding the key words
5. Leave the Section as Ignore and click Search
6. Note your results. Pay attention to:

  • The kinds of adjectives and nouns that collocate with desire
  • The kinds of adjectives and nouns that collocate with wish

A consideration of the top collocates for desire confirms the sense of a sexual force: erotic, burning, intense, overwhelming, natural, insatiable, male, female and homosexual all feature, although the collocate sexual itself is relegated to number 83 in the general corpus. In comparison, wish tends to be something associated with a particular occasion, whether a dying wish or a Christmas, holiday or birthday wish.  Particularity is again associated with one's dearest wish, while a death wish refers to a more general condition. None of these is, of course, related directly to sexuality, a Freudian opposition of the death wish to Eros notwithstanding.
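A comparison of this kind, with the context set to one word to the left and none to the right, amounts to counting the word that immediately precedes each key word. The Python sketch below shows the principle on an invented sample text. It is not the method used by the CAE interface; the function name is hypothetical; and, unlike a real corpus tool, it ignores punctuation and so will count a word across a sentence boundary.

```python
import re
from collections import Counter


def left_collocates(text: str, node: str) -> Counter:
    """Count the words that immediately precede `node` (context: 1 left, 0 right)."""
    words = [w.lower() for w in re.findall(r"[\w']+", text)]
    # Pair every word with the word that follows it and keep pairs ending in the node.
    return Counter(prev for prev, cur in zip(words, words[1:]) if cur == node)


# Invented sample text, not taken from any corpus.
SAMPLE = (
    "A burning desire met a dying wish. His dearest wish was simple. "
    "An overwhelming desire, a burning desire, a birthday wish."
)

if __name__ == "__main__":
    for node in ("desire", "wish"):
        print(node, left_collocates(SAMPLE, node).most_common())
```

Placing the two lists side by side, as the Compare words display does, shows at a glance which preceding words are shared by the two key words and which are distinctive to one of them.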


The activities above have moved from a comparison of the use of particular terms in poems or songs to comparison with their use in fiction and in corpora of the language more generally. By paying attention in particular to the forms words take, and to their collocations, that is 'the company they keep', we can begin to see what is unusual - or indeed usual - about the use of a term or set of terms in a literary text. In short, we can see how the individual use of a word or phrase in a literary text relates to its uses in larger language systems, literary and non-literary.

Follow-up activity

Repeat and vary the activities based on desir* and wish* above with another set of terms that you have found in a literary text of your choice. For example, you might wish to explore the italicised words in the quotations below, from Shelley's 'The Question':

 I dreamed that, as I wandered by the way,
 Bare winter suddenly was changed to spring
 And gentle odours led my steps astray
 Mixed with a sound of water's murmuring
 Along a shelving bank of turf, which lay
 Under a copse, and hardly dared to fling
 Its green arms round the bosom of the stream,
 But kissed it and then fled, as thou mightst in dream.

2.4 Intertextuality

Intertextuality refers to the relationship between texts. Often one text will allude to another through the use of quotation. Quotations can offer an interpretation of the situation being described (e.g. X is like this scene in Hamlet, or affords an example of a Wildean witticism). They can also set up a relationship between reader and writer - both are established as part of a community of readers who are familiar with canonical literature.

By using the online TIME corpus, we can find examples of the following quotations being used as intertextual allusions in American journalism. Not all allusions will repeat the full expression, so you will find more results if you search for parts of the quotations or variations on them too. Try searching for the following allusions in the TIME corpus.

  • "What's in a name? that which we call a rose/By any other name would smell as sweet". Shakespeare, Romeo and Juliet  (II. i. 43)
  • "What immortal hand or eye/Could frame thy fearful symmetry?" Blake 'The Tyger'

Find some literary quotations of your own and search for them in the TIME corpus. In each case, consider:

  • What does the use of the allusion add to the text in which it occurs?
  • How easy is it to identify the original source of the quotation?
  • How often is the intertextual nature of the sequence made explicit?
  • Do you notice a change in usage over time?


The incidences of these allusions in the TIME corpus are spread over several decades, attesting to the durability of quotations from canonical literature. The allusions fall into two types, which might be called overt and covert allusions. The former explicitly refer to the source author or text, either humorously or seriously. Thus a comic writer in the 1930s states the following:

One of my boys I named William Shakespeare- after me, not the play writer. I don't take much stock in names. A rose by any other name would smell as sweet, as the fellow said.

In a more serious vein, a 1997 commentator on the cloning of Dolly the sheep alludes to William Blake's poem:

In his mystic forays into the nature of creation, the poet William Blake questioned both the lamb and the tiger about their origins, asking the tiger who it was who could have possibly crafted its "fearful symmetry." "Did he smile his work to see? Did he who made the Lamb make thee?" This year, out of a research institute in Scotland, a lamb named Dolly came roaring similarly existential questions. For Dolly was a clone, and her doubling had a fearful symmetry

Covert allusions do not refer to the source author or text, but simply assume that the reader will spot the connection. They thus serve to bond author and reader in a community that recognises references to canonical literature without being prompted. Consider the headline writer who prefaces a report on a scientific test with an adaptation of the lines from Romeo and Juliet:

When University of Oxford researchers presented volunteers with a vial of cheddar-cheese odor labeled either CHEDDAR CHEESE or BODY ODOR, guess which one they preferred? Sure enough, subjects found the odor significantly more pleasant when they thought they were smelling cheese.

And, again more seriously, Strobe Talbott in a report on the Middle East in 1990 leaves it up to his readers to identify the source of the striking phrase used in his account of the stand-off between warring parties:

The fearful symmetry in that exchange of threats between Baghdad and Jerusalem is what mutual deterrence is all about. It echoes the tacit High Noon dialogue between Moscow and Washington in the worst days of the cold war.

Corpus searches for the continuing use of literary phrases in general language cast light on literature as intellectual capital - we can see how it is traded, adapted, mocked and revered in text after text as part of a continuing dialogue between writers and readers.

Follow-up activity

1. Go to an online dictionary of literary quotations, such as http://www.bartleby.com/ or http://www.online-literature.com/
2. Choose a set of quotations that strikes you as interesting
3. Go to http://corpus.byu.edu/time/ and check if they have been used by writers of Time magazine
4. If a quotation has been used, was it overtly or covertly? Explain why you think the journalist has alluded to the source text or author.

Further study

Anderson, Wendy and John Corbett (2008). Exploring English with Online Corpora. Basingstoke: Palgrave Macmillan.
Baker, Paul (2006). Using Corpora in Discourse Analysis. London: Continuum.
Hunston, Susan (ed.) (2002). Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
McEnery, Tony et al (2005). Corpus-based Language Studies: An advanced resource book. London: Routledge.