Methods for Text Mining and Analysis of Text Corpora Workshop, 16th-18th April 2018

This page contains material from the workshop held as part of the SGSAH/SGSSS Spring into Methods programme 2018. The two-and-a-half-day workshop aimed to introduce students to the basic tools and methodologies which can be used to process and analyse electronic text for research purposes. Many of the tools and methods employed have been developed by the digital humanities research community and are extensively used in the field known as corpus linguistics, but have a wider application and may be useful to researchers across the arts, humanities, and social sciences.

The first day of the workshop introduced the underlying concepts for digital analysis of text, including what text collections can be used for research, basic tools, statistical measures necessary for analysis, and common metadata associated with words. Day two introduced techniques for creating visual representations of textual data, looked at resources for searching by meaning (as opposed to by word), and discussed using text analysis to improve police negotiation techniques - a real-world situation in which choice of words and expression can have vital consequences. Day three concluded with a look at another application of text analysis: the identification of authorship through automatic analysis of writing.

We regret that materials from the police negotiation session cannot be made publicly available due to the sensitive nature of some of the content.

Workshop Handouts

Day 1 (Monday 16th April 2018)

Basic Analysis Handout I

Basic Analysis Handout II

Statistics (Statistics Handout I, Statistics Handout II)

Annotation Handout


Day 2 (Tuesday 17th April 2018)

Stylistics Handout

Meaning Handout

Visualisation Handout


Day 3 (Wednesday 18th April 2018)

Stylometry (Stylometry I, Stylometry II)

Working Folders

Additional content for two of the workshop sessions is provided below. Handouts associated with the material are included in the .zip files.

The working folder for the Stylometry session is too large to be hosted here; please email if you would like a copy of this folder.

Statistics Working Folder

Visualisation Working Folder

Tools and Resources

Below are links to some of the text analysis tools and resources which were used in the workshop.


AntConc - downloadable software, includes features such as concordancing and identification of collocates and keywords

CLAWS - part-of-speech tagger, available as a free tool for short stretches of text, also part of the Historical Thesaurus Semantic Tagger (HTST; see below)

Corpus Linguistics in Context (CLiC) - a combined corpus and tool providing an interface for conducting stylistics research on Dickens and other 19th-century texts

The Historical Thesaurus Semantic Tagger (HTST) - a tool for annotating texts based on the semantic categories of the Historical Thesaurus of English. Also available through an online interface

Palladio - a network visualisation tool for creating graphics based on textual data input

R - a programming language often used by statisticians, good at handling analysis of statistics for texts

Stylometry for R - a package allowing stylometric analysis of texts through a user interface

The USAS Tagger - the forerunner to the HTST, based on a more limited semantic categorisation but with more extensive user refinement over time
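
For readers unfamiliar with concordancing, the feature mentioned under AntConc above, the following is a minimal sketch of a keyword-in-context (KWIC) listing in Python. It is not how AntConc or any other tool listed here is implemented; the function name `kwic`, the simple regular-expression tokeniser, and the sample sentence are all assumptions made for illustration.

```python
import re

def kwic(text, keyword, window=4):
    """Return (left context, node word, right context) for each hit.

    Hypothetical helper: tokenises on runs of letters and apostrophes,
    then collects up to `window` tokens either side of each match.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:  # case-insensitive match on the node word
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines

# Illustrative sample text, not taken from the workshop materials
sample = ("It was the best of times, it was the worst of times, "
          "it was the age of wisdom.")

for left, node, right in kwic(sample, "times", window=3):
    # Right-align the left context so the node words line up in a column
    print(f"{left:>20} [{node}] {right}")
```

Lining up the node word in a central column is what makes recurring patterns to its left and right easy to spot by eye; dedicated tools add sorting, larger contexts, and collocate statistics on top of this basic idea.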



The Historical Thesaurus of English - a highly detailed semantic categorisation of the word senses in English throughout the language's history

Mapping Metaphor - a visualised dataset of the metaphorical links between semantic categories in the Historical Thesaurus

The Oxford English Dictionary - the leading dictionary of English on historical principles; can be accessed through a community library log-in as well as a university library log-in


Text and Corpora

Brigham Young corpora - a suite of corpora with an excellent user interface, created and maintained by Mark Davies at Brigham Young University (BYU)

The British National Corpus (BNC) - the homepage of this resource, which is available in different formats and platforms

Brigham Young BNC - a very useful web interface for searching the BNC

The Hansard Corpus - the record of speech in the Houses of Commons and Lords in the UK parliament. Semantically tagged, and hosted on the Brigham Young corpora page

The Oxford Text Archive (OTA) - a repository of electronic text available for researchers to download and use

Project Gutenberg - an online repository of free-to-use, out-of-copyright texts maintained by volunteer contributors

The Scottish Corpus of Text and Speech (SCOTS) - a freely available corpus of Scots and Scottish English