Methods for Text Mining and Analysis of Text Corpora Workshop, 16th-18th April 2018
This page contains material from the workshop held as part of the SGSAH/SGSSS Spring into Methods programme 2018. The two-and-a-half-day workshop aimed to introduce students to the basic tools and methodologies which can be used to process and analyse electronic text for research purposes. Many of the tools and methods employed have been developed by the digital humanities research community and are extensively used in the field known as corpus linguistics, but have a wider application and may be useful to researchers across the arts, humanities, and social sciences.
The first day of the workshop introduced the underlying concepts for digital analysis of text, including what text collections can be used for research, basic tools, statistical measures necessary for analysis, and common metadata associated with words. Day two introduced techniques for creating visual representations of textual data, looked at resources for meaning (as opposed to word) searching, and discussed using text analysis to improve police negotiation techniques - a real-world situation in which choice of words and expression can have vital consequences. Day three concluded with a look at another application of text analysis: the identification of authorship through automatic analysis of writing.
For any questions about the workshop, please contact Fraser Dallachy: email@example.com. We regret that materials from the police negotiation session cannot be made publicly available due to the sensitive nature of some of the content.
Day 1 (Monday 16th April 2018)
Day 2 (Tuesday 17th April 2018)
Day 3 (Wednesday 18th April 2018)
Additional content for two of the workshops is provided below. Handouts associated with the material are included in the .zip files.
The working folder for the Stylometry session is too large to be hosted here; please email Fraser.Dallachy@glasgow.ac.uk if you would like a copy of this folder.
Tools and Resources
Below are links to some of the text analysis tools and resources which were used in the workshop.
AntConc - downloadable software, includes features such as concordancing and identification of collocates and keywords
CLAWS - part of speech tagger, available as a free tool for short stretches of text, also part of the HTST (see below)
Corpus Linguistics in Context (CLiC) - a combined corpus and tool providing an interface for conducting stylistics research on Dickens and 19th-century texts
Palladio - a network visualisation tool for creating graphics based on textual data input
R - a programming language widely used by statisticians and well suited to the statistical analysis of texts
Stylometry for R - a package allowing stylometric analysis of texts through a user interface
The USAS Tagger - the forerunner to the HTST, based on a more limited semantic categorisation but with more extensive user refinement over time
The Historical Thesaurus of English - a highly-detailed semantic categorisation of the word senses in English throughout the language's history
Mapping Metaphor - a visualised dataset of the metaphorical links between semantic categories in the Historical Thesaurus
The Oxford English Dictionary - the leading dictionary of English on historical principles; can be accessed through community as well as university library log-in
Text and Corpora
Brigham Young corpora - a suite of corpora with an excellent user interface, created and maintained by Mark Davies at Brigham Young University
The British National Corpus (BNC) - the homepage of this resource, which is available in different formats and platforms
Brigham Young BNC - a very useful web interface for searching the BNC
The Hansard Corpus - the record of speech in the Houses of Commons and Lords in the UK parliament. Semantically tagged, and hosted on the BYU corpora page
The Oxford Text Archive (OTA) - a repository of electronic text available for researchers to download and use
Project Gutenberg - an online repository of free-to-use, out-of-copyright texts maintained by volunteer contributors
The Scottish Corpus of Text and Speech (SCOTS) - freely available corpus of Scots and Scottish English