SAMUELS (Semantic Annotation and Mark-Up for Enhancing Lexical Searches)
The SAMUELS project (Semantic Annotation and Mark-Up for Enhancing Lexical Searches) was funded by the Arts and Humanities Research Council in conjunction with the Economic and Social Research Council (grant reference AH/L010062/1) from January 2014 to April 2015. It produced a system for automatically annotating words in texts with their precise meanings, disambiguating between possible meanings of the same word, with the aim of ultimately enabling a step-change in the way we deal with large textual data. The tagging system uses the Historical Thesaurus of English as its core dataset, and provides for each word in a text the Historical Thesaurus reference code for that concept. Textual data tagged in this way can then be accurately searched and precisely investigated, producing results which can be automatically aggregated at a range of levels of precision. The project also drew on a series of research sub-projects which employed the software as it was being developed, testing and validating the utility of the SAMUELS tagger as a tool for wide-ranging further research.
The SAMUELS consortium consisted of the University of Glasgow (lead institution), Lancaster University, the University of Huddersfield, the University of Central Lancashire, the University of Strathclyde, and Oxford University Press. Our international partners were Brigham Young University (Utah), Åbo Akademi University (Finland), and the University of Oulu (Finland).
The SAMUELS project was funded by the AHRC with the ESRC
Aims and Timescale
As humanities datasets get larger, we need more sophisticated techniques of analysis; this is partly because manually reading and checking the information gathered in ‘big data’ repositories is no longer feasible, but also because there is a latent potential to explore these corpora in ways which have not been possible before the creation of such vast datasets. It is increasingly important that we be able to employ ‘distant reading’ techniques to extract conceptual and historical data from corpora which are too large to read in any standard way.
The most significant issue in distant reading research is that our primary methodology for searching these datasets, then aggregating and analysing the results, relies not on concepts or meanings but rather on word forms. These forms are imperfect and evasive proxies for the meanings they refer to; 60% of word forms in English refer to more than one meaning, and some word forms refer to close to two hundred meanings. For example, the word ‘spring’ has 150 possible meanings. As a result, the irrelevant ‘noise’ which appears when searching using word forms grows with the size of the texts being searched – a traditional search for the word ‘spring’ would return every use of the word to denote a season of the year, as well as every use for a coiled piece of metal, and (potentially) uses for a type of salmon, a name for egg-yolk, and effecting another’s escape from prison.
In big data contexts, this problem cripples research, with analyses becoming entirely intractable and requiring impossible amounts of manual intervention. The SAMUELS project produced software, the Historical Thesaurus Semantic Tagger (HTST), which automatically ‘tags’ every word in any given electronic corpus with the correct meaning of the word in context. This makes meaning-based rather than word-based searching of texts possible. Searching is not, however, limited to queries of a ‘one word with one meaning’ format. The Historical Thesaurus arranges every word in a complex hierarchy of meanings. As a result, searches can be performed on the basis of concepts without the need to identify a particular word as the sole search term. Alternatively, users can search on a truncated category set, allowing faster access to important concepts. Using the HTST, resources such as the Hansard Corpus, consisting of over 2.3 billion words of text, or the EEBO-TCP corpus of 40,000 early modern books, can quickly and efficiently be mined for frequently-occurring or statistically unusual concepts, can be precisely searched for fine-grained meanings, or can be examined for striking or significant patterns of meaning change across time.
As a result, the HTST permits researchers to focus their work on comprehensive analyses of concepts (such as power, morality, disease, faith, emotions, war, food, or the senses) rather than on finding appropriate words to search for in corpora, and it facilitates new techniques of exploring, searching and investigating large-scale phenomena in big humanities datasets.
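The concept-based searching described above can be illustrated with a minimal sketch. It assumes tagged text is available as (word form, concept code) pairs and that codes are dot-separated so that a shorter code names a broader category. The codes, the data, and the function names are all invented for illustration; they are not real Historical Thesaurus references or part of the HTST.

```python
from collections import Counter

# Hypothetical tagged tokens: (word form, concept code).
# The codes are invented for illustration only; they are NOT real
# Historical Thesaurus reference codes.
TAGGED = [
    ("spring", "01.05.06"),  # season of the year (invented code)
    ("spring", "03.09.02"),  # coiled piece of metal (invented code)
    ("summer", "01.05.07"),  # season of the year (invented code)
    ("coil", "03.09.02"),    # coiled piece of metal (invented code)
    ("spring", "01.05.06"),  # season of the year (invented code)
]


def search_concept(tokens, prefix):
    """Return tokens whose code equals `prefix` or sits beneath it in the hierarchy."""
    return [(w, c) for w, c in tokens if c == prefix or c.startswith(prefix + ".")]


def aggregate(tokens, depth):
    """Count tokens per category after truncating every code to `depth` levels."""
    return Counter(".".join(c.split(".")[:depth]) for _, c in tokens)


if __name__ == "__main__":
    # A search on the broader category finds 'spring' (season) and 'summer',
    # but not 'spring' in its coiled-metal sense.
    print(search_concept(TAGGED, "01.05"))
    # Aggregating at a coarser level merges the fine-grained categories.
    print(aggregate(TAGGED, 2))
```

Searching on a code prefix rather than a word form is what removes the ‘noise’ of unrelated senses in this toy setting, and truncating codes gives the coarser aggregation levels.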
The Historical Thesaurus Semantic Tagger was programmed by computing and linguistics researchers at Lancaster University’s University Centre for Computer Corpus Research on Language (UCREL) with the input of semanticists at the University of Glasgow. It builds on the Lancaster team’s highly successful UCREL Semantic Analysis System (USAS), which was already capable of tagging texts with semantic labels based on a thesaurus which contains over 45,000 words and almost 19,000 multi-word expressions. The Historical Thesaurus of English dataset is considerably larger at almost 800,000 entries, and the use of such a vast thesaurus provides the ability to make much finer-grained distinctions in word meanings. Once a prototype of the tagging software had been developed, it was trialled by the sub-projects (detailed below), thus providing immediate benefit to academic research whilst its efficacy as a tool for future researchers was established.
The project lasted for 15 months and had three phases:
The first 6 months involved the core team at Glasgow and Lancaster. The first version of the semantic annotation software was built; this work included the construction and testing of a series of word sense disambiguation routines and the integration of the annotator into a workflow which included variant spelling normalisation and part-of-speech tagging.
The next 6 months involved the iterative improvement of the system's disambiguation, supported and led by the three defined research projects (detailed below). These used big data resources to undertake humanities research and were designed as test cases which could provide detailed, grounded, and wide-ranging feedback on the nature and design of the tools needed to explore the data the project software created. Many of the partners in the project came on-line during the second phase, as users whose focus on scholarly and digital outputs was essential to inform the design and implementation work which made up the project's core.
The final stage comprised the final three months of the project, providing time to complete, publish and launch the Historical Thesaurus Semantic Tagger (HTST) and its interfaces, and also for the research-based sub-projects to focus on producing outputs which demonstrate the utility and power of a semantically annotated corpus.
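The annotation workflow built in the first phase (variant spelling normalisation, part-of-speech tagging, and word sense disambiguation) can be pictured with a toy sketch. This is not the HTST implementation: the lookup tables, the sense labels, the order of the steps, and the simple cue-word overlap used for disambiguation are all assumptions made for illustration.

```python
# Toy, hypothetical lookup tables; none of this data comes from the HTST.
VARIANTS = {"loue": "love", "vertue": "virtue"}  # variant spelling -> modern form
POS = {"love": "NOUN", "virtue": "NOUN", "spring": "NOUN", "the": "DET", "of": "ADP"}
SENSES = {  # word -> {hypothetical sense label: context cue words}
    "spring": {
        "season": {"summer", "winter", "flowers"},
        "coil": {"metal", "watch", "steel"},
    },
}


def normalise(token):
    """Lower-case the token and map a known variant spelling to its modern form."""
    lowered = token.lower()
    return VARIANTS.get(lowered, lowered)


def pos_tag(token):
    """Look up a part-of-speech tag; unknown words get the placeholder 'X'."""
    return POS.get(token, "X")


def disambiguate(token, context):
    """Pick the sense whose cue words overlap most with the context words.

    Returns None for words with no listed senses; ties go to the first
    sense listed.
    """
    candidates = SENSES.get(token)
    if not candidates:
        return None
    return max(candidates, key=lambda sense: len(candidates[sense] & context))


def annotate(tokens):
    """Run the three steps in order and return (raw, normalised, pos, sense) tuples."""
    normalised = [normalise(t) for t in tokens]
    context = set(normalised)
    return [
        (raw, norm, pos_tag(norm), disambiguate(norm, context))
        for raw, norm in zip(tokens, normalised)
    ]


if __name__ == "__main__":
    for row in annotate(["The", "spring", "of", "steel"]):
        print(row)
```

Normalising spelling before the later steps means that a historical variant such as ‘loue’ is looked up under its modern form in this sketch.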
Academic Project Team
Dr Marc Alexander, University of Glasgow (PI)
Jean Anderson, University of Glasgow
Professor Dawn Archer, University of Central Lancashire
Dr Alistair Baron, Lancaster University
Professor Jonathan Hope, University of Strathclyde
Professor Lesley Jeffries, University of Huddersfield
Professor Christian Kay, University of Glasgow
Dr Paul Rayson, Lancaster University
Dr Brian Walker, University of Huddersfield
Brian Aitken, University of Glasgow
Dr Fraser Dallachy, University of Glasgow
Dr Scott Piao, Lancaster University
Dr Jane Demmen, University of Huddersfield
Bethan Malory, University of Central Lancashire
Stephen Wattam, Lancaster University
Elina Koristashevskaya, University of Glasgow, Project Assistant
Meg MacDonald, University of Glasgow, Financial and administrative assistance
Hansard and Parliament from Above - University of Glasgow
Project Team: Dr Marc Alexander, Professor Christian Kay, Dr Fraser Dallachy
This project used the semantically annotated Hansard records in order to map broad-scale patterns in the semantic categories which have been most heavily used in parliamentary discourse across time. The key phrases and concepts employed were visualised to indicate what semantic areas have been employed heavily by particular MPs and political parties in the past two hundred years. Also of interest were subjects which show unusual behaviour such as sudden spikes or dips in usage. These are highlighted in the visual representations, as are pre-selected topics of interest, such as war, honesty, civilization, honour, homosexuality, and terrorism.
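One simple way to flag the ‘sudden spikes or dips’ mentioned above is to compare each year's relative frequency for a concept against the mean across all years. The sketch below is an illustration only, not the project's method: the z-score threshold, the function names, and the example figures are all assumptions.

```python
from statistics import mean, stdev


def relative_frequency(concept_counts, total_counts):
    """Per-year share of all tagged words that fall in the concept category."""
    return {year: concept_counts.get(year, 0) / total_counts[year] for year in total_counts}


def unusual_years(freq, threshold=2.0):
    """Label years whose frequency lies `threshold` sample standard deviations from the mean."""
    values = list(freq.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return {}  # perfectly flat usage: nothing to flag
    flagged = {}
    for year, value in freq.items():
        z = (value - mu) / sigma
        if abs(z) >= threshold:
            flagged[year] = "spike" if z > 0 else "dip"
    return flagged


if __name__ == "__main__":
    # Invented figures: 1,000 tagged words per year, with one busy year.
    totals = {year: 1000 for year in range(1900, 1910)}
    counts = {year: 10 for year in range(1900, 1910)}
    counts[1905] = 100
    print(unusual_years(relative_frequency(counts, totals)))
```

Using relative rather than raw frequency matters for a corpus such as Hansard, where the amount of text recorded per year varies considerably.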
Is There a Baron in the Commons? - University of Huddersfield
Project Team: Professor Lesley Jeffries, Dr Brian Walker, Dr Jane Demmen
The language used in connection with a subject reveals much about a speaker’s conception of that subject, and the more contentious the subject, the stronger the use of imagery is likely to be. This project used the semantically tagged corpus of Hansard debates (1803-2003) to investigate the language used to discuss trade unions in political speeches. Beginning with the governments of Margaret Thatcher and James Callaghan, the project team searched for the semantic groupings which collocate with discussion of trade unions. This allowed them to trace the linguistic imagery and associations which politicians have used in their representation of trade unions, and thus the way in which they have wished to conceptualise – or to have their listeners conceptualise – the unions and their members.
Delineating Aggression Across Genres - University of Central Lancashire
Project Team: Professor Dawn Archer, Bethan Malory
This project followed on from Professor Archer’s previous work on the Old Bailey corpus, tagged using the USAS (UCREL Semantic Analysis System) tagger, the predecessor of the Historical Thesaurus Semantic Tagger. It explored the ‘aggression space’ in subsections of the Early English Books Online (EEBO) and Hansard corpora, focussing on times of unrest such as the ‘War of 1812’ and the ‘Winter of Discontent’. The project identified strings of collocated semantic and part-of-speech tags which were indicative of the expression of aggravation, conflict, and impoliteness. This work has resulted in the categorisation of facework strategies employed by speakers when expressing aggression, with these linked to the combinations of tags with which they are associated.
Visualising English Print (1470 - 1800) - University of Strathclyde
The Visualising English Print (VEP) project’s ongoing research aims to develop methods of analysing and displaying data on linguistic and generic variation in the first three centuries of English printed books. Professor Jonathan Hope at Strathclyde is leading the SAMUELS-related branch of VEP, which will investigate the ways in which semantically tagged data for the EEBO corpus could be employed to help visualise such features in these texts, particularly semantic components of genre variation. Applying the data to a different sort of problem, VEP will explore whether the tagged EEBO data can be used to strengthen the metadata attached to the large data sources employed for their research.
Tools and Corpora
The SAMUELS consortium has produced a considerable amount of output connected to the work conducted for the project. The most important of these is the Historical Thesaurus Semantic Tagger (HTST) itself, which can be downloaded from the link below. The zipped file contains a readme text which advises the user on how to run the GUI. This is suitable for use on small to medium-sized bodies of text (up to c. 100,000 words). For larger bodies of text it is advisable to contact Paul Rayson at UCREL to discuss the use of their servers for the tagging process.
The semantically tagged Hansard corpus is now available via Mark Davies's corpora site at Brigham Young University, with the EEBO-TCP corpus to follow.
Presentations, Publications and Dissemination Work
The following activities have been undertaken, or are planned, by members of the consortium, either directly in relation to the project or employing the HTST in their methodology.
Conference Papers and Presentations
DH 2014, Lausanne – Metaphor, Popular Science, and Semantic Tagging. Marc Alexander, Jean Anderson, Fraser Dallachy, Christian Kay, Scott Piao, Paul Rayson (July 2014)
DHC 2014, Sheffield – Developing the Historical Thesaurus Semantic Tagger. Scott Piao, Fraser Dallachy, Alistair Baron, Paul Rayson, Marc Alexander (September 2014)
Europeana Cloud Workshop, The Hague – Historical Linguistics/Psychology presentation. Paul Rayson (December 2014)
Syphilis Symposium 2015, Glasgow – “Dear was the Conquest of a new found World”: Digital humanities and the language of syphilis. Marc Alexander (January 2015)
CILC 2015, Madrid – Big Data Challenges with Big Corpora and Big Taxonomies. Paul Rayson (March 2015)
ICAME 2015, Trier – Large-scale time-sensitive semantic analysis of historical corpora. Paul Rayson, Alistair Baron, Scott Piao, and Stephen Wattam (May 2015)
Political Discourses: Multidisciplinary approaches, UCL – Investigating the lexis of labour relations in UK House of Commons debates over time: a study of parliamentary language using corpus linguistic methods and automated semantic tagging. Jane Demmen, Lesley Jeffries, Brian Walker (June 2015)
PALA 2015, Kent – “flat and insipid, damp’d and extinguish’d, bitter’d and poison’d”: Insipidity and Taste in Early Modern English. Marc Alexander, Fraser Dallachy (July 2015)
PALA 2015, Kent – Is there a Baron in the Commons? The lexis of labour relations in parliamentary language across time. Jane Demmen, Lesley Jeffries, Brian Walker (July 2015)
CL 2015, Lancaster – Semantic Tagging and Early Modern Collocates. Marc Alexander, Alistair Baron, Fraser Dallachy, Scott Piao, Paul Rayson, Stephen Wattam (July 2015)
CL 2015, Lancaster – Tracing Verbal Aggression over time, using the Historical Thesaurus of English. Dawn Archer & Bethan Malory (July 2015)
ICLC13, Newcastle (July 2015)
• ‘The Lexis of Labour Relations in Hansard across Time: Perspectives from the HTE’. Jane Demmen, Lesley Jeffries, Brian Walker
• ‘Mapping Aggression over Time Using the Historical Thesaurus of English’. Dawn Archer, Bethan Malory
• ‘Populating Input Spaces: Conceptual blending and the Historical Thesaurus of English’. Marc Alexander, Fraser Dallachy
Alexander, Marc, Fraser Dallachy, Scott Piao, Alistair Baron, Paul Rayson. ‘Metaphor, Popular Science and Semantic Tagging: Distant reading with the Historical Thesaurus of English’, Digital Scholarship in the Humanities (DSH) (in press)
Archer, Dawn, Merja Kytö, Alistair Baron, Paul Rayson (2015). ‘Guidelines for normalising Early Modern English corpora: Decisions and justifications’. ICAME Journal (39)
Archer, Dawn (2015). Slurs, insults, (backhanded) compliments and other strategic facework moves. Language Sciences [in preparation]
Further papers are in preparation by the researchers at Lancaster, UCLAN, and Huddersfield. Details of these will be added here as they are finalised.
Stall at Loncon 3, The 72nd World Science Fiction Convention (London; August 2014)
Stall at Explorathon, European Researchers’ Night (Glasgow; September 2014)
Final Project Meeting and Future Plans
A two-day meeting was held on 26-27 March 2015 to conclude the SAMUELS project. This allowed the entire team to discuss the methodologies they had employed for their work, present results of their research, and feed back on use of the tagger and the tagged data it provided. PowerPoint presentations representing the different aspects and strands of the project are available below.
Possibilities for future work were discussed at the meeting, with the consortium planning to build on the research and results achieved during the SAMUELS project. Further research at Glasgow has already begun to use the semantically-tagged EEBO-TCP corpus as part of the 'Linguistic DNA of Modern Thought' project, led by Susan Fitzmaurice at the University of Sheffield. Glasgow and Lancaster aim to evolve the function of the tagger to recognise higher-level linguistic features, whilst the teams at Huddersfield and UCLAN are using the data in their ongoing investigation of the language of labour relations and aggression respectively.
Final Project Meeting Presentations
Lancaster: HTST Evaluation Notes - Scott Piao
Glasgow: Update on Glasgow Input to HTST - Fraser Dallachy
Huddersfield: Is There a Baron in the Commons? Project Update - Lesley Jeffries, Brian Walker, Jane Demmen
UCLAN: Tracing Verbal Aggression over Time - Dawn Archer, Bethan Malory