Text as Data - An Introduction to Document Analytics (M) COMPSCI5096

  • Academic Session: 2019-20
  • School: School of Computing Science
  • Credits: 10
  • Level: Level 5 (SCQF level 11)
  • Typically Offered: Semester 2
  • Available to Visiting Students: No
  • Available to Erasmus Students: No

Short Description

This course will introduce stochastic notions necessary for analysing text/documents, in both unsupervised and supervised approaches. The course is designed to be applied in nature, using various text analytics examples, and introducing students to various toolkits for document processing (e.g. scikit-learn, spaCy, NLTK).


Two one-hour lectures and one one-hour tutorial per week.

Requirements of Entry

This class requires a background in statistics, probability theory, and linear algebra. Proficiency in Python is also required. This background is roughly equivalent to the material covered in Data Fundamentals (H).

Excluded Courses

Text as Data (H)




Examination 70%, written assignment 10%, coursework 20%.

Main Assessment In: April/May

Are reassessment opportunities available for all summative assessments? No

Reassessments are normally available for all courses, except those which contribute to the Honours classification. For non-Honours courses, students are offered reassessment in all or any of the components of assessment if the satisfactory (threshold) grade for the overall course is not achieved at the first attempt. This is normally grade D3 for undergraduate students and grade C3 for postgraduate students. Exceptionally it may not be possible to offer reassessment of some coursework items, in which case the mark achieved at the first attempt will be counted towards the final course grade. Any such exceptions for this course are described below. 


MSc students can resit the exam.


The coursework cannot be redone because feedback is provided to students after the original submission, which would give any student redoing it an unfair advantage.

Course Aims

This course aims to introduce students to language modelling, representations of documents, natural language processing, information theory, network models and, more generally, the use of stochastic thinking (including building on the probability theory learned during the Data Fundamentals course). These topics are illustrated through a number of supervised and unsupervised text processing and analytics approaches, as well as their applications, such as information extraction, question answering, and dialogue systems. The course is designed to be applied in nature, using text analytics as an example, and introducing students to various toolkits for document processing (e.g. scikit-learn).

Intended Learning Outcomes of Course

By the end of this course students will be able to:

1. Describe classical models for textual representations such as the one-hot encoding, bag-of-words models, and sequences with language modelling.

2. Identify potential applications of text analytics in practice.

3. Describe various common techniques for classification, clustering and topic modelling, and select the appropriate machine learning task for a potential document processing application.

4. Represent data as features to serve as input to machine learning models.

5. Assess machine learning model quality in terms of relevant error metrics for document processing tasks, in an appropriate experimental design.

6. Deploy unsupervised and machine learned approaches for document/text analytics tasks.

7. Critically analyse and critique recent developments in the academic literature on natural language and text processing.

8. Evaluate and explain the appropriate application of recent research developments to real-world problems.

Minimum Requirement for Award of Credits

Students must submit at least 75% by weight of the components (including examinations) of the course's summative assessment.