Text as Data - An Introduction to Document Analytics (H) COMPSCI4074

  • Academic Session: 2023-24
  • School: School of Computing Science
  • Credits: 10
  • Level: Level 4 (SCQF level 10)
  • Typically Offered: Semester 2
  • Available to Visiting Students: Yes

Short Description

This course introduces the stochastic concepts needed to analyse text and documents, using both unsupervised and supervised approaches. The course is applied in nature: it works through a range of text analytics examples and introduces students to toolkits for document processing (e.g. Weka, Spark MLlib).


Timetable

Two one-hour lectures and one one-hour tutorial per week.

Requirements of Entry

Data Fundamentals (H) (or equivalent)

Excluded Courses

Text as Data (M)




Assessment

Examination 80%, coursework 20%.

Main Assessment In: April/May

Are reassessment opportunities available for all summative assessments? No

Reassessments are normally available for all courses, except those which contribute to the Honours classification. For non-Honours courses, students are offered reassessment in all or any of the components of assessment if the satisfactory (threshold) grade for the overall course is not achieved at the first attempt. This is normally grade D3 for undergraduate students and grade C3 for postgraduate students. Exceptionally it may not be possible to offer reassessment of some coursework items, in which case the mark achieved at the first attempt will be counted towards the final course grade. Any such exceptions for this course are described below. 


Because feedback and solutions to the coursework will be provided, reassessment of the coursework is not possible.

Course Aims

This course aims to introduce students to language modelling, geometric interpretations of documents, information theory, network models and, more generally, the use of stochastic thinking (building on the probability theory learned in the Data Fundamentals course). These topics are illustrated through a number of supervised and unsupervised text processing and analytics approaches, as well as applications such as bibliometric analysis, web link analysis, and social media analytics. The course is applied in nature, using text analytics as its running example and introducing students to toolkits for document processing (e.g. Weka, Spark MLlib).

Intended Learning Outcomes of Course

By the end of this course students will be able to:

1. Describe classical models for textual representation, such as the vector space model, network graphs, and language modelling.

2. Identify potential applications of text analytics in practice.

3. Describe various common techniques for regression, classification, clustering and topic modelling, and select the appropriate machine learning task for a potential document processing application.

4. Represent data as features to serve as input to machine learning models.

5. Assess machine learning model quality in terms of relevant error metrics for document processing tasks, in an appropriate experimental design.

6. Deploy unsupervised and machine-learned approaches for document/text analytics tasks.

Minimum Requirement for Award of Credits

Students must submit at least 75% by weight of the components (including examinations) of the course's summative assessment.