Data Mining and Machine Learning II - Big Data and Unstructured Data

Course information

This course introduces data mining and machine learning methods used in big data scenarios, most notably regularised regression, and also introduces methods for analysing networks and unstructured data.

Prerequisite Knowledge

Learners should have basic experience with the R programming language (e.g. data management and plotting).

This course is typically taken in year 2 of the MSc in Data Analytics/Data Analytics for Government programme and learners typically have the knowledge and skills covered in our year 1 course.

This course assumes that you have comparative knowledge and skills covered in the following courses, alternatively, you may wish to consider taking some of the courses listed before attempting this course.

Intended Learning Outcomes

By the end of this course learners will be able to:

  • make appropriate use of informal and formal methods of network analysis;
  • describe the challenges of the analysis of highdimensional data and discuss, in a particular context, strategies for tackling big data problems;
  • formulate and fit a regularised linear model, such as ridge regression, the LASSO and partial least squares;
  • infer statements about (conditional) independence from graphical models and factorisations of the joint distribution;
  • describe methods for structural inference in graphical models and apply them in a given context;
  • make appropriate use of informal and formal methods for quantitative text analysis.

Syllabus

Week 1 (sample material)

  • Social network analysis
  • Managing network data
  • Visualising network data

Week 2

  • Network summary statistics
  • Node & edge level summaries
  • Network models
  • Stochastic block models

Week 3

  • Stochastic block models
  • Knowledge graphs
  • Machine learning in graphs

Week 4

  • Elastic nets
  • Lasso and ridge regression
  • Parameter estimation

Week 5

  • Elastic nets
  • Comparison between elastic nets and Lasso and ridge regression

Mid-term week break

Week 6

  • Quantitative text analysis
  • Preprocessing text data
  • Simple summaries of texts

Week 7

  • Peer assessment

Week 8

  • Further quantitative text analysis
  • Keyness, document similarity and keywords
  • Sentiment analysis
  • Topic models

Week 9

  • Graphical models
  • Graphics in R
  • Undirected and directed graphs

Week 10

  • Log-linear models
  • Bayesian networks

“Really interesting and wide-ranging course which covers lots of interesting maching learning and data mining procedures.”

Software

To take our courses please use an up-to-date version of a standard browser (such as Google Chrome, Firefox, Safari, Internet Explorer or Microsoft Edge) and a PDF reader (such as Acrobat Reader). Learning material will be distributed through Moodle. We encourage all learners to install R and RStudio and we provide detailed installation instructions, but learners can also use free cloud-based services (RStudio Cloud). Learners need to install Zoom for participating in video conferencing sessions. We recommend the use of a head set for video conferencing sessions.