Data Science Foundations

Course information

This course introduces students to data analytics and data science, and to different approaches to learning from data. It also provides an introduction to statistical model-based inference.

Prerequisite Knowledge

Learners should have a basic understanding of mathematics, including matrix algebra and calculus (for example, differentiation). Learners should also have basic experience with the R programming language (e.g., data management and plotting).

This course is typically taken in year 1 of the MSc in Data Analytics/Data Analytics for Government programme.

This course assumes that you have knowledge and skills comparable to those covered in the following courses. Alternatively, you may wish to consider taking some of the courses listed before attempting this course.

Intended Learning Outcomes

By the end of this course learners will be able to:

  • explain different types of data and data structures and discuss advantages and challenges of using data of different types in a given context;
  • describe different ways of collecting data and discuss advantages and challenges of using data obtained from different sources in a given context;
  • describe and visualise structured and unstructured data of different types using suitable summaries and plots;
  • explain different approaches to learning from data and discuss their advantages and disadvantages in a given context;
  • define and contrast population and sample, parameter and estimate;
  • implement these statistical methods using the R programming language;
  • write down and justify criteria required of 'good' point estimators, and check whether or not a proposed estimator within a stated statistical model satisfies these criteria;
  • apply the principle of maximum likelihood to obtain point and interval estimates of parameters in statistical models, making appropriate use of numerical methods for optimisation;
  • formulate and carry out hypothesis tests in Normal models, as well as in general likelihood-based models, correctly using the terms null hypothesis, alternative hypothesis, test statistic, rejection region, significance level, power and p-value.

Syllabus

Week 1

  • What is learning from data?
  • Data sources, structures and data types
  • Collecting and collating data

Week 2

  • Summarising and visualising data
  • Quality assuring data
  • Exploring relationships in data

Week 3

  • What is statistical inference?
  • A framework for hypothesis testing
  • Interpreting confidence intervals and p-values

Week 4

  • Calculating confidence intervals and constructing hypothesis tests for one/two sample problems
  • Computing the intervals and tests in R
  • Interpreting output from confidence intervals and hypothesis tests

Week 5 (sample material)

  • Properties of point estimators
  • The idea and concept of maximum likelihood
  • Maximum likelihood estimation for discrete distributions

Mid-term week break

Week 6

  • Maximum likelihood for continuous distributions
  • Maximum likelihood estimation on a boundary
  • Numerical optimisation
  • Properties of point estimators

Week 7

  • Definitions of relative likelihood and relative log-likelihood
  • Likelihood intervals
  • Large sample properties to obtain confidence intervals
  • Interpreting the results of these intervals

Week 8

  • Maximum likelihood for the normal distribution and multiple independent populations
  • The Hessian matrix
  • Properties of maximum likelihood estimators

Week 9

  • Approximate confidence intervals to compare parameters from independent populations
  • Interpreting the results of these intervals
  • Comparing hypotheses using likelihood
  • Type I and Type II errors and statistical power

Week 10

  • Large sample properties for a Generalised Likelihood Ratio Test (GLRT)
  • Applying and interpreting results from a GLRT
  • Deriving a GLRT for the multinomial distribution

Supplementary Material

  • Motivation for Bayesian inference
  • Using Bayes' theorem to obtain posterior distributions
  • Visualising prior and posterior distributions and likelihoods in R

“Masterclass in how to deliver a teaching module. Course notes were clear and concise with tasks that were relevant and required application of knowledge. Videos always clearly explained.”

Software

To take our courses, please use an up-to-date version of a standard browser (such as Google Chrome, Firefox, Safari, Internet Explorer or Microsoft Edge) and a PDF reader (such as Acrobat Reader). Learning material will be distributed through Moodle. We encourage all learners to install R and RStudio, and we provide detailed installation instructions, but learners can also use free cloud-based services (RStudio Cloud). Learners need to install Zoom to participate in video conferencing sessions. We recommend using a headset for these sessions.