Introduction#

Modern Statistics is an introductory grad-level course I teach at Princeton University (where it is listed as SML 505). It could easily have been called “practical statistics” or “computational statistics”. What makes the course modern is its use of computational tools, some of which are still being developed, that make a much wider range of problems practically solvable.

I will assume a moderate proficiency in programming (all examples and assignments will be in python) as well as linear algebra and multivariate calculus. The course is meant to be interdisciplinary, in other words approachable for students from various scientific disciplines. If you want to understand the concepts of a statistical approach to data, this course is for you.

In this course we will take a (reasonably) principled approach to data analysis. That means we will be thinking about models for signals and errors, likelihoods and priors, Bayes’ rule, etc. We will largely avoid “procedures” that are not underpinned by a probabilistic formulation.

At the most conceptual level, I will assume that there’s a phenomenon that we seek to understand, which is to say that we want to conjure up a model and then determine its “best” parameters. This phenomenon generates data; otherwise we might debate whether the phenomenon even exists (let’s assume that it does). At this stage data are ideal and infinitely many. It’s God’s view of the world, if you believe in such an entity. But we mere mortals don’t have access to them because our data collection utilizes real-world equipment. That makes our observations incomplete. Sampling and noise are examples of this incompleteness, as are truncated data and missing features.

[Figure: the phenomenon generates ideal data; a real-world measurement process turns them into the incomplete observations we actually record.]

There are two processes in this picture: one that generates the data and one that we use to measure aspects of it. Our goal will be to go backwards and infer the parameters of the phenomenon from its real-world observables.

A consequence of this picture, and the hallmark of statistics, is that we have to deal with data that are noisy and incomplete. In other words, the numbers we’ll be working with are not to be taken as direct representations of the original phenomenon; they are molded by and carry traces of the process by which they have been measured. It’s a surprisingly common mental flaw to see a number and assign an exact meaning to it. We then believe that we now know something exact about the phenomenon, when in reality our knowledge has just been mildly advanced. We therefore need rules for interpreting these inexact numbers, which we will discuss in chapter Probability. In the pursuit of inference, we also need to incorporate models of the entire data-taking process, usually a signal model and a noise model, which will be the subject of chapter Probabilistic Models.
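
The following is a minimal sketch of this forward picture in python, with a made-up linear phenomenon, a handful of sampled locations, and Gaussian measurement noise. The numbers and the least-squares step at the end are purely illustrative, not a method prescribed by the course.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "phenomenon": a linear relation with parameters we normally don't know
true_slope, true_intercept = 2.5, -1.0

# Signal model: the ideal, noise-free values of the phenomenon at the sampled locations
x = rng.uniform(0, 10, size=20)   # sampling: only 20 locations, not the full continuum
signal = true_slope * x + true_intercept

# Noise model: the measurement adds Gaussian scatter of width sigma
sigma = 1.5
y_observed = signal + rng.normal(0.0, sigma, size=x.size)

# Inference runs the picture backwards: from (x, y_observed, sigma)
# back to estimates of the phenomenon's parameters. Here a simple least-squares fit:
slope_hat, intercept_hat = np.polyfit(x, y_observed, deg=1)
print(slope_hat, intercept_hat)
```

Later chapters will replace the ad-hoc least-squares step with an explicit probabilistic treatment of exactly these signal and noise models.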

And we need a conceptual framework for turning them into the insights we seek. Statistics, it turns out, has two such frameworks: Frequentist and Bayesian. I prefer the latter, but we’ll discuss the ideas behind and the differences between both in chapter Bayesian Statistics, and utilize either when appropriate. In several chapters, we will employ probabilistic approaches that are often attributed to the field of Machine Learning (ML). We’ll point out the assumptions they make and use them, too. (Since statisticians can’t even agree whether the word “data” is singular or plural, we have some leeway here.) We will then transition to traditional sampling methods, improve their performance in several ways, and end with advanced methods for notoriously intractable problems.

If you must know, I do draw a line between ML and statistics. In most areas of ML, any approach that achieves good performance is fair game. This has led to remarkable achievements, especially in perception tasks such as labeling images. But it leads to ambiguities, even arbitrariness. I will point out when those arbitrary decisions are made, and I hope that you will develop at least some unease about them. In statistics, we seek to formulate a complete description of the processes at play, which removes the ambiguity. By the end of the course, we will see that there is a strong convergence between statistical and ML approaches, so we find ourselves in an enviable position: we can exploit both to find the best way to solve the problem at hand.

Literature Recommendations#

In addition to many resources that will be linked throughout these notes, the course broadly follows the book Machine Learning: A Probabilistic Perspective by Kevin Murphy. References refer to the 1st edition of that book. Use it if you want to go into more detail than we can here.

Chapters Probability and Bayesian Statistics follow the book Information Theory, Inference, and Learning Algorithms by David MacKay, which can be downloaded for free. It contains many great thoughts about the connection between information theory and inference.

Another great book, which we’ll borrow from in chapter Alternative Samplers, is A First Course in Bayesian Statistical Methods by Peter D. Hoff. It treats the statistical methodology more thoroughly than this course does, whereas we will focus on practical computational approaches.

Undergraduate Curriculum#

These lecture notes, in shortened form, can be used as the basis of a statistics course for undergraduate students in STEM fields. Below is a suggestion for such a curriculum:

  1. Sections 1 and 2 on the basics of probabilistic modeling including linear regression

  2. Section 3.1 for Bayesian models

  3. Sections 4 - 4.2 for error propagation and parameter uncertainties

  4. Section 9 for MCMC inference

Computing Setup#

For the examples in these lecture notes, as well as exercises and, more importantly, your future research, you will need a working installation of python with a package manager. If you don’t have those yet, have a look at the Computing Setup instructions.