This document outlines three thesis projects for the bachelor's degree in mathematics or mathematics-economy at the University of Copenhagen. If there are fewer than 5 students, only one project will be offered, and if there are fewer than 10 students, only two projects will be offered. Which projects are offered depends on the interests of the majority.

There will be an info meeting on Monday, August 30, 9.15-10.45, in Auditorium 7.

Project proposals from previous years are available for spring 2021, fall 2020, spring 2020, spring 2019 and fall 2019.

Formalities

The thesis is written during block 1 and block 2, 2021. The start date is September 6 and the thesis is handed in on January 14. There is a subsequent oral defense.

Regarding the contract

Use the project descriptions below and take a look at the suggested literature to come up with a proposed title and description. I will then read and comment on your proposal and approve it afterwards. Fill out the contract form, send the pdf to me, and I will submit it with my signature. Remember to sign the contract. The following information needs to go into the contract.

The meetings will be group meetings once every second week for 1.5 hours during block 1, and individual meetings in block 2. In the block 1 group meetings, you (the students) will present some of the background literature and theory, and we will have time for questions, both general ones and questions specific to what we are reading. The block 2 meetings will be individual by default. There will be 4 group meetings (for each subject) and 3 individual meetings of 30 minutes. The first group meeting will be in week two of the block, so you have the first week to read and prepare for the presentation at the group meeting. For now it seems that onsite meetings are possible. The individual meetings can be onsite or on Zoom.

As a student you are expected to be prepared for the meetings by having worked on the material that was agreed upon. If you have questions, you are expected to have prepared them and to be able to explain what you have done yourself to solve the problem. In particular, if you have questions regarding R code, you are expected to have prepared a minimal, reproducible example (specifics for R examples).
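For illustration, a minimal, reproducible example contains a small built-in or simulated data set, the exact code you ran, and a precise question about the output. A hypothetical sketch:

# Minimal, reproducible example: built-in data, the exact code, a precise question
fit <- glm(am ~ mpg, family = binomial, data = mtcars)
summary(fit)
# Question: how should the mpg coefficient be interpreted on the odds scale?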

As a supervisor I will be prepared to help you with technical questions as well as more general questions that you may have prepared. For the group meetings we can discuss general background knowledge, and we can also discuss ad hoc exercises if that is relevant. For the individual meetings you are welcome to send questions or samples of text for me to read and provide feedback on before the meeting. Note that I will generally not be able to find bugs in your R code.

More information

Responsibilities and project contract

Studieordning (study regulations), see Bilag 1 (Appendix 1).

Studieordning, matematik (study regulations, mathematics), see Bilag 3 (Appendix 3) for the formal thesis objectives.

Prerequisites

The formal prerequisite is the course Mathematical Statistics (or Statistics 1 and 2, or equivalent), but you are also expected to take an interest in the topics described under the individual projects below.

Overall objective

The overall objective of the projects is to train you in working on your own with a larger data modeling problem. This includes narrowing down the statistical theory that you want to focus on and carrying out the corresponding analysis of data. It also includes searching for relevant literature.

General advice

You are encouraged to use R Markdown (and perhaps also the Tidyverse as described in R4DS) to organize data analysis, simulations and other practical computations. But you should not hand in the raw result of such a document; it should serve as a log of your activities and help you carry out reproducible analyses. The final report should be written as an independent document.
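As a starting point, a minimal R Markdown log file could look like the following sketch (the file contents and names are placeholders):

---
title: "Analysis log"
output: html_document
---

## Data

```{r data}
dat <- read.csv("data.csv")  # placeholder file name
summary(dat)
```

## Simulation

```{r simulation}
set.seed(1)
x <- rnorm(100)
hist(x)
```

Guidance on how to write the report can be found in the following references.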

GL: Guidelines to writing a thesis in statistics by Björn Andersson, Shaobo Jin and Fan Yang-Wallentin from the Department of Statistics, Uppsala University. Note, however, that these are recommendations to help you, NOT requirements; in particular, you can use any reference style, and you should use 1.0 rather than 1.5 line spacing.

R4DS: R for Data Science by Garrett Grolemund and Hadley Wickham

RMD: R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, Garrett Grolemund

The following document can probably also help you: Advice on writing a report. Please note that the linked document was prepared specifically for the 7.5 ECTS project course “Project in Statistics”, and it focuses on writing an applied project. The advice on using the IMRaD structure does not apply directly to the bachelor’s thesis that you are writing, which, in particular, should contain a theory section/chapter. But most of the general advice applies.

Projects

Predicting the dynamics of covid-19

This project aims at predicting or understanding the dynamics of covid-19. One problem which is particularly important for policy makers in the current situation is to predict the number of infected and the health care resources needed in different countries and under different intervention scenarios, such as requiring face masks in public places or closing schools or workplaces.

The project should focus on some specific sub-target. The aim could for example be any of the following:

  • Prediction of the future number of infected. Here, one could start by understanding and trying to reproduce what the expert group at Statens Serum Institut has predicted, and then investigate sensitivity to misspecified parameters or model deviations.

  • Estimation of the reproduction number (what Statens Serum Institut calls “kontakttallet”, the number of persons that an infected person infects on average), which changes over time, for example due to changed social contact patterns or societal measures.

  • Estimation of the dark figure (“mørketallet”, the proportion of supposedly immune in the population) based on non-randomised data.

  • Estimation of disease-specific parameters, such as the distribution of the latent period (the period from infection until onset of symptoms), the infectious period, infectiousness, the proportions of infected without symptoms, with minor symptoms or with severe symptoms, etc.

  • Parameter estimation in compartment models - which parameters can be identified, what data is necessary etc.

  • The project can focus on deviations from the standard SEIR-model, such as models that include super spreaders, see e.g. this paper, which has attracted a lot of media attention.

  • The development of the pandemic worldwide, where parts of the world have massive vaccine coverage whereas other parts have limited access to vaccines. Estimates of the effects of mutants.

Within each subject, the project can focus more on theoretical development and simulations, or on analysis of data, and one can choose to look at subsets of data (only Danish data, or data from some specific country, either because the epidemic is more severe there or because better data is available, or worldwide data).
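To give a flavor of the simulation side, here is a minimal sketch of a deterministic SIR model simulated with simple Euler steps in R; all parameter values are illustrative, not estimates:

# Deterministic SIR model via Euler integration (illustrative parameter values)
N <- 5.8e6                      # population size (roughly Denmark)
beta <- 0.3                     # transmission rate per day
gamma <- 0.1                    # recovery rate per day
S <- N - 100; I <- 100; R <- 0  # initial state: 100 infectious
dt <- 0.1                       # time step in days
steps <- 2000
out <- matrix(NA, nrow = steps, ncol = 3)
for (k in 1:steps) {
  newinf <- beta * S * I / N * dt
  newrec <- gamma * I * dt
  S <- S - newinf
  I <- I + newinf - newrec
  R <- R + newrec
  out[k, ] <- c(S, I, R)
}
matplot((1:steps) * dt, out, type = "l", lty = 1,
        xlab = "days", ylab = "individuals")
legend("right", legend = c("S", "I", "R"), col = 1:3, lty = 1)

Changing beta and gamma (and hence the basic reproduction number beta/gamma) shows how sensitive the epidemic curve is to the parameters.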

The first part of this project will be like a journal club consisting of reading some papers on epidemiological models and different inference tools.

Data

There are many data sources about covid-19, and many of them are updated on a daily or weekly basis. Depending on the problem you choose to focus on in your project, different data sets might be relevant. Here are some main sources giving numbers of infected, tested, hospitalized, deaths, etc. Depending on your project, you should probably choose only one of these data sources - or you can find your own data, since most published research on covid-19 includes access to data.

SSIdata: Statens Serum Institut updates the numbers of infected, tested, hospitalized, deaths, etc. in Denmark on a daily basis. Some of these numbers are broken down by gender, age and region. Notice that all data can be downloaded as CSV files.
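For example, a downloaded file can be read into R along these lines (the file name is a placeholder; read.csv2 handles the semicolon-separated format with decimal commas that is common in Danish CSV files):

# Read a semicolon-separated CSV file downloaded from SSI (placeholder name)
dat <- read.csv2("ssi_download.csv")
str(dat)   # inspect variables and types
head(dat)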

ICLdata: Imperial College London shares all data and code for all their published research.

JHdata: Johns Hopkins data resources. They collect data on covid-19 from all over the world.

Literature

There is an enormous amount of literature, and new papers on covid-19 are constantly appearing. Below are some suggestions, and at the end I provide some links to pages that have many more references. You should also search for literature yourself.

SIR: Introductory paper on compartmental models, which explains well the mathematics behind the SIR and other epidemiological models. Introductory paper on stochastic compartment models, where it is described how the deterministic models arise in the limit of infinite (large) populations. Public lectures explaining the modelling are given by e.g. Tom Britton and Robin Thompson. This book is also relevant.

ABM: Paper on Covasim, an agent-based model. The source code for Covasim is available via both the Python Package Index (via pip install covasim) and GitHub. An agent-based model keeps track of each individual in the population and models how the epidemic evolves in the population of individuals. It is much more computer intensive, it is difficult to obtain theoretical results, and there are many parameters. The advantage is that it is more realistic and can capture heterogeneities between individuals well.
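To illustrate the idea (Covasim itself is far richer), a toy agent-based SIR model in R for a fully mixed population could look as follows; all values are illustrative:

# Toy agent-based SIR: each individual has a state "S", "I" or "R"
set.seed(2)
N <- 1000; beta <- 0.3; gamma <- 0.1
state <- rep("S", N); state[1:5] <- "I"  # start with 5 infectious individuals
nI <- numeric(100)
for (day in 1:100) {
  p_inf <- 1 - exp(-beta * sum(state == "I") / N)  # per-susceptible daily risk
  newI  <- state == "S" & runif(N) < p_inf
  recov <- state == "I" & runif(N) < gamma
  state[newI] <- "I"; state[recov] <- "R"
  nI[day] <- sum(state == "I")
}
plot(nI, type = "l", xlab = "day", ylab = "number infectious")

Heterogeneity (age groups, households, super spreaders) is introduced by letting the individual risks differ, which is exactly where agent-based models become more flexible than compartment models.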

SSI: Statens Serum Institut has collected all the reports and background material for their estimation of the reproduction number and the predictions of the future development of the epidemic in Denmark (all in Danish). Of particular interest is the Teknisk gennemgang af modellerne.

DTU: The DTU shiny app provides information and an animation of the model and the predictions made by the SSI expert group on the development of covid-19 in Denmark. It includes the source code for the simulations and predictions (in Danish).

Nature: Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe by a group from Imperial College London. You can also access the paper here.

Collider bias: Collider bias undermines our understanding of COVID-19 disease risk and severity, on the challenge of interpreting observational evidence from non-representative samples.

covid19analytics: R package to load and analyze continuously updated worldwide time series data of reported covid-19 cases from the Johns Hopkins University Center for Systems Science and Engineering. See more information here.

epidemia: The epidemia R package is a beta-version of the R package used in the Nature paper.

easyR: Easy R introduction to SIR models in R, how to simulate and make least squares parameter estimation.

EpiEstim: The EpiEstim R package that Statens Serum Institut uses to estimate the time-varying reproduction number. See also this page and the paper behind the package. Note that (important!) technical details can be found in the appendix.
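A minimal sketch of using EpiEstim on a vector of daily case counts; the serial interval mean and standard deviation below are placeholders, not the values used by SSI:

library(EpiEstim)

# Daily incidence (placeholder numbers; use real case counts in practice)
incid <- c(10, 12, 15, 20, 28, 35, 40, 38, 30, 25, 22, 18, 15, 12)

# Parametric serial interval with illustrative mean/sd (in days)
res <- estimate_R(incid, method = "parametric_si",
                  config = make_config(list(mean_si = 4.7, std_si = 2.9)))
plot(res)  # estimated time-varying reproduction number with credible bands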

RrReview: Reproduction number (R) and growth rate (r) of the COVID-19 epidemic in the UK: methods of estimation, data sources, causes of heterogeneity, and use as a guide in policy formulation. This rapid review of the science of the reproduction number and growth rate of COVID-19 is provided by the Royal Society to assist in the understanding of COVID-19.

100R: Top 100 R resources on Novel COVID-19 Coronavirus provides lots of tools for visualization, downloading of data, and packages for analysis in R.

ICL: Imperial College London is at the forefront of modelling the coronavirus; in particular, code, data and tools can be downloaded. Here are pedagogical explanations of the relevant problems.

DELPHI: Developing the theory and practice of epidemiological forecasting from Carnegie Mellon University. They also have this epiforecast R package.

Johns Hopkins: Coronavirus Resource Center. A lot of information is collected here; among other things, they have this map, which is widely used by the press.

EMS: The European Mathematical Society maintains a page with links to covid-19 resources. Notice there is a list of public lectures.

ISI: International Statistical Institute also maintains a page with links to covid-19 resources.

Special Issue: On COVID-19 modelling.

Causal effect estimation and racial biases in US police force

This project is on estimation of causal effects from observational data. That is, data collected without carrying out a randomized allocation of “treatments” to individuals. This is a major challenge of contemporary statistics, and it’s conceptually completely different from what you have learned in previous courses.

The main purpose of this project is to work with the rigorous framework of causal models and causal inference. This can be seen as an extension of linear models as taught in Mathematical Statistics. It is not so much a technical extension as a conceptual one that clarifies how we actually want to use linear models (and other methods) in practice to estimate causal effects.

The project will focus on a recent discussion on standard methods versus causal estimation of racial biases in the US police force, based on the papers EA and AR below.

The final report should include both a theoretical part and a practical data analysis using the DAT data below. Several causal models could be considered, and several different methods for estimating a causal effect could be used. You can also focus on replicating (parts of) what is done in the papers EA and AR, and discuss the differences and the pros and cons of the different approaches. The data set is huge, and you should probably focus on only a part of it. You need to make a selection and present the relevant theory. Simulations using model examples derived from the data should be considered and used to investigate different methods. You need to understand logistic regression to replicate the analyses; a sketch is given below.
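As a pointer, logistic regression in R is fitted with glm; the variable names below are hypothetical placeholders, not actual columns of the DAT data:

# Logistic regression for a binary outcome (hypothetical variable names)
fit <- glm(force_used ~ civilian_race + encounter_type + precinct,
           family = binomial, data = dat)
summary(fit)
exp(coef(fit))  # coefficients on the odds-ratio scale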

Data

Data for this project is the data analyzed in the two main papers EA and AR. The research question is to understand the effect of possible racial discrimination among police officers on the use of force by the US police.

DAT: Replication code and data for the article ‘Administrative Records Mask Racially Biased Policing’

Literature

EA: An Empirical Analysis of Racial Differences in Police Use of Force by Roland G. Fryer.

AR: Administrative Records Mask Racially Biased Policing by Dean Knox, Will Lowe and Jonathan Mummolo. Notice also Supplementary material.

CI: Causal Inference by Hernán MA and Robins JM.

LR: Logistic regression is described well on this Wikipedia page. Any statistics book on generalized linear models can also be used; see for example the literature list at the bottom of the Wikipedia page. In R, the function glm is useful.

CIS: Causal inference in statistics: An overview by Judea Pearl.

CIG: Causal Inference from Graphical Models by Steffen Lauritzen.

ECI: Explanation in Causal Inference: Methods for Mediation and Interaction

FB: A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook

CI will be the main textbook for this project, in particular Part II of the book. Note that R code is available for the examples. CIS is a good supplement outlining Judea Pearl’s way of presenting the theory, and CIG is likewise a good supplement from Steffen Lauritzen’s perspective. FB is an interesting recent paper that attempts to benchmark methods for causal inference from observational data against randomized controlled trials in a marketing setting. ECI is interesting for further reading on mediation and interaction.
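To connect the theory to code, here is a minimal sketch of inverse probability weighting, one of the estimators treated in Part II of CI, on simulated data where the confounding structure is known by construction:

# IPW on simulated data: L confounds treatment A and outcome Y
set.seed(1)
n <- 5000
L <- rbinom(n, 1, 0.5)                           # binary confounder
A <- rbinom(n, 1, plogis(-1 + 2 * L))            # treatment depends on L
Y <- rbinom(n, 1, plogis(-2 + 1 * A + 1.5 * L))  # outcome depends on A and L

# Propensity score model and inverse probability weights
ps <- predict(glm(A ~ L, family = binomial), type = "response")
w  <- ifelse(A == 1, 1 / ps, 1 / (1 - ps))

# IPW estimate of the average causal risk difference
weighted.mean(Y[A == 1], w[A == 1]) - weighted.mean(Y[A == 0], w[A == 0])

# Naive, unadjusted contrast for comparison (biased by confounding through L)
mean(Y[A == 1]) - mean(Y[A == 0])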

Neuroscience: understanding synchrony in spike trains

This project is on estimating synchrony in the dynamics of the electrical activity of multiple recorded neurons, called spike trains, using loglinear point process models.

A spike train is a sequence of recorded times at which a neuron fires an action potential. Each spike may be considered to occur at a single point in time. The total duration of a recorded spike train can range from less than a second to many minutes or even, in chronic recordings, many days. Spike trains are considered to be the primary mode of information transmission in the nervous system.

There is much scientific interest in identifying synchrony, meaning events across two or more neurons that are nearly simultaneous at the time scale of the recordings, since it is believed to be important for brain function. A natural statistical approach is to discretize time using short time bins and to introduce loglinear models for the dependency among neurons, as sketched below.
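A minimal sketch of this binning approach for two neurons; the spike times and bin width are made up for illustration:

# Discretize two spike trains (times in seconds) into 5 ms bins
spikes1 <- c(0.012, 0.031, 0.047, 0.102, 0.150)  # hypothetical spike times
spikes2 <- c(0.013, 0.047, 0.090, 0.151)
breaks <- seq(0, 0.2, by = 0.005)
x1 <- as.integer(table(cut(spikes1, breaks)) > 0)  # 1 if neuron 1 fires in bin
x2 <- as.integer(table(cut(spikes2, breaks)) > 0)

# 2x2 table of joint firing; excess counts in the (1,1) cell suggest synchrony
tab <- as.data.frame(table(x1, x2))

# Loglinear independence model; lack of fit indicates dependence between neurons
fit <- glm(Freq ~ x1 + x2, family = poisson, data = tab)
summary(fit)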

Data

Data: Data for this project comes from the paper RSOS and is available in the zip file here. The data set comes in many MATLAB files. If you have access to MATLAB, you can load the data directly.

readdata: R program for reading and understanding the data, together with the data set. The variables are described here. In short, you should install the package R.matlab, and the following code will load one of the data sets:

library(R.matlab)   # install.packages("R.matlab") if not already installed

dat1 <- readMat("data/mj081024a_4_0_ev.mat")   # load one data set into an R list

However, that file is difficult to understand on its own (it is easier in MATLAB); the readdata R program linked above is meant to make the data set easier to understand.

Literature

KKL: The paper by Kass, Kelly and Loh suggests a method to assess synchrony in multiple spike trains using loglinear point process models.

GLMCC: This paper suggests a method for reconstructing neuronal circuitry from parallel spike trains, also using generalized linear models. The Python code is available here.

CoNNECT: This paper suggests using a convolutional neural network (machine learning) for estimating synaptic connectivity from spike trains. It has a web application. The Python code is available here.

KEB: The book Analysis of Neural Data by Robert Kass, Uri Eden and Emery Brown contains all the needed theory. In particular, this chapter provides the background for the loglinear regression, and this chapter provides the background for point process models.

TEFDB: This paper by Truccolo, Eden, Fellows, Donoghue and Brown introduces the statistical framework for the point process models in neuroscience.

T: This paper by Brown, Barbieri, Ventura, Kass and Frank describes the time-rescaling theorem used for model checking.
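In code, the time-rescaling check can be sketched as follows; the intensity function and spike times below are made up for illustration:

# Time-rescaling: under the true intensity lambda, the rescaled interspike
# intervals are iid Exp(1), so 1 - exp(-z) should be uniform on [0, 1]
lambda <- function(t) 5 + 3 * sin(2 * pi * t)    # hypothetical fitted intensity
spikes <- c(0.11, 0.25, 0.32, 0.58, 0.77, 0.93)  # hypothetical spike times

Lam <- sapply(spikes, function(s) integrate(lambda, 0, s)$value)
z <- diff(c(0, Lam))   # integrated intensity between consecutive spikes
u <- 1 - exp(-z)
ks.test(u, "punif")    # Kolmogorov-Smirnov test for uniformity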

RSOS: The data set for this project is taken from the following paper. Section 2.1 and Figure 2 are relevant, and the data can be accessed from the Supplemental Material. The data was also analyzed in the following paper.