Formalities
- Regarding the contract
- More information
Prerequisites
Overall objective
General advice
Projects

This document outlines four thesis projects for the bachelor’s degree in mathematics or mathematics-economics at the University of Copenhagen. If there are fewer than 5 students, only one project will be offered, and if there are fewer than 10 students, only two projects will be offered. Which ones are offered depends on the interests of the majority.

There will be an info meeting on Monday, January 31, 10.15-11.45, in Auditorium 9 at HCØ.

Formalities

The thesis is written during block 3 and block 4, 2022. The start date is February 7 and the thesis is handed in on June 10. There is a subsequent oral defense.

The thesis can be written in Danish or English.
It’s a 15 ECTS project and you should expect to write between 30 and 45 pages.
You should be signed up via Selvbetjeningen.
You will have to decide which project you will work on by February 1 and email your choice to susanne@math.ku.dk. It will then be decided which projects will be offered.
You will have to send me a proposed title and description of your project by February 7 at 15:00 (and I will give you feedback).
You will have to fill out and submit the contract before February 10.

Regarding the contract

Use the project descriptions below and take a look at the suggested literature to come up with a proposed title and description. I will then read and comment on your proposal and approve it afterwards. Fill out the contract form, send the pdf to me, and I will submit it with my signature. Remember to sign the contract. The following information needs to go into the contract.

Meetings will be held in groups once every second week for 1.5 hours during the first block of the project (block 3), and individually during the second block (block 4). In the group meetings, you (the students) will present some of the background literature and theory, and we will have time for questions, both general questions and questions specific to what we are reading. The meetings in the second block will be individual by default. There will be 4 group meetings (for each subject) and 3 individual meetings of 30 minutes. The first group meeting will be in week two of the block, so you have the first week to read and prepare for the presentation at the group meeting. For now it seems that onsite meetings are possible. The individual meetings can be onsite or on Zoom.

As a student you are expected to be prepared for the meetings by having worked on the material that was agreed upon. If you have questions you are expected to have prepared the questions and be prepared to explain what you have done yourself to solve the problems. In particular, if you have questions regarding R code, you are expected to have prepared a minimal, reproducible example (specifics for R examples).

As a supervisor I will be prepared to help you with technical questions as well as more general questions that you may have prepared. For the group meetings we can discuss general background knowledge, and we can also discuss ad hoc exercises if that is relevant. For the individual meetings you are welcome to send questions or samples of text for me to read and provide feedback on before the meeting. Note that I will generally not be able to find bugs in your R code.

More information

Responsibilities and project contract

Studieordning (curriculum), see Bilag 1 (Appendix 1).

Studieordning, matematik (curriculum, mathematics), see Bilag 3 (Appendix 3) for the formal thesis objectives.

Prerequisites

The formal prerequisites are the course Mathematical Statistics (or Statistics 1 and 2, or equivalent), but you are also expected to be interested in the following:

carry out data analysis and model validation on real data
implement models and/or data analyses (e.g. by writing R scripts)
learn to use new software packages and functions
find relevant literature
independently read up on the background theory of the project
write a project that reflects theory as well as applications

Overall objective

The overall objective of the projects is to train you in working on your own with a larger data modeling problem. This includes narrowing down the statistical theory that you want to focus on and the corresponding analysis of data. It also includes searching for relevant literature.

General advice

You are encouraged to use R Markdown (and perhaps also Tidyverse as described in R4DS) to organize data analysis, simulations and other practical computations. But you should not hand in the raw result of such a document. That document should serve as a log of your activities and help you carry out reproducible analysis. The final report should be written as an independent document. Guidance on how to write the report can be found in the following references.

GL: Guidelines to writing a thesis in statistics by Björn Andersson, Shaobo Jin and Fan Yang-Wallentin from the Department of Statistics, Uppsala University. However, note that these are recommendations to help you; they are NOT requirements. In particular, you can use any reference style, and you should use single line spacing rather than 1.5 line spacing.

R4DS: R for Data Science by Garrett Grolemund and Hadley Wickham

RMD: R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, Garrett Grolemund

The following document may also help you: Advice on writing a report. Please note that the linked document was prepared specifically for the 7.5 ECTS project course “Project in Statistics”, and it has a focus on writing an applied project. The advice on using the IMRaD structure does not apply directly to the bachelor’s thesis that you are writing, which, in particular, should contain a theory section/chapter. But most of the general advice applies.

Projects

The effect of global warming and sea surface temperatures on the movements of bowhead whales in the Arctic

Arctic species are under threat from global warming because of rapidly warming waters. For bowhead whales this is particularly challenging because they stay in the Arctic the entire year. This project is about analysing data from 84 bowhead whales in Baffin Bay – East Greenland. They were tagged during an 11-year period between 2001 and 2011 and their positions were regularly determined through GPS measurements. The main goal is to investigate the effect of sea surface temperature (SST) on their spatial distribution using Markov models. We can discuss relevant questions to study with the biologists who collected the data.

The main idea is to discretize the area (according to longitude and latitude) into cells, and then model the movements between cells as a Markov process whose transition probabilities depend on the surface temperatures in the different cells. This should be related to the time evolution of the increasing average temperatures within cells, to understand the climate effects on the movements. It is also possible to use the state-space model approach suggested in the paper StateSpaceModel below.
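As a rough illustration of the discretization idea, the following base R sketch assigns simulated positions to grid cells and tabulates the empirical transition matrix between cells. The track, the grid size and all variable names are invented for illustration and are not the whale data. Covariates such as SST are left out here; the msm package listed under Literature is designed for covariate effects on transitions.

```r
set.seed(1)

# Hypothetical track: a random walk in longitude/latitude (not real whale data)
n <- 500
lon <- -60 + cumsum(rnorm(n, sd = 0.3))
lat <- 70 + cumsum(rnorm(n, sd = 0.1))

# Discretize into a 3 x 3 grid covering the observed range (grid size is an assumption)
lon_bin <- cut(lon, breaks = seq(floor(min(lon)), ceiling(max(lon)), length.out = 4),
               include.lowest = TRUE, labels = FALSE)
lat_bin <- cut(lat, breaks = seq(floor(min(lat)), ceiling(max(lat)), length.out = 4),
               include.lowest = TRUE, labels = FALSE)
cell <- factor((lat_bin - 1) * 3 + lon_bin, levels = 1:9)

# Count transitions between consecutive positions
counts <- table(from = cell[-n], to = cell[-1])

# Row-normalize to estimate transition probabilities (the maximum likelihood
# estimate for a time-homogeneous Markov chain); rows of unvisited cells are NaN
P_hat <- prop.table(counts, margin = 1)
round(P_hat, 2)
```

With real data, the positions would first have to be brought onto a regular time grid, since the tabulation assumes equally spaced observations.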

Data

The data will be uploaded here soon (if the project is chosen).

Copernicus: Data on environmental covariates can be downloaded here, measured by the satellite system of Copernicus Climate Change Service.

Literature

BowheadWhale: Sea surface temperature predicts the movements of an Arctic cetacean: the bowhead whale. This paper presents the data and a statistical analysis of the data. The figures in the paper provide a good introduction to the data.

msm: The R-package msm fits Markov models with effects of covariates on transition matrices. This manual with worked examples is useful, in addition to the standard reference manual. This paper is also very useful, even if not new (it dates from when the package was first created, so the package has probably gained many features since).

StateSpaceModel: State–space models of individual animal movement provides a review of state space models for modelling animal movement. Let me know if you cannot access the paper (but it should be possible if you sit at HCØ when downloading it).

Predicting the dynamics of covid-19

This project aims at predicting or understanding the dynamics of covid-19. One particular problem which is very important for policy makers in the current situation is to predict the number of infected and the health care resources needed in different countries and under different scenarios of interventions, such as using face masks in public places or closing schools or workplaces.

The project should focus on some specific sub-target. The aim could for example be any of the following:

Prediction of the future number of infected. Here, one could start by understanding and trying to reproduce what the expert group at Statens Serum Institut has predicted, and then investigate sensitivity to misspecified parameters or model deviations.
Estimation of the reproduction number (what Statens Serum Institut calls “kontakttallet”, the number of persons that an infected person infects on average), which changes over time for example due to changed social codes for contacts or societal measures.
Estimation of the dark figure (“mørketallet”, the proportion of supposedly immune in the population) based on non-randomised data.
Estimation of disease specific parameters, such as the distribution of the incubation period (period from infection until onset of symptoms), infectious period, infectiousness, proportion of infected without/with minor/with severe symptoms etc.
Parameter estimation in compartment models - which parameters can be identified, what data is necessary etc.
The project can focus on deviations from the standard SEIR-model, such as models that include super spreaders, see e.g. this paper, which has attracted a lot of media attention.
The development of the pandemic worldwide where parts of the world have massive vaccine coverage, whereas other parts have limited access to vaccines. Estimates of effects of mutants.
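To make the compartment models mentioned in the list above concrete, here is a minimal sketch of a deterministic SEIR model solved with simple Euler steps in base R. The function name `simulate_seir` and all parameter values are hypothetical and chosen only for illustration; they are not estimates for covid-19, and this is not the model used by the SSI expert group.

```r
# Deterministic SEIR model solved with simple Euler steps.
# All parameter values are hypothetical and chosen for illustration only.
simulate_seir <- function(beta, sigma, gamma, N, E0, days, dt = 0.1) {
  steps <- round(days / dt)
  out <- matrix(NA_real_, nrow = steps + 1, ncol = 5,
                dimnames = list(NULL, c("time", "S", "E", "I", "R")))
  S <- N - E0; E <- E0; I <- 0; R <- 0
  out[1, ] <- c(0, S, E, I, R)
  for (k in seq_len(steps)) {
    new_exposed    <- beta * S * I / N * dt  # S -> E (infection)
    new_infectious <- sigma * E * dt         # E -> I (end of latent period)
    new_removed    <- gamma * I * dt         # I -> R (recovery or removal)
    S <- S - new_exposed
    E <- E + new_exposed - new_infectious
    I <- I + new_infectious - new_removed
    R <- R + new_removed
    out[k + 1, ] <- c(k * dt, S, E, I, R)
  }
  as.data.frame(out)
}

sim <- simulate_seir(beta = 0.5, sigma = 1 / 3, gamma = 1 / 5,
                     N = 1e6, E0 = 10, days = 200)

R0 <- 0.5 / (1 / 5)                      # basic reproduction number beta / gamma
peak_day <- sim$time[which.max(sim$I)]   # day with the most infectious individuals
```

In this simple model the basic reproduction number is beta / gamma. A sensitivity study could rerun `simulate_seir` over a grid of parameter values and compare the resulting curves.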

Within each subject, the project can focus more on theoretical development and simulations, or analysis of data, and one can choose to look at subsets of data (only Danish data, or data from some specific country, either because the epidemic is more severe there or because better data is available, or worldwide data).

The first part of this project will be like a journal club consisting of reading some papers on epidemiological models and different inference tools.

Data

There are many data sources about covid-19, and many of them are being updated on a daily or weekly basis. Depending on the problem you choose to focus on in your project, different data sets might be relevant. Here are some main sources giving numbers of infected, tested, hospitalized, deaths etc. Depending on your project, you should probably only choose one of these data sources - or you can also find your own data, since most research published on covid-19 includes access to data.

SSIdata: Statens Serum Institut updates on a daily basis the numbers of infected, tested, hospitalized, deaths etc in Denmark. Some of these numbers are broken down by gender, age and regions. Notice that all data can be downloaded as CSV-files.

ICLdata: Imperial College London shares all data and code for all their published research.

JHdata: Johns Hopkins data resources. They collect data on covid-19 from all over the world.

Literature

There is an enormous amount of literature, and new papers on covid-19 are constantly appearing. Below are some suggestions, and in the end I provide some links to pages that have many more references. You should also search for literature yourself.

SIR: Introductory paper on compartmental models, which explains well the mathematics behind the SIR and other epidemiological models. Introductory paper on stochastic compartment models, where it is described how the deterministic models arise in the limit of infinite (large) populations. Public lectures explaining the modelling include those by Tom Britton and Robin Thompson. This book is also relevant.

ABM: Paper on Covasim, an agent-based model. The source code for Covasim is available via both the Python Package Index (via pip install covasim) and GitHub. An agent-based model keeps track of each individual in the population, and models how the epidemic evolves in the population of individuals. It is much more computer intensive, it is difficult to obtain theoretical results, and there are many parameters. The advantage is that it is more realistic and can capture heterogeneities between individuals well.

SSI: Statens Serum Institut has collected all the reports and background material for their estimation of the reproduction number and the predictions of the future development of the epidemic in Denmark (all in Danish). Of particular interest is the Teknisk gennemgang af modellerne.

DTU: DTU shiny app provides information and animation of the model and the predictions made by the SSI expert group on the development of covid-19 in Denmark. It includes the source code for the simulations and predictions (in Danish).

Nature: Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe by a group from Imperial College London. You can also access the paper here.

Collider bias: Collider bias undermines our understanding of COVID-19 disease risk and severity, about the challenge of interpreting observational evidence from non-representative samples.

covid19analytics: R-package to load and analyze updated time series worldwide data of reported cases for the CoViD-19 from the Johns Hopkins University Center for Systems Science and Engineering. See more information here.

epidemia: The epidemia R package is a beta-version of the R package used in the Nature paper.

easyR: Easy R introduction to SIR models in R, how to simulate and make least squares parameter estimation.

EpiEstim: The EpiEstim R package that Statens Serum Institut uses to estimate the time-varying reproduction number. See also this page and the paper behind the package. Note that (important!) technical details can be found in the appendix.

StateSpaceModel: Estimating the time-varying reproduction number of COVID-19 with a state-space method is an interesting and more stable alternative for estimating the time-varying reproduction number.

IncompleteData: Regression Models for Understanding COVID-19 Epidemic Dynamics With Incomplete Data is a new paper suggesting methods to estimate different quantities, such as incidence, prevalence, and effective reproductive number, from the incomplete data that is available. Notice that there are also discussions of the paper by other authors in the journal. Let me know if you cannot access the paper (but it should be possible if you sit at HCØ when downloading it).

RrReview: Reproduction number (R) and growth rate (r) of the COVID-19 epidemic in the UK: methods of estimation, data sources, causes of heterogeneity, and use as a guide in policy formulation. This rapid review of the science of the reproduction number and growth rate of COVID-19 from the Royal Society is provided to assist in the understanding of COVID-19.

100R: Top 100 R resources on Novel COVID-19 Coronavirus provides lots of tools for visualization, downloading of data, and packages for analysis in R.

ICL: Imperial College London is at the forefront of modelling the coronavirus; in particular, code, data and tools can be downloaded. Here are pedagogical explanations of the relevant problems.

DELPHI: Developing the theory and practice of epidemiological forecasting from Carnegie Mellon University. They also have this epiforecast R package.

Johns Hopkins: Coronavirus Resource Center. A lot of information is collected here; among other things, they have this map that is widely used by the press.

EMS: The European Mathematical Society maintains a page with links to covid-19 resources. Notice there is a list of public lectures.

ISI: International Statistical Institute also maintains a page with links to covid-19 resources.

Causal effect estimation and racial biases in US police force

This project is on estimation of causal effects from observational data. That is, data collected without carrying out a randomized allocation of “treatments” to individuals. This is a major challenge of contemporary statistics, and it’s conceptually completely different from what you have learned in previous courses.

The main purpose of this project is to work with the rigorous framework of causal models and causal inference. This can be seen as an extension of linear models as taught in Mathematical Statistics. It’s not so much a technical extension but rather a conceptual extension that clarifies how we actually want to use linear models (and other methods) in practice to estimate causal effects.

The project will focus on a recent discussion on standard methods versus causal estimation of racial biases in the US police force, based on the papers EA and AR below.

The final report should include both a theoretical part and a practical data analysis using the DAT data below. Several causal models could be considered and several different methods for estimation of a causal effect could be used. You can also focus on replicating (parts of) what is done in the papers EA and AR, and discuss differences, pros and cons of the ways of doing it. The data set is huge, and you should probably choose to focus only on a part of the data set. You need to make a selection and present the relevant theory. Simulations using model examples derived from the data should be considered and used to investigate different methods. You need to understand logistic regression to replicate the analyses.
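A small simulation can show why the choice of method matters when treatment is not randomized. The sketch below uses an invented data-generating model with one binary confounder; it is not derived from the police data, and the variable names L, A and Y are placeholders. It compares a naive logistic regression with an adjusted one, and then computes a standardized risk difference (often called standardization or the g-formula) from the adjusted fit.

```r
set.seed(2022)
n <- 20000

# Hypothetical data-generating model (not the police data):
# L is a binary confounder, A a binary "treatment", Y a binary outcome.
L <- rbinom(n, 1, 0.5)
A <- rbinom(n, 1, plogis(-1 + 2 * L))               # L affects treatment
Y <- rbinom(n, 1, plogis(-2 + 0.5 * A + 1.5 * L))   # true conditional log-odds ratio of A is 0.5
dat <- data.frame(L, A, Y)

# Naive logistic regression ignoring the confounder
fit_naive <- glm(Y ~ A, family = binomial, data = dat)

# Logistic regression adjusting for the confounder
fit_adj <- glm(Y ~ A + L, family = binomial, data = dat)

# Standardization: average predicted risk with A set to 1 and to 0 for everyone
risk1 <- mean(predict(fit_adj, newdata = transform(dat, A = 1), type = "response"))
risk0 <- mean(predict(fit_adj, newdata = transform(dat, A = 0), type = "response"))

c(naive = unname(coef(fit_naive)["A"]),
  adjusted = unname(coef(fit_adj)["A"]),
  risk_difference = risk1 - risk0)
```

Because L raises both the probability of treatment and the probability of the outcome, the naive coefficient overstates the effect of A, while the adjusted coefficient should be close to 0.5. The same setup can be extended to study selection into the data set, which is a central issue in the discussion between EA and AR.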

Data

Data for this project is the data analyzed in the two main papers EA and AR. The research question is to understand the effect of possible racial discrimination among police officers in the use of force by the US police.

DAT: Replication code and data for the article ‘Administrative Records Mask Racially Biased Policing’

Literature

EA: An Empirical Analysis of Racial Differences in Police Use of Force by Roland G. Fryer.

AR: Administrative Records Mask Racially Biased Policing by Dean Knox, Will Lowe and Jonathan Mummolo. Notice also Supplementary material.

CI: Causal Inference by Hernán MA and Robins JM.

LR: Logistic regression is described well on this Wikipedia page. Any statistics book on generalized linear models can also be used, see for example the literature list at the bottom of the Wikipedia page. In R, the function glm is useful.

CIS: Causal inference in statistics: An overview by Judea Pearl.

CIG: Causal Inference from Graphical Models by Steffen Lauritzen.

ECI: Explanation in Causal Inference: Methods for Mediation and Interaction

FB: A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook

CI will be the main textbook for this project, in particular Part II of the book. Note that R code is available for the examples. CIS is a good supplement outlining Judea Pearl’s way of presenting the theory, and CIG is likewise a good supplement from Steffen Lauritzen’s perspective. FB is an interesting recent paper that attempts to benchmark methods for causal inference from observational data against randomized controlled trials in a marketing setting. ECI is interesting for further reading on mediation and interaction.

Neuroscience: understanding synchrony in spike trains

This project is on estimation of synchrony in the dynamics of the electric activity in multiple recorded neurons, called spike trains, by the use of loglinear point process models.

A spike train is a sequence of recorded times at which a neuron fires an action potential. Each spike may be considered to occur at a single point in time. Sequences of such spike times form spike trains. The total duration of a recorded spike train can range from less than a second to many minutes or even, in chronic recordings, to many days. Spike trains are considered to be the primary mode of information transmission in the nervous system.

There is much scientific interest in identifying synchrony, meaning events across two or more neurons that are nearly simultaneous at the time scale of the recordings, since it is believed to be important for brain function. A natural statistical approach is to discretize time, using short time bins, and to introduce loglinear models for dependency among neurons.
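The binning idea can be sketched in base R as follows. Two spike trains with some shared events are simulated, time is cut into short bins, and a loglinear (Poisson) model is fitted to the 2 x 2 table of joint spike patterns. The firing rates, the 5 ms bin width and the jitter are assumptions made for illustration; this is a stationary toy version and not the method of the KKL paper, which allows time-varying firing rates.

```r
set.seed(3)
T_total <- 100       # recording length in seconds (hypothetical)
bin_width <- 0.005   # 5 ms bins (an assumption for illustration)
n_bins <- round(T_total / bin_width)
breaks <- (0:n_bins) * bin_width

# Hypothetical spike times: independent spikes plus shared, slightly jittered events
common <- runif(rpois(1, 2 * T_total), 0, T_total)
spikes1 <- c(runif(rpois(1, 10 * T_total), 0, T_total), common)
spikes2 <- c(runif(rpois(1, 8 * T_total), 0, T_total),
             common + rnorm(length(common), sd = 0.001))

# Indicator of at least one spike per bin (spikes outside [0, T_total] are ignored)
to_bins <- function(s) {
  as.integer(tabulate(findInterval(s, breaks, rightmost.closed = TRUE),
                      nbins = n_bins) > 0)
}
x1 <- to_bins(spikes1)
x2 <- to_bins(spikes2)

# 2 x 2 table of joint patterns and a saturated loglinear model;
# the interaction term is the log odds ratio and measures excess synchrony
tab <- as.data.frame(table(x1 = x1, x2 = x2))
fit <- glm(Freq ~ x1 * x2, family = poisson, data = tab)
synchrony_log_odds <- unname(coef(fit)["x11:x21"])
```

An interaction clearly above zero indicates more coincident spikes than expected under independence. With real recordings one would also have to account for firing rates that change over time.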

Data

Data: Data for this project comes from the paper RSOS and is in the zip-file available here. The data set comes in many MATLAB files. If you have access to MATLAB you can load the data directly.

readdata: R program for reading and understanding the data together with the data set. The variables are described here. In short, you should download the package R.matlab, and the following code will load one of the data sets:

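# Requires the R.matlab package: install.packages("R.matlab")
library(R.matlab)

# Load one of the recordings; the result is a named list of the MATLAB variables
dat1 <- readMat("data/mj081024a_4_0_ev.mat")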

However, the resulting object is difficult to understand (it is easier in MATLAB). The R program readdata above is meant to make the data set easier to understand.

Literature

KKL: The paper by Kass, Kelly and Loh suggests a method to assess synchrony in multiple spike trains using loglinear point process models.

GLMCC: This paper suggests a method for reconstructing neuronal circuitry from parallel spike trains also using generalized linear models. The Python code is available here.

CoNNECT: This paper suggests using a convolutional neural network (machine learning) for estimating synaptic connectivity from spike trains. It has a Web application. The Python code is available here.

KEB: The book Analysis of Neural Data by Robert Kass, Uri Eden and Emery Brown contains all the needed theory. In particular, this chapter provides the background for the loglinear regression, and this chapter provides the background for point process models.

TEFDB: This paper by Truccolo, Eden, Fellows, Donoghue and Brown introduces the statistical framework for the point process models in neuroscience.

T: This paper by Brown, Barbieri, Ventura, Kass and Frank describes the time-rescaling theorem for model checking.

RSOS: The data set for this project is taken from the following paper. Section 2.1 and Figure 2 are relevant, and the data can be accessed from the Supplemental Material. The data was also analyzed in the following paper.
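The time-rescaling check mentioned under T above can be sketched in base R. Under a correctly specified intensity, the rescaled inter-spike intervals are independent standard exponential variables, so their transforms should look uniform. The intensity function below is invented for illustration, and the spike train is simulated from it by thinning, so the check should not systematically reject here.

```r
set.seed(4)

# Hypothetical intensity in spikes per second (an assumption for illustration)
lambda <- function(t) 20 + 15 * sin(2 * pi * t)
# Its integral from 0 to t, computed by hand
Lambda <- function(t) 20 * t + 15 / (2 * pi) * (1 - cos(2 * pi * t))

T_max <- 20
lambda_max <- 35   # upper bound on lambda(t), needed for thinning

# Simulate an inhomogeneous Poisson process by thinning
n_cand <- rpois(1, lambda_max * T_max)
cand <- sort(runif(n_cand, 0, T_max))
spikes <- cand[runif(n_cand) < lambda(cand) / lambda_max]

# Time rescaling: under the true model the rescaled intervals are Exp(1),
# so z should be uniform on (0, 1)
tau <- diff(c(0, Lambda(spikes)))
z <- 1 - exp(-tau)

# Kolmogorov-Smirnov test against the uniform distribution
ks <- ks.test(z, "punif")
ks$p.value
```

For a fitted model, `Lambda` would be replaced by the integrated estimated intensity, and a small p-value or a poor QQ-plot of z against the uniform quantiles would indicate lack of fit.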

Bachelor projects in statistics

Susanne Ditlevsen

January, 2022
