Formalities
- Regarding the contract
- More information
Prerequisites
Overall objective
General advice
Projects
- Predicting the dynamics of covid-19
  - Data
  - Literature
- Causal effect estimation and racial biases in US police force
  - Data
  - Literature

This document outlines two thesis projects for the bachelor’s degree in mathematics or mathematics-economy at the University of Copenhagen.

There will be an info meeting on Friday, August 28, 15.00-16.00, in auditorium 2 in the August Krogh Building (AKB).

Previous years’ project proposals are available for spring 2020, spring 2019 and fall 2019.

Formalities

The thesis is written during block 1 and block 2, 2020/2021. The start date is August 31 and the thesis is handed in on January 15. There is a subsequent oral defense.

The thesis can be written in Danish or English.
It’s a 15 ECTS project and you should expect to write between 30 and 45 pages.
You should be signed up via Selvbetjeningen.
You will have to decide which project you will work on by August 31.
You will have to send me a proposed title and description of your project by September 1 at 16.00 (and I will give you feedback).
You will have to fill out and submit the contract before September 3.

Regarding the contract

Use the project descriptions below and take a look at the suggested literature to come up with a proposed title and description. I will then read and comment on your proposal and approve it afterwards. Fill out the contract form, send the pdf to me, and I will submit it with my approval. You do not need my signature. The following information needs to go into the contract.

The meeting frequency will be once every second week for two hours during block 1 and once every second week for 45 minutes during block 2. The block 1 meetings will be in groups. Here, you (the students) will present some of the background literature and theory, and we will have time for questions, both general questions and questions specific to what we are reading. The block 2 meetings will be individual meetings by default. There will be four group meetings (for each subject) and three individual meetings in total. The first group meeting will be in week two of the block, so you have the first week to read and prepare for the presentation at the group meeting. It is still unknown whether meetings will be onsite or on Zoom. I hope to hold group meetings onsite, and individual meetings will probably be on Zoom.

As a student you are expected to be prepared for the meetings by having worked on the material that was agreed upon. If you have questions you are expected to have prepared the questions and be prepared to explain what you have done yourself to solve the problems. In particular, if you have questions regarding R code, you are expected to have prepared a minimal, reproducible example (specifics for R examples).
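
As an illustration of what a minimal, reproducible example could look like, here is a short sketch. The data and the model are made up for the purpose; the point is only that the script runs on its own, fixes the random seed, and reports the R setup.

```r
# Minimal, reproducible example: everything needed to rerun the problem
# is contained in this one script (data and model are made up for illustration).
set.seed(1)                                  # fix the random numbers
d <- data.frame(x = 1:10)
d$y <- 2 + 0.5 * d$x + rnorm(10, sd = 0.2)   # small simulated data set

fit <- lm(y ~ x, data = d)
coef(fit)                                    # say what you expected to see here

sessionInfo()                                # R version and loaded packages
```

When you send such an example, also state what you expected the output to be and how the actual output differs.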

As a supervisor I will be prepared to help you with technical questions as well as more general questions that you may have prepared. For the group meetings we can discuss general background knowledge, and we can also discuss ad hoc exercises if that is relevant. For the individual meetings you are welcome to send questions or samples of text for me to read and provide feedback on before the meeting. Note that I will generally not be able to find bugs in your R code.

More information

Responsibilities and project contract

Studieordning, see Bilag 1.

Studieordning, matematik, see Bilag 3 for the formal thesis objectives.

Prerequisites

The formal prerequisite is the course Mathematical Statistics (or Statistics 1 and 2, or equivalent), but you are also expected to be interested in the following:

carrying out data analysis and model validation on real data
implementing models and/or data analyses (e.g. by writing R scripts)
learning to use new software packages and functions
independently reading up on the background theory of the project
writing a project that reflects theory as well as applications

Overall objective

The overall objective of the projects is to train you in working on your own with a larger data modeling problem. This includes narrowing down the statistical theory that you want to focus on and the corresponding analysis of data.

General advice

You are encouraged to use R Markdown (and perhaps also Tidyverse as described in R4DS) to organize data analysis, simulations and other practical computations. But you should not hand in the raw result of such a document. That document should serve as a log of your activities and help you carry out reproducible analysis. The final report should be written as an independent document. Guidance on how to write the report will be provided later.

R4DS: R for Data Science by Garrett Grolemund and Hadley Wickham

RMD: R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, Garrett Grolemund

The following document will probably help you: Advice on writing a report. Please note that the linked document was prepared specifically for the 7.5 ECTS project course “Project in Statistics”, and it has a focus on writing an applied project. The advice on using the IMRaD structure does not apply directly to the bachelor’s thesis that you are writing, which, in particular, should contain a theory section/chapter. But most of the general advice applies.

Projects

Predicting the dynamics of covid-19

This project aims at predicting or understanding the dynamics of covid-19. One particular problem, which is very important for policy makers in the current situation, is to predict the number of infected and the health care resources needed in different countries and under different intervention scenarios, such as using face masks in public places or closing schools or workplaces.

The project should focus on some specific sub-target. The aim could for example be any of the following:

Prediction of the future number of infected. Here, one could start by understanding and trying to reproduce what the expert group at Statens Serum Institut has predicted, and then investigate sensitivity to misspecified parameters or model deviations.
Estimation of the reproduction number (what Statens Serum Institut calls “kontakttallet”, the number of persons that an infected person infects on average), which changes over time for example due to changed social codes for contacts or societal measures.
Estimation of the dark figure (“mørketallet”, the proportion of supposedly immune in the population) based on non-randomised data.
Estimation of disease-specific parameters, such as the distribution of the incubation period (period from infection until onset of symptoms), infectious period, infectiousness, proportion of infected without/with minor/with severe symptoms etc.
Parameter estimation in compartment models - which parameters can be identified, what data is necessary etc.
The project can also focus on deviations from the standard SEIR-model, such as models that include super spreaders, see e.g. this paper, which has attracted a lot of media attention.
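
As a toy illustration of parameter estimation in a compartment model, the following sketch simulates a simple SIR model and recovers its two parameters by least squares. It is only a sketch under stated assumptions: the parameter values, the population size, the noise model and the function name `simulate_sir` are all made up for illustration, the ODE is solved with a plain Euler scheme, and real data would raise identifiability and reporting issues that are ignored here.

```r
# Toy SIR model: S' = -beta*S*I/N, I' = beta*S*I/N - gamma*I, R' = gamma*I.
# Solved with a simple Euler scheme to keep the sketch free of package dependencies.
simulate_sir <- function(beta, gamma, N = 1e5, I0 = 10, days = 60, dt = 0.1) {
  steps_per_day <- round(1 / dt)
  S <- N - I0; I <- I0; R <- 0
  infected <- numeric(days)
  for (day in seq_len(days)) {
    for (k in seq_len(steps_per_day)) {
      new_inf <- beta * S * I / N * dt   # new infections in this time step
      new_rec <- gamma * I * dt          # new recoveries in this time step
      S <- S - new_inf
      I <- I + new_inf - new_rec
      R <- R + new_rec
    }
    infected[day] <- I                   # number currently infected at end of day
  }
  infected
}

# Simulated "observations": true curve with multiplicative noise
# (hypothetical parameter values, chosen only for illustration).
set.seed(1)
true_par <- c(beta = 0.4, gamma = 0.2)
obs <- simulate_sir(true_par[["beta"]], true_par[["gamma"]]) *
  exp(rnorm(60, sd = 0.05))

# Least squares on the log scale; the parameters are also optimised on the
# log scale so that beta and gamma stay positive.
loss <- function(log_par) {
  pred <- simulate_sir(exp(log_par[1]), exp(log_par[2]))
  sum((log(obs) - log(pred))^2)
}
fit <- optim(log(c(0.3, 0.1)), loss)     # Nelder-Mead is the default method
est <- exp(fit$par)
names(est) <- c("beta", "gamma")
est
est[["beta"]] / est[["gamma"]]           # basic reproduction number in this model
```

In practice one would probably replace the Euler loop with a proper ODE solver (for example the deSolve package) and think carefully about which of the parameters the available data can actually identify.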

Within each subject, the project can focus more on theoretical development and simulations, or on analysis of data, and one can choose to look at subsets of data (only Danish data, data from some specific country, either because the epidemic is more severe there or because better data is available, or worldwide data).

The first part of this project will be like a journal club consisting of reading some papers on epidemiological models and different inference tools.

Data

There are many data sources on covid-19, and many of them are updated on a daily or weekly basis. Depending on the problem you choose to focus on in your project, different data sets might be relevant. Here are some main sources giving the number of infected, tested, hospitalized, deaths etc. Depending on your project, you should probably only choose one of these data sources, or you can find your own data, since most research published on covid-19 includes access to data.

SSIdata: Statens Serum Institut updates on a daily basis the numbers of infected, tested, hospitalized, deaths etc. in Denmark. Some of these numbers are broken down by gender, age and region. Notice that all data can be downloaded as CSV files. You can also see the numbers from Sundhedsstyrelsen (hopefully they agree with Statens Serum Institut).

ICLdata: Imperial College London shares all data and code for all their published research.

JHdata: Johns Hopkins data resources. They collect data on covid-19 from all over the world.

Literature

There is an enormous amount of literature, and new papers on covid-19 are constantly appearing. Below are some suggestions, and in the end I provide some links to pages that have many more references.

SIR: Introductory paper on compartmental models, which gives a good explanation of the mathematics behind the SIR and other epidemiological models. Here is a historical overview of this type of model. Public lectures explaining the modelling are e.g. Tom Britton and Robin Thompson. This book might also be relevant, or this book.

SSI: Statens Serum Institut has collected all the reports and background material for their estimation of the reproduction number and the predictions of the future development of the epidemic in Denmark (all in Danish). Of particular interest is the Teknisk gennemgang af modellerne.

DTU: DTU shiny app provides information and animation of the model and the predictions made by the SSI expert group on the development of covid-19 in Denmark. It includes the source code for the simulations and predictions (in Danish).

Nature: Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe by a group from Imperial College London. You can also access the paper here.

epidemia: The epidemia R package is a beta-version of the R package used in the Nature paper.

easyR: Easy R introduction to SIR models in R, how to simulate and make least squares parameter estimation.

EpiEstim: The EpiEstim R package that Statens Serum Institut uses to estimate the reproduction number. See also this page and the paper behind the package.

100R: Top 100 R resources on Novel COVID-19 Coronavirus provides lots of tools for visualization, downloading of data, and packages for analysis in R.

ICL: Imperial College London is at the forefront of modelling the coronavirus; in particular, code, data and tools can be downloaded. Here are pedagogical explanations of the relevant problems.

DELPHI: Developing the theory and practice of epidemiological forecasting from Carnegie Mellon University. They also have this epiforecast R package.

Johns Hopkins: Coronavirus Resource Center. A lot of information is collected here; among other things, they have this map that is widely used by the press.

EMS: The European Mathematical Society maintains a page with links to covid-19 resources. Notice there is a list of public lectures.

ISI: International Statistical Institute also maintains a page with links to covid-19 resources.

Special Issue: On COVID-19 modelling.

Causal effect estimation and racial biases in US police force

This project is on estimation of causal effects from observational data, that is, data collected without carrying out a randomized allocation of “treatments” to individuals. This is a major challenge of contemporary statistics, and it’s conceptually completely different from what you have learned in previous courses.

The main purpose of this project is to work with the rigorous framework of causal models and causal inference. This can be seen as an extension of linear models as taught in Mathematical Statistics. It’s not so much a technical extension but rather a conceptual extension that clarifies how we actually want to use linear models (and other methods) in practice to estimate causal effects.

The project will focus on a recent discussion on standard methods versus causal estimation of racial biases in the US police force, based on the papers EA and AR below.

The final report should include both a theoretical part and a practical data analysis using the DAT data below. Several causal models could be considered, and several different methods for estimation of a causal effect could be used. You can also focus on replicating (parts of) what is done in the papers EA and AR, and discuss the differences and the pros and cons of the two approaches. The data set is huge, and you should probably choose to focus only on a part of it. You need to make a selection and present the relevant theory. Simulations using model examples derived from the data should be considered and used to investigate different methods.
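
To give an idea of what such a simulation could look like, here is a toy sketch of how a linear model can give a misleading answer when a common cause of treatment and outcome is ignored. The model is not derived from the police data; the variable names, the coefficients and the true causal effect of 1 are assumptions chosen purely for illustration.

```r
# Toy simulation (not derived from the police data): a confounder L affects
# both the binary "treatment" A and the outcome Y.
# The true causal effect of A on Y is set to 1.
set.seed(1)
n <- 10000
L <- rnorm(n)                              # confounder
A <- rbinom(n, 1, plogis(1.5 * L))         # treatment probability depends on L
Y <- 1 * A + 2 * L + rnorm(n)              # outcome depends on A and L

naive    <- coef(lm(Y ~ A))[["A"]]         # ignores L: biased away from 1
adjusted <- coef(lm(Y ~ A + L))[["A"]]     # adjusts for L: should be close to 1
c(naive = naive, adjusted = adjusted)
```

The same linear model machinery is used in both fits; what differs is the causal assumption about which variables need to be adjusted for. In a simulation study for the thesis, the data-generating model would instead be chosen to mimic relevant features of the real data.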

Data

Data for this project is the data analyzed in the two main papers EA and AR. The research question is to understand the effect of possible racial discrimination among police officers in the use of force by the US police.

DAT: Replication code and data for the article ‘Administrative Records Mask Racially Biased Policing’

Literature

EA: An Empirical Analysis of Racial Differences in Police Use of Force by Roland G. Fryer.

AR: Administrative Records Mask Racially Biased Policing by Dean Knox, Will Lowe and Jonathan Mummolo. Notice also Supplementary material.

CI: Causal Inference by Hernán MA and Robins JM.

CIS: Causal inference in statistics: An overview by Judea Pearl.

CIG: Causal Inference from Graphical Models by Steffen Lauritzen.

ECI: Explanation in Causal Inference: Methods for Mediation and Interaction

FB: A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook

CI will be the main textbook for this project, in particular Part II of the book. Note that R code is available for the examples. CIS is a good supplement outlining Judea Pearl’s way of presenting the theory, and CIG is likewise a good supplement from Steffen Lauritzen’s perspective. FB is an interesting recent paper that attempts to benchmark methods for causal inference from observational data against randomized controlled trials in a marketing setting. ECI is interesting for further reading on mediation and interaction.

Bachelor’s thesis in statistics

Susanne Ditlevsen

August 28, 2020
