This document outlines three thesis projects for the bachelor’s degree in mathematics or mathematics-economics at the University of Copenhagen. If there are fewer than 5 students, only one project will be offered, and if there are fewer than 10 students, only two projects will be offered. Which ones are offered depends on the interests of the majority.
There will be an info meeting on Wednesday, January 13, 10.00-12.00, on Zoom.
The thesis is written during block 3 and block 4, 2021. The start date is February 8 and the thesis is handed in on June 11. There is a subsequent oral defense.
Use the project descriptions below and take a look at the suggested literature to come up with a proposed title and description. I will then read and comment on your proposal and approve it afterwards. Fill out the contract form, send the PDF to me, and I will submit it with my signature. Remember to sign the contract. Here is some information that needs to go into the contract.
In block 3 we meet in groups once every second week for 1.5 hours; in block 4 the meetings are individual by default. In the block 3 group meetings, you (the students) will present some of the background literature and theory, and there will be time for both general questions and questions specific to what we are reading. In total there will be 4 group meetings (for each subject) and 3-4 individual meetings of 30 minutes. The first group meeting will be in week two of the block, so you have the first week to read and prepare for the presentation at the group meeting. It is still unknown whether meetings will be onsite or on Zoom.
As a student you are expected to be prepared for the meetings by having worked on the material that was agreed upon. If you have questions you are expected to have prepared the questions and be prepared to explain what you have done yourself to solve the problems. In particular, if you have questions regarding R code, you are expected to have prepared a minimal, reproducible example (specifics for R examples).
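As a rough illustration of what a minimal, reproducible example could look like, here is a sketch built on a small simulated data set. The model and the variable names are hypothetical; the point is only that the code runs on its own and shows the problem with as little code as possible.

```r
# Minimal reproducible example (hypothetical problem, simulated data).
set.seed(1)                         # fixed seed so everyone gets the same numbers
d <- data.frame(x = 1:10)
d$y <- 2 + 0.5 * d$x + rnorm(10)    # small simulated data set instead of the full data

fit <- lm(y ~ x, data = d)          # the smallest piece of code that shows the issue
coef(fit)                           # explain what you expected and what you got instead

sessionInfo()                       # R version and loaded packages
```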
As a supervisor I will be prepared to help you with technical questions as well as more general questions that you may have prepared. For the group meetings we can discuss general background knowledge, and we can also discuss ad hoc exercises if that is relevant. For the individual meetings you are welcome to send questions or samples of text for me to read and provide feedback on before the meeting. Note that I will generally not be able to find bugs in your R code.
The formal prerequisites are the course Mathematical Statistics (or Statistics 1 and 2, or equivalent), but you are also expected to be interested in the following:
The overall objective of the projects is to train you in working on your own with a larger data modeling problem. This includes narrowing down the statistical theory that you want to focus on and the corresponding analysis of data. It also includes searching for relevant literature.
You are encouraged to use R Markdown (and perhaps also Tidyverse as described in R4DS) to organize data analysis, simulations and other practical computations. But you should not hand in the raw result of such a document. That document should serve as a log of your activities and help you carry out reproducible analysis. The final report should be written as an independent document. Guidance on how to write the report can be found in the following references.
GL: Guidelines to writing a thesis in statistics by Björn Andersson, Shaobo Jin and Fan Yang-Wallentin from the Department of Statistics, Uppsala University. However, note that these are recommendations, not requirements. In particular, you can use any reference style, and you should use single line spacing, not 1.5 line spacing.
R4DS: R for Data Science by Garrett Grolemund and Hadley Wickham
RMD: R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, Garrett Grolemund
The following document may also help you: Advice on writing a report. Please note that the linked document was prepared specifically for the 7.5 ECTS project course “Project in Statistics”, and it has a focus on writing an applied project. The advice on using the IMRaD structure does not apply directly to the bachelor’s thesis that you are writing, which, in particular, should contain a theory section/chapter. But most of the general advice applies.
This project aims at predicting or understanding the dynamics of covid-19. One problem that is very important for policy makers in the current situation is to predict the number of infected and the health care resources needed in different countries and under different intervention scenarios, such as using face masks in public places or closing schools or workplaces.
The project should focus on some specific sub-target. The aim could for example be any of the following:
Prediction of the future number of infected. Here, one could start by understanding and trying to reproduce what the expert group at Statens Serum Institut has predicted, and then investigate sensitivity to misspecified parameters or model deviations.
Estimation of the reproduction number (what Statens Serum Institut calls “kontakttallet”, the number of persons that an infected person infects on average), which changes over time for example due to changed social codes for contacts or societal measures.
Estimation of the dark figure (“mørketallet”, the proportion of supposedly immune in the population) based on non-randomised data.
Estimation of disease-specific parameters, such as the distribution of the incubation period (period from infection until onset of symptoms), infectious period, infectiousness, proportion of infected without/with minor/with severe symptoms etc.
Parameter estimation in compartment models - which parameters can be identified, what data is necessary etc.
The project can also focus on deviations from the standard SEIR-model, such as models that include super spreaders, see e.g. this paper, which has attracted a lot of media attention.
Within each subject, the project can focus more on theoretical development and simulations, or on analysis of data, and one can choose to look at subsets of data (only Danish data, or data from some specific country, either because the epidemic is more severe there or because better data is available, or worldwide data).
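To make the standard SEIR model concrete, here is a minimal sketch of a deterministic simulation in base R using a simple forward Euler scheme. The function name `simulate_seir` and all parameter values are assumptions chosen for illustration only; they are not estimates for covid-19 and not the model used by the SSI expert group.

```r
# Deterministic SEIR model, solved with a simple forward Euler scheme:
#   dS/dt = -beta * S * I / N
#   dE/dt =  beta * S * I / N - sigma * E
#   dI/dt =  sigma * E - gamma * I
#   dR/dt =  gamma * I
simulate_seir <- function(beta, sigma, gamma, N, E0, I0, days, dt = 0.1) {
  n_steps <- round(days / dt)
  out <- matrix(NA_real_, nrow = n_steps + 1, ncol = 5,
                dimnames = list(NULL, c("time", "S", "E", "I", "R")))
  S <- N - E0 - I0; E <- E0; I <- I0; R <- 0
  out[1, ] <- c(0, S, E, I, R)
  for (k in seq_len(n_steps)) {
    new_exposed    <- beta * S * I / N * dt   # S -> E
    new_infectious <- sigma * E * dt          # E -> I
    new_removed    <- gamma * I * dt          # I -> R
    S <- S - new_exposed
    E <- E + new_exposed - new_infectious
    I <- I + new_infectious - new_removed
    R <- R + new_removed
    out[k + 1, ] <- c(k * dt, S, E, I, R)
  }
  as.data.frame(out)
}

# Illustrative parameter values only (hypothetical, not fitted to data).
beta <- 0.5; sigma <- 1 / 5; gamma <- 1 / 7
sim <- simulate_seir(beta, sigma, gamma, N = 1e6, E0 = 10, I0 = 0, days = 200)

R0 <- beta / gamma           # basic reproduction number in this model
max(sim$I)                   # height of the epidemic peak
sim$time[which.max(sim$I)]   # day of the peak
```

In practice one would typically use a proper ODE solver instead of the Euler scheme, and a stochastic version of the model is needed to study variability in small populations.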
The first part of this project will be like a journal club consisting of reading some papers on epidemiological models and different inference tools.
There are many data sources about covid-19, and many of them are updated on a daily or weekly basis. Depending on the problem you choose to focus on in your project, different data sets might be relevant. Here are some main sources giving the number of infected, tested, hospitalized, deaths etc. You should probably choose only one of these data sources, or you can find your own data, since most research published on covid-19 includes access to data.
SSIdata: Statens Serum Institut updates on a daily basis the numbers of infected, tested, hospitalized, deaths etc in Denmark. Some of these numbers are broken down by gender, age and regions. Notice that all data can be downloaded as CSV-files.
ICLdata: Imperial College London shares all data and code for all their published research.
JHdata: Johns Hopkins data resources. They collect data on covid-19 from all over the world.
There is an enormous amount of literature, and new papers on covid-19 are constantly appearing. Below are some suggestions, and in the end I provide some links to pages that have many more references. You should also search for literature yourself.
SIR: Introductory paper on compartmental models, which explains well the mathematics behind the SIR and other epidemiological models. Introductory paper on stochastic compartment models, where it is described how the deterministic models arise in the limit of infinite (large) populations. Public lectures explaining the modelling are given by e.g. Tom Britton and Robin Thompson. This book is also relevant.
ABM: Paper on Covasim, an agent-based model. The source code for Covasim is available via both the Python Package Index (via pip install covasim) and GitHub. An agent-based model keeps track of each individual in the population, and models how the epidemic evolves in the population of individuals. Such a model is much more computationally intensive, theoretical results are difficult to obtain, and there are many parameters. The advantage is that it is more realistic and can capture heterogeneities between individuals well.
SSI: Statens Serum Institut has collected all the reports and background material for their estimation of the reproduction number and the predictions of the future development of the epidemic in Denmark (all in Danish). Of particular interest is the Teknisk gennemgang af modellerne.
DTU: DTU shiny app provides information and animation of the model and the predictions made by the SSI expert group on the development of covid-19 in Denmark. It includes the source code for the simulations and predictions (in Danish).
Nature: Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe by a group from Imperial College London. You can also access the paper here.
covid19analytics: R-package to load and analyze updated time series worldwide data of reported cases for the CoViD-19 from the Johns Hopkins University Center for Systems Science and Engineering. See more information here.
epidemia: The epidemia R package is a beta-version of the R package used in the Nature paper.
easyR: Easy R introduction to SIR models in R, how to simulate and make least squares parameter estimation.
EpiEstim: The EpiEstim R package that Statens Serum Institut uses to estimate the time-varying reproduction number. See also this page and the paper behind the package. Note that (important!) technical details can be found in the appendix.
100R: Top 100 R resources on Novel COVID-19 Coronavirus provides lots of tools for visualization, downloading of data, and packages for analysis in R.
DELPHI: Developing the theory and practice of epidemiological forecasting from Carnegie Mellon University. They also have this epiforecast R package.
EMS: The European Mathematical Society maintains a page with links to covid-19 resources. Notice there is a list of public lectures.
ISI: International Statistical Institute also maintains a page with links to covid-19 resources.
Special Issue: On COVID-19 modelling.
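The least squares parameter estimation mentioned under easyR can be sketched in a few lines of base R. The example below is only an illustration on simulated data: the helper `sir_incidence`, the daily discrete-time SIR model, the Poisson noise and all parameter values are assumptions, and the recovery rate is treated as known so that only the transmission rate is estimated.

```r
# Discrete-time (daily) deterministic SIR model; returns new infections per day.
sir_incidence <- function(beta, gamma, N, I0, days) {
  S <- N - I0; I <- I0
  incidence <- numeric(days)
  for (t in seq_len(days)) {
    new_inf <- beta * S * I / N
    new_rem <- gamma * I
    S <- S - new_inf
    I <- I + new_inf - new_rem
    incidence[t] <- new_inf
  }
  incidence
}

set.seed(123)
N <- 1e5; gamma <- 1 / 7; beta_true <- 0.4   # hypothetical values
# Simulated "observed" daily cases: Poisson noise around the model curve.
observed <- rpois(60, sir_incidence(beta_true, gamma, N, I0 = 5, days = 60))

# Residual sum of squares as a function of beta (gamma assumed known here).
rss <- function(beta) {
  sum((observed - sir_incidence(beta, gamma, N, I0 = 5, days = 60))^2)
}

# Simple grid search for the least squares estimate of beta.
beta_grid <- seq(0.1, 1, by = 0.005)
beta_hat <- beta_grid[which.min(sapply(beta_grid, rss))]
beta_hat
```

A grid search is used here because it is easy to inspect; with several unknown parameters one would typically use a numerical optimizer such as `optim()` instead, and identifiability of the parameters becomes a real issue.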
This project is on estimation of causal effects from observational data. That is, data collected without carrying out a randomized allocation of “treatments” to individuals. This is a major challenge of contemporary statistics, and it’s conceptually completely different from what you have learned in previous courses.
The main purpose of this project is to work with the rigorous framework of causal models and causal inference. This can be seen as an extension of linear models as taught in Mathematical Statistics. It’s not so much a technical extension but rather a conceptual extension that clarifies how we actually want to use linear models (and other methods) in practice to estimate causal effects.
The project will focus on a recent discussion on standard methods versus causal estimation of racial biases in the US police force, based on the papers EA and AR below.
The final report should include both a theoretical part and a practical data analysis using the DAT data below. Several causal models could be considered and several different methods for estimation of a causal effect could be used. You can also focus on replicating (parts of) what is done in the papers EA and AR, and discuss the differences between the approaches and their pros and cons. The data set is huge, and you should probably choose to focus only on a part of the data set. You need to make a selection and present the relevant theory. Simulations using model examples derived from the data should be considered and used to investigate different methods. You need to understand logistic regression to replicate the analyses.
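As an indication of what a simulation study could look like, here is a minimal sketch in R where a binary "treatment" has no causal effect on a binary outcome, but both depend on a common cause. All variable names and parameter values are hypothetical, and the example illustrates generic confounding only; it is not the analysis or the bias mechanism studied in EA or AR.

```r
set.seed(2021)
n <- 20000
z <- rbinom(n, 1, 0.5)                  # binary confounder
a <- rbinom(n, 1, plogis(-1 + 2 * z))   # "treatment" depends on z
y <- rbinom(n, 1, plogis(-2 + 1.5 * z)) # outcome depends on z only, not on a
d <- data.frame(y, a, z)

# Logistic regression with and without adjustment for the confounder.
naive    <- glm(y ~ a,     family = binomial, data = d)
adjusted <- glm(y ~ a + z, family = binomial, data = d)

coef(naive)["a"]      # clearly positive although a has no causal effect
coef(adjusted)["a"]   # close to 0 after adjusting for z
```

Because the data-generating model is known, the estimates can be compared with the truth, which is exactly what makes simulations useful for investigating different estimation methods.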
Data for this project is the data analyzed in the two main papers EA and AR. The research question is to understand the effect of possible racial discrimination among police officers in the use of force by the US police.
DAT: Replication code and data for the article ‘Administrative Records Mask Racially Biased Policing’
EA: An Empirical Analysis of Racial Differences in Police Use of Force by Roland G. Fryer.
AR: Administrative Records Mask Racially Biased Policing by Dean Knox, Will Lowe and Jonathan Mummolo. Notice also Supplementary material.
CI: Causal Inference by Hernán MA and Robins JM.
LR: Logistic regression is described well on this Wikipedia page. Any statistics book on generalized linear models can also be used; see for example the literature list at the bottom of the Wikipedia page. In R, the function glm is useful.
CIS: Causal inference in statistics: An overview by Judea Pearl.
CIG: Causal Inference from Graphical Models by Steffen Lauritzen.
CI will be the main textbook for this project, in particular Part II of the book. Note that R code is available for the examples. CIS is a good supplement outlining Judea Pearl’s way of presenting the theory, and CIG is likewise a good supplement from Steffen Lauritzen’s perspective. FB is an interesting recent paper that attempts to benchmark methods for causal inference from observational data against randomized controlled trials in a marketing setting. ECI is interesting for further reading on mediation and interaction.
NOTE THAT A DATA SET HAS BEEN ADDED.
This project is on estimation of synchrony in the dynamics of the electric activity in multiple recorded neurons, called spike trains, by the use of loglinear point process models.
A spike train is a sequence of recorded times at which a neuron fires an action potential. Each spike may be considered to occur at a single point in time. Sequences of such spike times form spike trains. The total duration of a recorded spike train can range from less than a second to many minutes or even, in chronic recordings, to many days. Spike trains are considered to be the primary mode of information transmission in the nervous system.
There is much scientific interest in identifying synchrony, meaning events across two or more neurons that are nearly simultaneous at the time scale of the recordings, since it is believed to be important for brain function. A natural statistical approach is to discretize time, using short time bins, and to introduce loglinear models for dependency among neurons.
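The idea of discretizing time and using a loglinear model can be illustrated with a small simulation in R. The sketch below assumes two neurons with constant firing probabilities per bin and a hypothetical source of shared events; all numbers are made up. It is a time-homogeneous simplification for illustration and not the method of any of the papers listed for this project.

```r
set.seed(42)
n_bins <- 100000                      # e.g. 100 s of recording in 1 ms bins (assumed)
common <- rbinom(n_bins, 1, 0.002)    # shared events that make both neurons fire
x1 <- pmax(rbinom(n_bins, 1, 0.02), common)   # binary spike indicator, neuron 1
x2 <- pmax(rbinom(n_bins, 1, 0.03), common)   # binary spike indicator, neuron 2

# 2 x 2 table of joint spike patterns, in long format for glm().
counts <- as.data.frame(table(x1 = x1, x2 = x2), responseName = "n")

# Saturated loglinear model: log E[n] = b0 + b1*x1 + b2*x2 + b12*x1*x2.
# The interaction b12 is the log odds ratio; b12 = 0 means independent firing.
fit <- glm(n ~ x1 * x2, family = poisson, data = counts)
b12 <- unname(coef(fit)["x11:x21"])
b12                                   # positive: excess synchronous spikes
```

With real data the firing rates vary over time and depend on the spiking history, so the main effects would have to be modelled as well before the interaction term can be interpreted as synchrony.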
Data for this project comes from the paper RSOS and is in the zip file available here. The data set consists of many MATLAB files. If you have access to MATLAB, you can already look into the data now. Within the next week I will upload a small R program for reading and understanding the data in R. For a start, you should install the package R.matlab; the following code will then load one of the data sets:
library(R.matlab)
dat1 <- readMat("data/mj081024a_4_0_ev.mat")
However, that file is difficult to understand (it is easier in MATLAB). I will update the R code to make the data set easier to understand.
KKL: The paper by Kass, Kelly and Loh suggests a method to assess synchrony in multiple spike trains using loglinear point process models.
KEB: The book Analysis of Neural Data by Robert Kass, Uri Eden and Emery Brown contains all the needed theory. In particular, this chapter provides the background for the loglinear regression, and this chapter provides the background for point process models.
TEFDB: This paper by Truccolo, Eden, Fellows, Donoghue and Brown introduces the statistical framework for the point process models in neuroscience.
T: This paper by Brown, Barbieri, Ventura, Kass and Frank describes the time-rescaling theorem for model checking (assessing goodness of fit).
RSOS: The data set for this project is taken from the following paper. Section 2.1 and Figure 2 are relevant, and the data can be accessed from the Supplemental Material. The data was also analyzed in the following paper.
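The time-rescaling theorem referred to under T can be illustrated with a short simulation in R. The sketch below assumes an inhomogeneous Poisson process with a made-up sinusoidal intensity, simulated by thinning; the functions `lambda` and `Lambda` and all numbers are hypothetical. For real spike trains the intensity would come from a fitted point process model.

```r
set.seed(7)
# Assumed intensity (spikes per second) and its integral from 0 to t.
lambda <- function(t) 20 + 15 * sin(2 * pi * t)
Lambda <- function(t) 20 * t + 15 * (1 - cos(2 * pi * t)) / (2 * pi)

# Simulate the process on [0, T_end] by thinning a homogeneous Poisson process.
T_end <- 30; lambda_max <- 35
n_cand <- rpois(1, lambda_max * T_end)
cand <- sort(runif(n_cand, 0, T_end))
spikes <- cand[runif(n_cand) < lambda(cand) / lambda_max]

# Time rescaling: under the correct model the rescaled interspike intervals z
# are independent Exp(1), so u = 1 - exp(-z) is Uniform(0, 1).
z <- diff(c(0, Lambda(spikes)))
u <- 1 - exp(-z)
ks.test(u, "punif")    # Kolmogorov-Smirnov test of uniformity
```

Replacing `Lambda` by the integrated intensity of a misspecified model shows how the check reacts when the model does not fit.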