Formalities
- Regarding the contract
- More information
Prerequisites
Overall objective
General advice
Projects
- The effect of global warming and sea surface temperatures on the movements of bowhead whales in the Arctic
- Causal effect estimation and racial biases in US police force
  - Data
  - Literature

This document outlines two thesis projects for the bachelor’s degree in mathematics or mathematics-economics at the University of Copenhagen. If fewer than five students sign up, only one project will be offered; which one will depend on the interests of the majority.

There will be an info meeting on Monday, August 29, 13.15-14.45, in Auditorium 6 at HCØ.

Formalities

The thesis is written during block 1 and block 2, 2022. The start date is September 5 and the thesis is handed in on January 13. There is a subsequent oral defense.

It’s a 15 ECTS project and you should expect to write between 30 and 45 pages.
You should be signed up via Selvbetjeningen.
You will have to decide which project you will work on by August 31 (email: susanne@math.ku.dk). It will then be decided whether both projects or only one of them will be offered.
You have to send a proposed title and description of your project by September 5 at 15:00, and you will get feedback on it.
You will have to fill out and submit the contract before September 8.

Regarding the contract

Use the project descriptions below and take a look at the suggested literature to come up with a proposed title and description. We will read and comment on your proposal and then approve it. Fill out the contract form, send the pdf to susanne@math.ku.dk, and we will submit it. Remember to sign the contract. The following information needs to go into the contract.

In block 1 we meet in groups once every second week for 1.5 hours; in block 2 the meetings are individual by default. In the block 1 group meetings, you (the students) will present some of the background literature and theory, and there will be time for questions, both general ones and ones specific to what we are reading. There will be 4 group meetings (for each subject) and 3 individual meetings of 30 minutes. The first group meeting will be in week two of the block, so you have the first week to read and prepare for the presentation at the group meeting. The individual meetings can be on site or on Zoom.

As a student you are expected to be prepared for the meetings by having worked on the material that was agreed upon. If you have questions, you are expected to have prepared them in advance and to be able to explain what you have already tried in order to solve the problem. In particular, if you have questions regarding R code, you are expected to have prepared a minimal, reproducible example (specifics for R examples).

As supervisors we will be prepared to help you with technical questions as well as more general questions that you may have prepared. For the group meetings we can discuss general background knowledge, and we can also discuss ad hoc exercises if that is relevant. For the individual meetings you are welcome to send questions or samples of text for us to read and provide feedback on before the meeting. Note that we will generally not be able to find bugs in your R code.

More information

Responsibilities and project contract

Studieordning, see Bilag 1.

Studieordning, matematik, see Bilag 3 for the formal thesis objectives.

Prerequisites

The formal prerequisites are the courses Statistical Methods and Mathematical Statistics (or equivalent), but you are also expected to be interested in the following:

- carry out data analysis and model validation on real data
- implement models and/or data analyses (e.g. by writing R scripts)
- learn to use new software packages and functions
- find relevant literature
- independently read up on the background theory of the project
- write a project that reflects theory as well as applications

Overall objective

The overall objective of the projects is to train you in working on your own with a larger data modeling problem. This includes narrowing down the statistical theory that you want to focus on and carrying out the corresponding analysis of data. It also includes searching for relevant literature.

General advice

You are encouraged to use R Markdown (and perhaps also Tidyverse as described in R4DS) to organize data analysis, simulations and other practical computations. But you should not hand in the raw result of such a document. That document should serve as a log of your activities and help you carry out reproducible analysis. The final report should be written as an independent document. Guidance on how to write the report can be found in the following references.

GL: Guidelines to writing a thesis in statistics by Björn Andersson, Shaobo Jin and Fan Yang-Wallentin from the Department of Statistics, Uppsala University. Note, however, that these are recommendations to help you, NOT requirements. In particular, you can use any reference style, and you should use single line spacing rather than 1.5 line spacing.

R4DS: R for Data Science by Garrett Grolemund and Hadley Wickham

RMD: R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, Garrett Grolemund

The following document may also help you: Advice on writing a report. Please note that the linked document was prepared specifically for the 7.5 ECTS project course “Project in Statistics”, and it has a focus on writing an applied project. The advice on using the IMRaD structure does not apply directly to the bachelor’s thesis that you are writing, which, in particular, should contain a theory section/chapter. But most of the general advice applies.

Projects

The effect of global warming and sea surface temperatures on the movements of bowhead whales in the Arctic

Arctic species are threatened by global warming because the water is warming rapidly. This is particularly challenging for bowhead whales because they stay in the Arctic the entire year. This project is about analysing data from 99 bowhead whales in Baffin Bay – East Greenland. They were tagged during an 11-year period between 2001 and 2011, and their positions were regularly determined through Argos measurements. A further 18 whales were tagged between 2017 and 2019, and their positions were determined with the more precise GPS measurements. The main goal is to investigate the effect of sea surface temperature (SST) on their spatial distribution using Markov models. We can discuss relevant questions to study with the biologists who collected the data.

The main idea is to discretize the area (by longitude and latitude) into cells, and then model the movements between cells as a Markov process that depends on the surface temperatures, and possibly other environmental covariates, in the different cells. This should be related to the evolution over time of the increasing average temperatures within cells, in order to understand the effects of climate on the movements. It is also possible to use the state-space model approach suggested in the paper StateSpaceModel below.
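As a rough illustration of the discretization idea, the sketch below maps positions to grid cells and estimates a transition matrix from consecutive cell visits. It is written in Python purely for illustration (the project itself would more naturally use R), the track is a simulated random walk rather than whale data, and the grid size and the helper names cell_index and transition_matrix are assumptions made for this example. Covariates such as SST are not included.

```python
import numpy as np

def cell_index(lon, lat, lon_edges, lat_edges):
    """Map positions to a single integer cell id on a lon/lat grid."""
    n_lat = len(lat_edges) - 1
    # np.digitize returns 1-based bin numbers for values inside the edges
    i = np.clip(np.digitize(lon, lon_edges) - 1, 0, len(lon_edges) - 2)
    j = np.clip(np.digitize(lat, lat_edges) - 1, 0, n_lat - 1)
    return i * n_lat + j

def transition_matrix(cells, n_cells):
    """Row-normalised counts of moves between consecutive cells."""
    counts = np.zeros((n_cells, n_cells))
    np.add.at(counts, (cells[:-1], cells[1:]), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows of cells that are never left (or never visited) stay at zero
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

# Hypothetical track: a random walk standing in for one animal's positions
rng = np.random.default_rng(1)
lon = -60 + np.cumsum(rng.normal(0, 0.5, size=500))
lat = 70 + np.cumsum(rng.normal(0, 0.2, size=500))

lon_edges = np.linspace(lon.min(), lon.max(), 5)  # 4 columns
lat_edges = np.linspace(lat.min(), lat.max(), 4)  # 3 rows
cells = cell_index(lon, lat, lon_edges, lat_edges)
P = transition_matrix(cells, n_cells=12)
```

Each non-empty row of P is an empirical estimate of the probabilities of moving from one cell to every other cell in one time step. One possible way to bring in temperature would be to let these probabilities depend on cell-level covariates, for example through a multinomial logistic model, but that choice is part of the project.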

Data

A link to the data will be provided if the project is chosen.

Copernicus: Data on environmental covariates, measured by the satellite system of the Copernicus Climate Change Service, can be downloaded here.

Literature

Let me know if you cannot access a paper (but it should be possible if you sit at HCØ when downloading it).

BowheadWhale: Sea surface temperature predicts the movements of an Arctic cetacean: the bowhead whale. This paper presents the data and a statistical analysis of the data. The figures in the paper provide a good introduction to the data.

ReviewPaper: Statistical modelling of individual animal movement: an overview of key methods and a discussion of practical challenges.

StateSpaceModel: State–space models of individual animal movement. This paper is older than the previous one and probably overlaps with it to some extent.

Tutorials: GitHub page. Here you can find tutorials on hidden Markov models and on state-space models from one of the authors of the R-package foieGras.

HMMbook: The book by Zucchini, MacDonald and Langrock (2016): Hidden Markov Models for Time Series, Second Edition. CRC Press.

CTCRW: Paper on Continuous-time correlated random walk model for animal telemetry data, by Johnson et al. (2008).

R-packages

moveHMM: The R-package moveHMM provides tools for animal movement modelling using hidden Markov models. These include processing of tracking data, fitting hidden Markov models to movement data, visualization of data and fitted model, decoding of the state process. This is the manual. This paper is also useful. Here is a guide for choosing initial parameters.

momentuHMM: The R-package momentuHMM. Extended tools for analyzing telemetry data using generalized hidden Markov models. Features of momentuHMM (pronounced “momentum”) include data pre-processing and visualization, fitting HMMs to location and auxiliary biotelemetry or environmental data, biased and correlated random walk movement models, hierarchical HMMs, multiple imputation for incorporating location measurement error and missing data, user-specified design matrices and constraints for covariate modelling of parameters, random effects, decoding of the state process, visualization of fitted models, model checking and selection, and simulation. This paper by McClintock and Michelot (2018) is also useful.

crawl: The R-package crawl. Fits continuous-time correlated random walk models with time-indexed covariates to animal telemetry data. The model is fitted using the Kalman filter on a state-space version of the continuous-time stochastic movement process. This guide is useful.

foieGras: The R-package foieGras. Fit Continuous-Time State-Space and Latent Variable Models for Quality Control of Argos Satellite (and Other) Telemetry Data and for Estimating Movement Behaviour. This paper is also useful.
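To give a feeling for what a Kalman filter does in a state-space model, here is a minimal sketch for a one-dimensional random walk observed with noise. It is a deliberately simplified, discrete-time toy model written in Python for illustration; it is not the continuous-time model used by crawl or foieGras, and the function name kalman_filter_local_level, the variances q and r, and the simulated data are all assumptions made for this example.

```python
import numpy as np

def kalman_filter_local_level(y, q, r, m0=0.0, p0=10.0):
    """Filtered means and variances for x_t = x_{t-1} + w_t, y_t = x_t + v_t."""
    means = np.empty(len(y))
    variances = np.empty(len(y))
    m, p = m0, p0
    for t, obs in enumerate(y):
        # Predict: the random walk keeps the mean and adds process variance q
        p_pred = p + q
        # Update: weight the new observation by the Kalman gain
        gain = p_pred / (p_pred + r)
        m = m + gain * (obs - m)
        p = (1.0 - gain) * p_pred
        means[t], variances[t] = m, p
    return means, variances

# Simulated "true" positions and noisy measurements of them
rng = np.random.default_rng(0)
n, q, r = 300, 0.01, 1.0
x = np.cumsum(rng.normal(0, np.sqrt(q), size=n))
y = x + rng.normal(0, np.sqrt(r), size=n)

means, variances = kalman_filter_local_level(y, q, r)
mse_obs = np.mean((y - x) ** 2)         # error of the raw measurements
mse_filter = np.mean((means - x) ** 2)  # error of the filtered positions
```

Comparing mse_filter with mse_obs shows how much the filter reduces the measurement error in this toy setting. The same predict and update steps extend to two-dimensional positions and irregular observation times, which is closer to what the packages above handle.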

Causal effect estimation and racial biases in US police force

This project is on estimation of causal effects from observational data, that is, data collected without carrying out a randomized allocation of “treatments” to individuals. This is a major challenge of contemporary statistics, and it’s conceptually completely different from what you have learned in previous courses.

The main purpose of this project is to work with the rigorous framework of causal models and causal inference. This can be seen as an extension of linear models as taught in Mathematical Statistics. It’s not so much a technical extension but rather a conceptual extension that clarifies how we actually want to use linear models (and other methods) in practice to estimate causal effects.

The project will focus on a recent discussion on standard methods versus causal estimation of racial biases in the US police force, based on the papers EA and AR below.

The final report should include both a theoretical part and a practical data analysis using the DAT data below. Several causal models could be considered and several different methods for estimation of a causal effect could be used. You can also focus on replicating (parts of) what is done in the papers EA and AR, and discuss the differences between the approaches and their pros and cons. The data set is huge, and you should probably choose to focus only on a part of it. You need to make a selection and present the relevant theory. Simulations using model examples derived from the data should be considered and used to investigate different methods. You need to understand logistic regression to replicate the analyses.
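Since logistic regression is needed for the replication, the following sketch shows how the maximum likelihood estimate can be computed by Newton-Raphson (iteratively reweighted least squares). It is written in Python for illustration only; in R the same fit is obtained with glm and a binomial family. The data are simulated, and the function name fit_logistic and the coefficient values are assumptions made for this example.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood fit by Newton-Raphson (iteratively reweighted least squares)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))  # fitted probabilities
        w = p * (1.0 - p)                    # binomial variance weights
        # Newton step: solve (X' W X) delta = X' (y - p)
        hessian = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    return beta

# Simulated data with known coefficients
rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # intercept and one covariate
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

beta_hat = fit_logistic(X, y)
```

With this sample size beta_hat should be close to true_beta, which is a useful sanity check before applying the same kind of model to a subset of the real data.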

Data

Data for this project is the data analyzed in the two main papers EA and AR. The research question is to understand the effect of possible racial discrimination among police officers in the use of force by the US police.

DAT: Replication code and data for the article ‘Administrative Records Mask Racially Biased Policing’

Literature

EA: An Empirical Analysis of Racial Differences in Police Use of Force by Roland G. Fryer.

AR: Administrative Records Mask Racially Biased Policing by Dean Knox, Will Lowe and Jonathan Mummolo. Notice also Supplementary material.

ECI: Elements of Causal Inference by Jonas Peters, Dominik Janzing and Bernhard Schölkopf.

CI: Causal Inference for Statistics, Social and Biomedical Sciences by Guido Imbens and Donald Rubin.

LR: Logistic regression is described well on this Wikipedia page. Any statistics book on generalized linear models can also be used; see for example the literature list at the bottom of the Wikipedia page. In R, the function glm is useful.

CISCM: Causal inference in statistics: An overview by Judea Pearl.

CIPO: Causal Inference Using Potential Outcomes: Design, Modeling, Decisions by Donald Rubin.

There are two modeling frameworks in causality: structural causal models (SCMs), described in ECI, and potential outcome (PO) models, discussed in CI. For this project, you can choose the framework you prefer. In ECI the most relevant part is chapter 6, in particular the discussion on covariate adjustment. In CI the model is explained in chapter 3 and the relevant methods in chapter 12 and following.

CISCM is a good supplement outlining the SCM way of presenting the theory and CIPO gives a short overview of the PO perspective.
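The idea of covariate adjustment can be illustrated with a small simulation, in the spirit of the simulations suggested for the project. The sketch below is written in Python for illustration only and assumes a very simple linear model with a single observed confounder z; the variable names, the coefficients and the helper ols are assumptions made for this example and are not taken from EA or AR.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
z = rng.normal(size=n)                               # confounder
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-2.0 * z)))  # treatment depends on z
y = 1.0 * t + 2.0 * z + rng.normal(size=n)           # true causal effect of t is 1

def ols(columns, y):
    """Least squares coefficients, with the intercept as the first entry."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

naive = ols([t], y)[1]        # ignores z, so confounding inflates the estimate
adjusted = ols([t, z], y)[1]  # adjusts for z and targets the causal effect
```

In this toy model the naive regression coefficient is far from the true effect of 1, while the adjusted coefficient is close to it. Adjustment works here because z is the only confounder and it is observed; it does not by itself deal with selection into the data.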

Bachelor projects in statistics

Susanne Ditlevsen and Niklas Pfister

September, 2022
