This document outlines two thesis projects for the bachelor’s degree in mathematics or mathematics-economics at the University of Copenhagen. If there are fewer than 5 students, only one project will be offered; which one depends on the interests of the majority.

**There will be an info meeting on Monday, August 29,
13.15-14.45, in Auditorium 6 at HCØ.**

The thesis is written during block 1 and block 2, 2022. The start
date is **September 5** and the thesis is handed in on
**January 13**. There is a subsequent oral defense.

- It’s a 15 ECTS project and you should expect to write between 30 and 45 pages.
- You should be signed up via Selvbetjeningen.
- You will have to decide which project you will work on by August 31 (email: susanne@math.ku.dk). After that, it will be decided whether both projects or only one of them will be offered.
- You have to send a proposed title and description of your project by September 5 at 15:00 (and you will get feedback).
- You will have to fill out and submit the contract before September 8.

Use the project descriptions below and take a look at the suggested literature to come up with a proposed title and description. We will then read and comment on your proposal and approve it afterwards. Fill out the contract form, send the pdf to susanne@math.ku.dk, and we will submit it. Remember to sign the contract. The following information needs to go into the contract.

In block 1, we will meet in groups once every second week for 1.5 hours; in block 2, meetings will be individual by default. In the block 1 group meetings, you (the students) will present some of the background literature and theory, and we will have time for both general questions and questions specific to what we are reading. There will be 4 group meetings (for each subject) and 3 individual meetings of 30 minutes. The first group meeting will be in week two of the block, so you have the first week to read and prepare for the presentation at the group meeting. The individual meetings can be onsite or on Zoom.

As a student, you are expected to be prepared for the meetings by having worked on the material that was agreed upon. If you have questions, you are expected to have prepared them and to be ready to explain what you have already done yourself to solve the problems. In particular, if you have questions regarding R code, you are expected to have prepared a minimal, reproducible example (specifics for R examples).

As supervisors we will be prepared to help you with technical
questions as well as more general questions that you may have prepared.
For the group meetings we can discuss general background knowledge, and
we can also discuss *ad hoc* exercises if that is relevant. For
the individual meetings you are welcome to send questions or samples of
text for us to read and provide feedback on before the meeting. Note
that we will generally *not* be able to find bugs in your R
code.

Responsibilities and project contract

Curriculum (Studieordning), see Appendix (Bilag) 1.

Curriculum for mathematics (Studieordning, matematik), see Appendix (Bilag) 3 for the formal thesis objectives.

The formal prerequisites are the courses *Statistical Methods*
and *Mathematical Statistics* (or equivalent), but you are also
expected to be interested in the following:

- carry out data analysis and model validation on real data
- implement models and/or data analyses (e.g. by writing R scripts)
- learn to use new software packages and functions
- find relevant literature
- independently read up on the background theory of the project
- write a project that reflects theory as well as applications

The overall objective of the projects is to train you in working on your own with a larger data modeling problem. This includes narrowing down the statistical theory that you want to focus on and the corresponding analysis of data. It also includes searching for relevant literature.

You are encouraged to use R Markdown (and perhaps also Tidyverse as
described in R4DS) to organize data analysis, simulations and other
practical computations. But you *should not* hand in the raw
result of such a document. That document should serve as a log of your
activities and help you carry out reproducible analysis. The final
report should be written as an independent document. Guidance on how to
write the report can be found in the following references.

**GL**: Guidelines to
writing a thesis in statistics by Björn Andersson, Shaobo Jin and
Fan Yang-Wallentin from the Department of Statistics, Uppsala
University. Note, however, that these are recommendations to help you,
**NOT** requirements. In particular, you can use any reference style,
and you should use single line spacing rather than 1.5 line spacing.

**R4DS**: R for Data
Science by Garrett Grolemund and Hadley Wickham

**RMD**: R Markdown: The Definitive
Guide by Yihui Xie, J. J. Allaire, Garrett Grolemund

The following document may also help you: Advice on writing a report. Please note that the linked document was prepared specifically for the 7.5 ECTS project course “Project in Statistics”, and it has a focus on writing an applied project. The advice on using the IMRaD structure does not apply directly to the bachelor’s thesis that you are writing, which, in particular, should contain a theory section/chapter. But most of the general advice applies.

Arctic species are under threat from global warming due to rapidly warming waters. For bowhead whales it is particularly challenging because they stay in the Arctic the entire year. This project is about analysing data from 99 bowhead whales in Baffin Bay, West Greenland. They were tagged during an 11-year period between 2001 and 2011, and their positions were regularly determined through Argos measurements. A further 18 whales were tagged between 2017 and 2019, and their positions were determined with the more precise GPS measurements. The main goal is to investigate the effect of sea surface temperature (SST) on their spatial distribution using Markov models. We can discuss relevant questions to study with the biologists who collected the data.

The main idea is to discretize the area (according to longitude and
latitude) into cells, and then model the movements between cells as a
Markov process that depends on the surface temperatures and possibly
other environmental covariates in the different cells. This should be
related to the evolution over time of the increasing average temperatures
within cells, to understand the climate effects on the movements. It is
also possible to use the state-space model approach suggested in the paper
**StateSpaceModel** below.
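
As a rough illustration of the discretization idea, the following sketch maps positions to grid cells and estimates transition probabilities between cells from transition counts. It is written in Python purely for illustration (the same steps translate directly to R). The track, the grid origin and the cell sizes are invented for the example, and the sketch ignores covariates such as SST, irregular sampling times and measurement error, all of which a real analysis would have to handle.

```python
import math
from collections import Counter, defaultdict


def cell_of(lon, lat, lon0, lat0, width, height, n_cols):
    """Map a (lon, lat) position to the index of its grid cell (row-major)."""
    col = math.floor((lon - lon0) / width)
    row = math.floor((lat - lat0) / height)
    return row * n_cols + col


def transition_probabilities(cells):
    """Estimate P(next cell | current cell) by normalizing transition counts."""
    counts = defaultdict(Counter)
    for current, nxt in zip(cells, cells[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }


# Hypothetical track: (longitude, latitude) at equally spaced time points.
track = [(-60.5, 70.2), (-60.1, 70.4), (-59.4, 70.6), (-59.7, 70.3),
         (-60.3, 70.1), (-59.2, 70.8), (-59.6, 70.9)]

# Hypothetical 2 x 2 grid with origin (-61, 70); cells are 1 degree wide
# and 0.5 degrees high.
cells = [cell_of(lon, lat, lon0=-61.0, lat0=70.0, width=1.0, height=0.5, n_cols=2)
         for lon, lat in track]

P = transition_probabilities(cells)
for current in sorted(P):
    print(current, P[current])
```

In a covariate-dependent model, the empirical frequencies above would be replaced by transition probabilities that are functions of, for example, the SST in the destination cell.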

A link to the data will be provided if the project is chosen.

**Copernicus**: Data on environmental
covariates can be downloaded here, measured by the satellite system
of Copernicus Climate Change Service.

Let me know if you cannot access some paper (but it should be possible if you sit at HCØ when downloading it).

**BowheadWhale**: Sea surface
temperature predicts the movements of an Arctic cetacean: the bowhead
whale. This paper presents the data and a statistical analysis of
the data. The figures in the paper provide a good introduction to the
data.

**ReviewPaper**: Statistical
modelling of individual animal movement: an overview of key methods and
a discussion of practical challenges.

**StateSpaceModel**: State–space
models of individual animal movement is older than the
previous paper and probably overlaps with it to some extent.

**Tutorials**: GitHub
page. Here you can find tutorials on hidden Markov models and on
state-space models from one of the authors of the R-package
foieGras.

**HMMbook**: The book
by Zucchini, MacDonald and Langrock (2016): Hidden Markov Models for
Time Series, Second Edition. CRC Press.

**CTCRW**: Paper
on Continuous-time correlated random walk model for animal telemetry
data, by Johnson et al. (2008).

**moveHMM**: The
R-package moveHMM provides tools for animal movement modelling using
hidden Markov models. These include processing of tracking data, fitting
hidden Markov models to movement data, visualization of data and fitted
model, decoding of the state process. This is the manual.
This paper
is also useful. Here is a guide
for choosing initial parameters.

**momentuHMM**: The
R-package momentuHMM. Extended tools for analyzing telemetry data
using generalized hidden Markov models. Features of momentuHMM
(pronounced “momentum”) include data pre-processing and visualization,
fitting HMMs to location and auxiliary biotelemetry or environmental
data, biased and correlated random walk movement models, hierarchical
HMMs, multiple imputation for incorporating location measurement error
and missing data, user-specified design matrices and constraints for
covariate modelling of parameters, random effects, decoding of the state
process, visualization of fitted models, model checking and selection,
and simulation. This paper
by McClintock and Michelot (2018) is also useful.

**crawl**: The
R-package crawl. Fit continuous-time correlated random walk models
with time-indexed covariates to animal telemetry data. The model is fit
using the Kalman filter on a state-space version of the continuous-time
stochastic movement process. This guide is
useful.

**foieGras**: The
R-package foieGras. Fit Continuous-Time State-Space and Latent
Variable Models for Quality Control of Argos Satellite (and Other)
Telemetry Data and for Estimating Movement Behaviour. This paper
is also useful.

This project is on the estimation of causal effects from observational data, that is, data collected without carrying out a randomized allocation of “treatments” to individuals. This is a major challenge of contemporary statistics, and it’s conceptually completely different from what you have learned in previous courses.

The main purpose of this project is to work with the rigorous framework of causal models and causal inference. This can be seen as an extension of linear models as taught in Mathematical Statistics. It’s not so much a technical extension but rather a conceptual extension that clarifies how we actually want to use linear models (and other methods) in practice to estimate causal effects.

The project will focus on a recent discussion on standard methods
versus causal estimation of racial biases in the US police force, based
on the papers **EA** and **AR** below.

The final report should include both a theoretical part and a
practical data analysis using the **DAT** data below.
Several causal models could be considered and several different methods
for estimation of a causal effect could be used. You can also focus on
replicating (parts of) what is done in the papers **EA**
and **AR**, and discuss the differences and the pros and cons of
the ways of doing it. The data set is huge, and you should probably
choose to focus only on a part of it. You need to make a selection and
present the relevant theory. Simulations using model examples derived
from the data should be considered and used to investigate different
methods. You need to understand logistic regression to replicate the
analyses.
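
To give an idea of what a small simulation study involving logistic regression might look like, here is a minimal sketch in Python (for illustration only; in R the fit would be a call to `glm` with a binomial family). It simulates a binary outcome from a logistic model with invented parameter values and recovers them by Newton-Raphson maximization of the likelihood. The model, the parameter values and the sample size are assumptions made for the example and are not taken from the **DAT** data.

```python
import math
import random


def sigmoid(x):
    """Logistic function, mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_logistic(xs, ys, steps=25):
    """Fit P(y = 1 | x) = sigmoid(a + b * x) by Newton-Raphson."""
    a = b = 0.0
    for _ in range(steps):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system (information) * step = score.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Hypothetical data-generating model: intercept -0.5 and slope 1.0.
rng = random.Random(2022)
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
ys = [1 if rng.random() < sigmoid(-0.5 + 1.0 * x) else 0 for x in xs]

a_hat, b_hat = fit_logistic(xs, ys)
print(f"intercept: {a_hat:.3f}, slope: {b_hat:.3f}")
```

Repeating such a simulation over many data sets, with model examples derived from the real data, is one way to compare how different estimation methods behave.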

Data for this project is the data analyzed in the two main papers
**EA** and **AR**. The research question is to
understand the effect of possible racial discrimination among police
officers in the use of force by the US police.

**DAT**: Replication
code and data for the article ‘Administrative Records Mask Racially
Biased Policing’

**EA**: An
Empirical Analysis of Racial Differences in Police Use of Force by
Roland G. Fryer.

**AR**: Administrative
Records Mask Racially Biased Policing by Dean Knox, Will Lowe and
Jonathan Mummolo. Notice also Supplementary
material.

**ECI**: Elements
of Causal Inference by Jonas Peters, Dominik Janzing and Bernhard
Schölkopf.

**CI**: Causal
Inference for Statistics, Social and Biomedical Sciences by Guido
Imbens and Donald Rubin.

**LR**: Logistic regression is described well on this Wikipedia
page. Any statistics book on generalized linear models can also be
used; see, for example, the literature list at the bottom of the Wikipedia
page. In R, the function glm
is useful.

**CISCM**: Causal inference
in statistics: An overview by Judea Pearl.

**CIPO**: Causal
Inference Using Potential Outcomes: Design, Modeling, Decisions by
Donald Rubin.

There are two modeling frameworks in causality: Structural causal
models (SCMs) described in **ECI** and potential outcome
(PO) models discussed in **CI**. For this project, you can
choose the framework you prefer. In **ECI** the most
relevant part is chapter 6 and in particular the discussion on covariate
adjustment. In **CI** the model is explained in chapter 3
and the relevant methods in chapter 12 and following.
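
The idea behind covariate adjustment can be illustrated with a small simulation, sketched here in Python for illustration only. The model is invented: a binary confounder `z` affects both the treatment `t` and the outcome `y`, and the true causal effect of `t` on `y` is set to 1. The naive comparison of treated and untreated individuals is then biased, while averaging the within-stratum differences over the distribution of `z` (the adjustment formula) recovers the effect. This is a toy example under the assumption that `z` is the only confounder and is fully observed, which is exactly what is in question in real observational data.

```python
import random


def simulate(n, seed=1):
    """Simulate (z, t, y) from a hypothetical model with a binary confounder z."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        # The confounder makes treatment more likely ...
        t = 1 if rng.random() < (0.8 if z else 0.2) else 0
        # ... and also raises the outcome; the true effect of t is 1.0.
        y = 1.0 * t + 2.0 * z + rng.gauss(0.0, 1.0)
        data.append((z, t, y))
    return data


def mean(values):
    return sum(values) / len(values)


def naive_estimate(data):
    """Difference in mean outcome between treated and untreated, ignoring z."""
    return (mean([y for _, t, y in data if t == 1])
            - mean([y for _, t, y in data if t == 0]))


def adjusted_estimate(data):
    """Adjustment formula: weight within-stratum differences by P(z)."""
    total = 0.0
    for z_val in (0, 1):
        stratum = [(t, y) for z, t, y in data if z == z_val]
        diff = (mean([y for t, y in stratum if t == 1])
                - mean([y for t, y in stratum if t == 0]))
        total += diff * len(stratum) / len(data)
    return total


data = simulate(20000)
naive = naive_estimate(data)
adjusted = adjusted_estimate(data)
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")
```

Under this model the naive difference is biased upwards (its expectation is 2.2 rather than 1), while the adjusted estimate should be close to 1. In the PO framework the same estimator arises from stratifying on `z` under unconfounded treatment assignment.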

CISCM is a good supplement outlining the SCM way of presenting the theory and CIPO gives a short overview of the PO perspective.