Formalities
- Regarding the contract
- More information
Prerequisites
Overall objective
General advice
Projects

This document outlines three proposed thesis projects for the bachelor’s degree in mathematics or mathematics-economy at the University of Copenhagen.

There will be an info meeting on Tuesday 3, 15.15-16.00, in meeting room 04-3-15.

Formalities

The thesis is written during block 1 and block 2, 2019. The start date is September 2 and the thesis is handed in on January 10. There is a subsequent oral defense.

The thesis can be written in Danish or English.
It’s a 15 ECTS project and you should expect to write between 30 and 45 pages.
You should be signed up via Selvbetjeningen.
You will have to fill out and submit the contract before September 2.

Regarding the contract

Use the project descriptions below and take a look at the suggested literature to come up with a proposed title and description. I will then read and comment on your proposal and approve it afterwards. I suggest that you email me a draft of the contract, and when I have commented on it, you can drop by my office for a signature. The following information needs to go into the contract.

The meeting frequency will be once every second week for two hours during block 1 and once every second week for one hour during block 2. The block 1 meetings will be in groups. The block 2 meetings will be individual meetings by default. There will be four group meetings and three individual meetings in total.

As a student you are expected to be prepared for the meetings by having worked on the material that was agreed upon. If you have questions you are expected to have prepared the questions and be prepared to explain what you have done yourself to solve the problems. In particular, if you have questions regarding R code, you are expected to have prepared a minimal, reproducible example (specifics for R examples).
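As an illustration of what a minimal, reproducible example could look like, here is a small sketch using only base R and simulated data. The variable names and the model are hypothetical and chosen only for illustration; the point is that the example is small, self-contained and gives the same result every time it is run.

```r
# Minimal, reproducible example (hypothetical problem, simulated data).
set.seed(1)                        # fix the seed so others get the same numbers

# A small self-contained data set instead of the full project data
d <- data.frame(x = rnorm(20))
d$y <- 1 + 2 * d$x + rnorm(20)

# The smallest piece of code that shows the behaviour in question
fit <- lm(y ~ x, data = d)
print(coef(fit))

# Report the R version and the loaded packages
print(sessionInfo())
```

Anyone can paste such an example into a fresh R session and see the same output, which makes it much easier to discuss at a meeting.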

As a supervisor I will be prepared to help you with technical questions as well as more general questions that you may have prepared. For the group meetings we can discuss general background knowledge, and we can also discuss ad hoc exercises if that is relevant. For the individual meetings you are welcome to send questions or samples of text for me to read and provide feedback on before the meeting. Note that I will generally not be able to find bugs in your R code.

More information

Responsibilities and project contract

Studieordning, see Bilag 1.

Studieordning, matematik, see Bilag 3 for the formal thesis objectives.

Prerequisites

The formal prerequisites are Measure Theory and the Statistics 1 and 2 courses, but you are also expected to be interested in the following:

carrying out data analysis and model validation on real data
implementing models and/or data analyses (e.g. by writing R scripts)
learning to use new software packages and functions
independently reading up on the background theory of the project
writing a project that reflects theory as well as applications

Overall objective

The overall objective of all three projects is to train you in working on your own with a larger data modeling problem. This includes narrowing down the statistical theory that you want to focus on and the corresponding analysis of data.

General advice

You are encouraged to use R Markdown (and perhaps also Tidyverse as described in R4DS) to organize data analysis, simulations and other practical computations. But you should not hand in the raw result of such a document. That document should serve as a log of your activities and help you carry out reproducible analysis. The final report should be written as an independent document. Guidance on how to write the report will be provided later.

R4DS: R for Data Science by Garrett Grolemund and Hadley Wickham

RMD: R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, Garrett Grolemund

Advice on writing a report. Please note that the linked document was prepared specifically for the 7.5 ECTS project course “Project in Statistics”, and it has a focus on writing an applied project. The advice on using the IMRaD structure does not apply directly to the bachelor’s thesis that you are writing, which, in particular, should contain a theory section/chapter. But most of the general advice applies.

Projects

Predicting company default

Data for this project comes from the BAC 2017 case competition and was collected by the team of students from MATH, who won the competition.

Winning report

The main objective of this project is to build a predictive model of whether a company will default in the future given data on the company. A simple model can be a linear model as known from Statistics 1 and 2. One alternative is a logistic regression model, but it’s also possible to explore some models that are not in the standard curriculum of other statistics courses, such as random forests.
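To make the logistic regression alternative concrete, here is a minimal sketch in base R on simulated data. The covariate names `equity_ratio` and `log_assets`, the coefficients and the train/test split are assumptions made for illustration; they are not taken from the project data.

```r
# Sketch: logistic regression for a binary default indicator on simulated data.
# `equity_ratio` and `log_assets` are hypothetical covariates, not project columns.
set.seed(1)
n <- 1000
equity_ratio <- rnorm(n)
log_assets <- rnorm(n)

# Simulate defaults from a known logistic model
p <- plogis(-2 - 1.0 * equity_ratio - 0.5 * log_assets)
default <- rbinom(n, size = 1, prob = p)
d <- data.frame(default, equity_ratio, log_assets)

# Split into training data and held-out data
train_data <- d[1:700, ]
test_data <- d[701:1000, ]

fit <- glm(default ~ equity_ratio + log_assets,
           family = binomial, data = train_data)

# Predicted default probabilities on the held-out data
p_hat <- predict(fit, newdata = test_data, type = "response")

# Brier score: mean squared difference between outcome and predicted probability
brier <- mean((test_data$default - p_hat)^2)
print(brier)
```

Evaluating predictions on data that was not used for fitting, as in the last lines, is one simple way to compare different models on equal terms.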

IMPORTANT NOTE ABOUT THE PROJECT IN THE FALL. Many students wrote this project in the spring semester. It’s possible to write the bachelor’s thesis using this same data set in the fall of 2019, but there is one important constraint. The focus of the project must be on using Bayesian Additive Regression Trees, aka BART. This does not mean that you should not consider other methods and models at all, but a substantial part of the thesis should deal with BART.

Data

A data set for this project is available as a csv file:

DefaultDKTotal.csv

Once you have downloaded the file (right click on the filename and choose download) you can read it into R using e.g. the readr package (part of Tidyverse).

readr::read_csv("DefaultDKTotal.csv")

It’s a fairly large data set with 636,855 rows and 76 columns. Each row represents a company in one fiscal year. The columns represent general information about the company, values extracted from the annual report, some data related to the geographical location of the company, and some key figures computed from the annual report. Many columns have self-explanatory names. The column default is binary and encodes whether the company defaults during the period that the data collection covers (one means default). The variable Normal_slut gives the actual default date.

To begin with you may ignore that the same company is represented in several rows corresponding to different fiscal years, or better yet, you may restrict attention to only a subset of companies from a single year. Later in the project it’s quite important that you take into account the fact that data spans a period of several years.
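Restricting attention to a single year could look roughly like the following sketch. It uses a toy data frame, and the column names `company_id` and `year` are hypothetical; check the actual column names in the data set before adapting it.

```r
# Toy stand-in for the project data; `company_id` and `year` are hypothetical
# column names used only for illustration.
toy <- data.frame(
  company_id = c(1, 1, 2, 2, 3),
  year       = c(2014, 2015, 2014, 2015, 2015),
  default    = c(0, 0, 0, 1, 0)
)

# Keep the rows from a single fiscal year
one_year <- subset(toy, year == 2015)

# Check that each company now appears at most once
stopifnot(!any(duplicated(one_year$company_id)))
print(one_year)
```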

For using BART with R you should consider bartMachine.

Literature

ISL: An Introduction to Statistical Learning with R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

caret: The caret Package by Max Kuhn

ESL: The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman.

BART: BART: Bayesian additive regression trees by Hugh A. Chipman, Edward I. George, and Robert E. McCulloch.

XBART: XBART: Accelerated Bayesian Additive Regression Trees

Start with ISL. It contains a lot of material on how to actually work with the methods in R and several good models and methods. For a systematic way to use and evaluate predictive models in R you can consider using the caret R package. For understanding BART you need to read the BART paper. The XBART paper is a recent computational improvement that may be worth considering.

More in-depth theoretical treatments can be found in ESL, which is a modern classic.

Water leakage detection

This project is on detection of a water leakage from flow data. If you choose this project you will get to collaborate with a small Danish start-up, Sense Analytics, that will provide the data for the project.

The company has written up a short project description describing the data and the problem.

The project falls under the heading of time series analysis as the data naturally forms a time series. It can be regarded as a change-point problem, where the objective is to find (as quickly as possible in real time) the time point when the leakage begins. Historic data will be used to build models that can account for time trends, periodic variation (seasonal, weekly, daily) etc. Detecting a change-point is largely a matter of detecting when data deviates sufficiently from a forecast (the expected value), so forecasting will be central to this project.
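The idea of detecting a change-point as a deviation from a forecast can be sketched as follows in base R. Everything here is an assumption made for illustration: the data is simulated, the forecast is a simple hour-of-day mean model, and the detector is a one-sided CUSUM with hand-picked tuning constants `k` and `h`. It is not the method used by Sense Analytics.

```r
# Sketch: detect a level shift by monitoring deviations from a forecast.
# Simulated hourly flow with a daily cycle; a "leak" adds a constant from time 400.
set.seed(1)
n <- 600
time_idx <- 1:n
hour <- factor((time_idx - 1) %% 24)
flow <- 10 + 3 * sin(2 * pi * time_idx / 24) + rnorm(n, sd = 0.5)
flow[400:n] <- flow[400:n] + 1.5

# Fit an hour-of-day mean model on historic, leak-free data (first 300 points)
hist_idx <- 1:300
hist_data <- data.frame(flow = flow[hist_idx], hour = hour[hist_idx])
fit <- lm(flow ~ hour, data = hist_data)
sigma <- summary(fit)$sigma

# Standardized forecast errors for the monitoring period
mon_idx <- 301:n
pred <- predict(fit, newdata = data.frame(hour = hour[mon_idx]))
z <- (flow[mon_idx] - pred) / sigma

# One-sided CUSUM: accumulate positive deviations beyond a slack k and
# raise an alarm when the sum exceeds the threshold h (both are tuning choices)
k <- 0.5
h <- 8
s <- numeric(length(z))
for (i in seq_along(z)) {
  prev <- if (i == 1) 0 else s[i - 1]
  s[i] <- max(0, prev + z[i] - k)
}

# First time point at which the alarm is raised (NA if it never is)
alarm <- mon_idx[which(s > h)[1]]
print(alarm)
```

With these settings the alarm is expected shortly after the simulated leak starts at time 400. The choice of `k` and `h` trades off detection delay against the rate of false alarms, and a better forecast model gives smaller forecast errors and therefore faster detection.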

In the project formulation it is said that conditionally on available covariates it is expected that demands at different time points are independent. This is a first hypothesis that can and should be investigated using data.
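One simple way to start investigating this hypothesis is to fit a model on the covariates and look for serial correlation in the residuals. The sketch below uses simulated data in base R; the covariate `temp` and the AR(1) noise are assumptions made for illustration, chosen so that the independence hypothesis is false by construction.

```r
# Sketch: check for remaining serial dependence after adjusting for covariates.
# Simulated data; `temp` is a hypothetical covariate.
set.seed(1)
n <- 500
temp <- rnorm(n)

# AR(1) noise, so demands are dependent given the covariate by construction
eps <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))
demand <- 5 + 2 * temp + eps

fit <- lm(demand ~ temp)
res <- residuals(fit)

# Sample autocorrelations of the residuals (plot = FALSE returns the values)
r <- acf(res, lag.max = 10, plot = FALSE)
print(round(r$acf[2:4], 2))        # autocorrelations at lags 1, 2 and 3

# Ljung-Box test of no autocorrelation up to lag 10
lb <- Box.test(res, lag = 10, type = "Ljung-Box")
print(lb$p.value)
```

Note that absence of autocorrelation is only a necessary condition for independence, so a test like this can reject the hypothesis but cannot confirm it.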

Data

The data can only be obtained by individual agreement with Sense Analytics.

Literature

FPP: Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos

cpt: changepoint: An R Package for Changepoint Analysis by Rebecca Killick and Idris A. Eckley, and the corresponding R package changepoint.

FPP is a very good introduction to forecasting and time series analysis in general.

Causal effect estimation

This project is on estimation of causal effects from observational data, that is, data collected without carrying out a randomized allocation of “treatments” to individuals. This is a major challenge of contemporary statistics, and it’s conceptually completely different from what you have learned in previous courses.

The main purpose of this project is to work with the rigorous framework of causal models and causal inference. This can be seen as an extension of linear models as taught in Statistics 1 and 2. It’s not so much a technical extension but rather a conceptual extension that clarifies how we actually want to use linear models (and other methods) in practice to estimate causal effects.

The final report should include both a theoretical part and a practical data analysis using the Vitamin D data below. For the practical part, the focus is on the causal effect of BMI on vitamin D level. Several causal models could be considered and several different methods for estimation of a causal effect could be used. You need to make a selection and present the relevant theory. Simulations using model examples derived from the data should be considered and used to investigate different methods.
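As an indication of what such a simulation could look like, here is a minimal sketch in base R. The variable names are borrowed from the Vitamin D data, but the data is simulated and all coefficients are made up. The sketch assumes a correctly specified linear model with `age` as the only confounder, which is an assumption for illustration and not a claim about the real data.

```r
# Sketch: confounding in a simulated linear model.
# `age` affects both `bmi` and `vitd`; all coefficients are made up.
set.seed(1)
n <- 5000
age <- rnorm(n, mean = 40, sd = 10)
bmi <- 20 + 0.1 * age + rnorm(n, sd = 2)
vitd <- 60 - 1.0 * bmi + 0.5 * age + rnorm(n, sd = 5)  # true effect of bmi: -1

# Unadjusted regression: biased because age affects both bmi and vitd
naive <- unname(coef(lm(vitd ~ bmi))["bmi"])

# Regression adjustment for the confounder
adjusted <- unname(coef(lm(vitd ~ bmi + age))["bmi"])

print(c(naive = naive, adjusted = adjusted))
```

In this setup the adjusted estimate should be close to the true effect of -1, while the unadjusted estimate is pulled towards zero by the confounding through age. Simulations of this kind make it possible to compare estimation methods in situations where the true causal effect is known.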

Data

Vitamin D data

The data comes from a study on vitamin D status in four European countries, conducted by Rikke Andersen, Fødevaredirektoratet, Denmark. The data contains the following variables:

Variable	Description
age	age of the individual
bmi	body mass index
country	country of residence (1: Denmark, 2: Finland, 4: Ireland, 6: Poland)
category	1: girls, 2: women
vitd	the level of vitamin D, 25-hydroxy-vitamin D (25OHD) in serum
sunexp	sun exposure (1: avoid sun, 2: sometimes, 3: prefer sun)
vitdintake	vitamin D intake

Literature

CI: Causal Inference by Hernán MA and Robins JM

ECI: Explanation in Causal Inference: Methods for Mediation and Interaction

FB: A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook

CI will be the main textbook for this project, in particular Part II of the book. Note that R code is available for the examples. FB is an interesting recent paper that attempts to benchmark methods for causal inference from observational data against randomized controlled trials in a marketing setting. ECI is interesting further reading on mediation and interaction.

Bachelor’s thesis in statistics

Niels Richard Hansen

August 16, 2019
