This document outlines three thesis projects for the bachelor’s degree in mathematics or mathematics-economics at the University of Copenhagen.

There will be an info meeting on Monday, January 13, 13.15-15.00, in Lille UP1

Project proposals from previous years are available for spring 2019 and fall 2019.

Formalities

The thesis is written during block 3 and block 4, 2020. The start date is February 3 and the thesis is handed in on June 5. There is a subsequent oral defense.

Regarding the contract

Use the project descriptions below, together with the suggested literature, to come up with a proposed title and description. I will then read and comment on your proposal and subsequently approve it. I suggest that you email me a draft of the contract; when I have commented on it, you can drop by my office for a signature. The following information needs to go into the contract.

The meeting frequency will be once every second week for two hours during block 3 and once every second week for 45 min. during block 4. The block 3 meetings will be in groups. The block 4 meetings will be individual meetings by default. There will be four group meetings and three individual meetings in total.

As a student you are expected to come to the meetings prepared, having worked on the material that was agreed upon. If you have questions, you are expected to have prepared them and to be able to explain what you have done yourself to solve the problems. In particular, if you have questions regarding R code, you are expected to have prepared a minimal, reproducible example (see the specifics for R examples).

As a supervisor I will be prepared to help you with technical questions as well as more general questions that you may have prepared. For the group meetings we can discuss general background knowledge, and we can also discuss ad hoc exercises if that is relevant. For the individual meetings you are welcome to send questions or samples of text for me to read and provide feedback on before the meeting. Note that I will generally not be able to find bugs in your R code.

More information

Responsibilities and project contract

Studieordning (study regulations), see Bilag 1 (Appendix 1).

Studieordning, matematik (study regulations, mathematics), see Bilag 3 (Appendix 3) for the formal thesis objectives.

Prerequisites

The formal prerequisites are the courses Measure Theory and Mathematical Statistics (or Statistics 1 and 2), but you are also expected to take an interest in the topic of the project you choose.

Overall objective

The overall objective of all three projects is to train you in working on your own with a larger data modeling problem. This includes narrowing down the statistical theory that you want to focus on and the corresponding analysis of data.

General advice

You are encouraged to use R Markdown (and perhaps also the tidyverse, as described in R4DS) to organize data analyses, simulations and other practical computations. However, you should not hand in the raw output of such a document; it should serve as a log of your activities and help you carry out reproducible analyses. The final report should be written as an independent document. Guidance on how to write the report will be provided later.

R4DS: R for Data Science by Garrett Grolemund and Hadley Wickham

RMD: R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, Garrett Grolemund

Advice on writing a report. Please note that the linked document was prepared specifically for the 7.5 ECTS project course “Project in Statistics”, and it focuses on writing an applied project. The advice on using the IMRaD structure does not apply directly to the bachelor’s thesis you are writing, which, in particular, should contain a theory section or chapter. But most of the general advice applies.

Projects

Predicting sales of local products

This project runs in collaboration with the company Råhandel, which operates an online marketplace connecting small food producers with business purchasers such as restaurants, supermarkets and shops. The company will provide data and help shape some important questions for the project.

One main objective is to build a predictive model of whether and where a product will sell well, based on the product type and specifications such as whether it is organic. Standard predictive models are the linear models known from Mathematical Statistics (Statistics 1 and 2) or the logistic regression models treated in Regression, but it is also possible to explore predictive models outside the standard statistics curriculum, such as trees and random forests or generalized additive models.
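To make the logistic regression option concrete, here is a minimal sketch of fitting a one-predictor logistic regression by gradient descent. It is written in plain Python only to keep it self-contained (in the project itself you would use R, e.g. glm), and the toy data and learning-rate settings are invented for illustration.

```python
import math

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit a one-predictor logistic regression by gradient descent.
    Model: P(y = 1 | x) = 1 / (1 + exp(-(a + b*x)))."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += (p - y) / n      # gradient of mean log-loss w.r.t. a
            gb += (p - y) * x / n  # gradient of mean log-loss w.r.t. b
        a -= lr * ga
        b -= lr * gb
    return a, b

def predict(a, b, x):
    """Predicted probability that y = 1 given x."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# Invented toy data: larger x makes y = 1 more likely.
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
a, b = fit_logistic(xs, ys)
```

The point of the sketch is only the mechanics of the model; in practice you would use a fitted model like this to rank products or markets by predicted probability of selling well.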

A first idea is to build a predictive model for each individual product, treating the number of sold units as the target variable. However, there are hundreds of products: it might not be easy to aggregate such models, nor will it be easy to capture the differences in the distribution of sold units across all those products by treating product as a predictor. A different idea is therefore to first construct a simpler target, for instance an indicator we can call “large order”. Constructing such a target requires some initial exploratory data analysis: investigating the distribution of orders, their sizes and their development over time for the different products. It may turn out that a relevant target is simply the indicator of an order occurring at all (so “large order” means an order of one or more units).
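The construction of a “large order” target can be sketched as follows. The column layout and the threshold of five units are hypothetical; the actual threshold should come out of the exploratory analysis described above. Plain Python is used only to keep the sketch self-contained (in the project you would do this in R, e.g. with dplyr's group_by and summarise).

```python
from collections import defaultdict

# Hypothetical rows from the sales data: (order_number, product_id, units).
rows = [
    (1001, "apples", 2),
    (1001, "cheese", 1),
    (1002, "apples", 10),
    (1003, "bread", 1),
]

def large_order_targets(rows, threshold):
    """Aggregate row-level sales into per-order unit totals and
    binarize into a 'large order' indicator (1 if total >= threshold)."""
    totals = defaultdict(int)
    for order, _product, units in rows:
        totals[order] += units
    return {order: int(total >= threshold) for order, total in totals.items()}

targets = large_order_targets(rows, threshold=5)
```

With a threshold of one unit, the indicator reduces to “an order occurred”, matching the simplest target mentioned above.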

The predictive modeling approach outlined above is also sometimes called supervised learning. A different objective can be to understand the multivariate structure in the distribution of sold units of the different products. This will be an unsupervised learning problem. The project will then have a more exploratory nature, but it will be possible to later link some of the exploratory constructs to descriptor variables such as product type via supervised learning techniques.
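One classic unsupervised technique covered in ISL (Chapter 10) is principal component analysis, which summarizes the joint variation of sold units across products. As a minimal illustration of the idea, assuming just two hypothetical “products” whose sales move together, the leading principal direction can be found by power iteration on the covariance matrix; the data below are invented.

```python
def first_pc(data, iters=200):
    """Leading principal component of 2-D data via power iteration
    on the 2x2 sample covariance matrix."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Entries of the covariance matrix.
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        # Multiply by the covariance matrix and renormalize.
        w = (sxx * v[0] + sxy * v[1], sxy * v[0] + syy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

# Strongly correlated toy sales: the leading direction is close to (1, 1)/sqrt(2).
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
v = first_pc(data)
```

In the project, components like this could later be linked to descriptor variables such as product type, as suggested above.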

Data

Students will get access to the data via Google Sheets documents, which are automatically updated with new information through an API. All students choosing this project must sign a confidentiality agreement.

The data comprises sales data and product data. (Other data regarding producers and customers can be made available upon request.) In the sales data, each row represents a purchased product; if several products were purchased as part of the same order, the rows share an order number. In the product data, the columns should all have self-explanatory names, apart from sales area, which defines the areas in which the product can be bought. Solveig from Råhandel will attend a few group meetings during spring to answer questions and discuss the relevance of your suggested solutions.

Literature

ISL: An Introduction to Statistical Learning with R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

caret: The caret Package by Max Kuhn

ESL: The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman.

Start with ISL. It covers several good models and methods and contains a lot of material on how to actually work with them in R. For a systematic way to use and evaluate predictive models in R, consider the caret R package.

More in-depth theoretical treatments can be found in ESL, which is a modern classic.

If you want to do a project on unsupervised learning, focus on Chapter 10 in ISL and Chapter 14 in ESL.

Causal effect estimation

This project is on estimation of causal effects from observational data. That is, data collected without carrying out a randomized allocation of “treatments” to individuals. This is a major challenge of contemporary statistics, and it’s conceptually completely different from what you have learned in previous courses.

The main purpose of this project is to work with the rigorous framework of causal models and causal inference. This can be seen as an extension of linear models as taught in Mathematical Statistics. It is not so much a technical extension as a conceptual one, clarifying how we actually want to use linear models (and other methods) in practice to estimate causal effects.

The final report should include both a theoretical part and a practical data analysis using the ?? data below. Several causal models could be considered and several different methods for estimation of a causal effect could be used. You need to make a selection and present the relevant theory. Simulations using model examples derived from the data should be considered and used to investigate different methods.
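The kind of simulation suggested above can be kept very simple. The sketch below (plain Python for self-containedness; an invented model with a single binary confounder L) simulates observational data where treatment A and outcome Y share the confounder L, and compares the naive contrast E[Y | A=1] − E[Y | A=0] with the standardization estimator, sum over l of (E[Y | A=1, L=l] − E[Y | A=0, L=l]) P(L=l), as discussed in CI. The true causal effect of A is 1 by construction.

```python
import random

random.seed(1)

# Simulated observational data with one binary confounder L:
# L ~ Bernoulli(0.5); treatment A is more likely when L = 1;
# outcome Y = 1.0*A + 2.0*L + noise, so the true effect of A is 1.0.
n = 100_000
data = []
for _ in range(n):
    L = int(random.random() < 0.5)
    A = int(random.random() < (0.8 if L else 0.2))
    Y = 1.0 * A + 2.0 * L + random.gauss(0.0, 1.0)
    data.append((L, A, Y))

def mean(v):
    return sum(v) / len(v)

# Naive contrast: biased because L affects both A and Y.
naive = (mean([y for l, a, y in data if a == 1])
         - mean([y for l, a, y in data if a == 0]))

# Standardization: average the within-stratum contrasts over P(L = l).
adjusted = 0.0
for lv in (0, 1):
    d1 = mean([y for l, a, y in data if l == lv and a == 1])
    d0 = mean([y for l, a, y in data if l == lv and a == 0])
    pl = mean([1.0 if l == lv else 0.0 for l, _a, _y in data])
    adjusted += (d1 - d0) * pl
```

Here the naive estimate is biased upwards (it is roughly 2.2 in expectation for these invented parameters), while the standardized estimate recovers the true effect of 1.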

Data

Data for this project is expected to concern the causal effect of education on salary. It is a classical research question in economics and the social sciences to understand the effect of education on a variety of outcome measures such as salary.

Literature

CI: Causal Inference by Miguel A. Hernán and James M. Robins

CIS: Causal inference in statistics: An overview by Judea Pearl.

CIG: Causal Inference from Graphical Models by Steffen Lauritzen.

ECI: Explanation in Causal Inference: Methods for Mediation and Interaction by Tyler J. VanderWeele

FB: A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook

CI will be the main textbook for this project. In particular Part II of the book. Note that R code is available for the examples. CIS is a good supplement outlining Judea Pearl’s way of presenting the theory, and CIG is likewise a good supplement from Steffen Lauritzen’s perspective. FB is an interesting recent paper that attempts to benchmark methods for causal inference from observational data against randomized controlled trials in a marketing setting. ECI is interesting further reading on mediation and interaction.

Mortality forecasting

Mortality forecasting is a central problem in demography, and it is a very important tool for making prognoses about how population size and age distribution in a country, say, will evolve. Such forecasts are used to make policy decisions about pension and health care systems among other things.

We have historic data on mortality in Denmark (and many other countries), and mortality forecasting can be seen as an example of time series forecasting, which has numerous other applications. What is special about mortality forecasting compared to most standard time series forecasting problems is that it is an inherently multivariate problem; mortality depends on age, and the typical approach is to forecast mortality within age groups.

The project will focus on the by now widely used Lee-Carter model, on time series methods for forecasting the mortality index (or indices) of the Lee-Carter model, and on metrics for evaluating the forecasts.
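For orientation, the Lee-Carter model writes the log mortality rate as log m(x, t) = a_x + b_x k_t + e_{x,t}, where x indexes age groups and k_t is a period index that is typically forecast as a random walk with drift (see LC). A minimal sketch of that point forecast, with an invented index series (plain Python for self-containedness; in the project you would use R, e.g. the demography package):

```python
def rw_drift_forecast(k, horizon):
    """Point forecast of a fitted Lee-Carter period index k_t as a
    random walk with drift. The drift is estimated as the average
    observed increment, (k_T - k_1) / (T - 1), and the h-step-ahead
    forecast is k_T + h * drift."""
    drift = (k[-1] - k[0]) / (len(k) - 1)
    return [k[-1] + h * drift for h in range(1, horizon + 1)]

# Hypothetical declining mortality index over six periods.
k = [10.0, 8.5, 7.4, 6.1, 4.9, 3.6]
forecast = rw_drift_forecast(k, horizon=3)
```

Interval forecasts, and alternatives to the drift model, are discussed in LCM and COMP.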

The first part of this project will be run like a journal club, reading some papers on the Lee-Carter model.

Data and R package

Data for this project is to be obtained from The Human Mortality Database. You need an account (free) to download data.

The Methods Protocol from the database can be a very useful document to read.

The R package demography maintained by Rob J Hyndman will also be useful.

Literature

LC: Modeling and Forecasting U.S. Mortality by Ronald D. Lee and Lawrence R. Carter

LCM: Lee-Carter mortality forecasting: a multi-country comparison of variants and extensions by Heather Booth, Rob J. Hyndman, Leonie Tickle and Piet de Jong

COMP: Point and interval forecasts of mortality rates and life expectancy: A comparison of ten principal component methods by Han Lin Shang, Heather Booth and Rob J. Hyndman

SAINT: Modelling Adult Mortality in Small Populations: The Saint Model by Søren Fiig Jarner and Esben Masotti Kryger

FPP: Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos

PLB: A Poisson log-bilinear regression approach to the construction of projected lifetables by Natacha Brouhns, Michel Denuit and Jeroen K. Vermunt

IFM: The impact of the choice of life table statistics when forecasting mortality by Marie-Pier Bergeron-Boucher, Søren Kjærgaard, Jim Oeppen and James W. Vaupel

LC is the original paper by Lee and Carter. It should be quite readable. LCM and COMP are two examples of papers treating the Lee-Carter model and various extensions, together with methods for time series forecasting and evaluation of the mortality forecasts. SAINT describes the model currently used by ATP. FPP is a general reference on time series forecasting that may be useful.

The paper PLB was added later as an alternative way of fitting the Lee-Carter model, and IFM is a recent study by a group in Odense on the effect of choosing different targets for forecasting on the ultimate quality of the forecast.