This document outlines four proposed thesis projects for the bachelor’s degree in mathematics or mathematics-economy at the University of Copenhagen.

Formalities

The thesis is written during block 3 and block 4, 2019. The start date is February 4 and the thesis is handed in on June 7. There is a subsequent oral defense.

Regarding the contract

Use the project descriptions below and take a look at the suggested literature to come up with a proposed title and description. I will then read and comment on your proposal and approve it afterwards. I suggest that you email me a draft of the contract, and when I have commented on it, you can drop by my office for a signature. Here follows some information that needs to go into the contract.

The meeting frequency will be once every second week for two hours during block 3 and once every second week for one hour during block 4. The block 3 meetings will be in groups. The block 4 meetings will be individual meetings by default. There will be four group meetings and three individual meetings in total. The first group meeting will be in week 1 or 2 of block 3, and the last individual meeting will be in week 5 of block 4.

As a student you are expected to prepare for the meetings by working on the material that was agreed upon. If you have questions, prepare them in advance and be ready to explain what you have already done yourself to solve the problems. In particular, if you have questions regarding R code, you are expected to have prepared a minimal, reproducible example (see the specific advice for R examples).
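As a rough illustration, a minimal, reproducible example contains a small inline data set and just the few lines of code that the question is about, so that I can run it directly. Everything in the sketch below (data, variables, model) is hypothetical:

## Small inline data set that reproduces the problem
dat <- data.frame(
  y = c(0, 1, 1, 0, 1),
  x = c(1.2, 3.4, 2.2, 0.5, 4.1),
  ratio = c(0.1, NA, 0.3, 0.2, 0.4)
)

## The call the question is about, together with a short note on
## what you expected and what you got instead
glm(y ~ x + ratio, family = binomial, data = dat)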

As a supervisor I will be prepared to help you with technical questions as well as more general questions that you may have prepared. For the group meetings we can discuss general background knowledge, and we can also discuss ad hoc exercises if that is relevant. For the individual meetings you are welcome to send questions or samples of text for me to read and provide feedback on before the meeting. Note that I will generally not be able to find bugs in your R code.

More information

Responsibilities and project contract

Studieordning (programme curriculum), see Bilag 1 (Appendix 1).

Studieordning for matematik (mathematics curriculum), see Bilag 3 (Appendix 3) for the formal thesis objectives.

Prerequisites

The formal prerequisites are Measure Theory and the Statistics 1 and 2 courses, but you are also expected to be interested in the following:

Overall objective

The overall objective of all four projects is to train you in working on your own with a larger data modeling problem. This includes narrowing down the statistical theory that you want to focus on and the corresponding analysis of data.

General advice

You are encouraged to use R Markdown (and perhaps also Tidyverse as described in R4DS) to organize data analysis, simulations and other practical computations. But you should not hand in the raw result of such a document. That document should serve as a log of your activities and help you carry out reproducible analysis. The final report should be written as an independent document. Guidance on how to write the report will be provided later.

R4DS: R for Data Science by Garrett Grolemund and Hadley Wickham

RMD: R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, Garrett Grolemund

Advice on writing a report. Please note that the linked document was prepared specifically for the 7.5 ECTS project course “Project in Statistics”, and it focuses on writing an applied project. The advice on using the IMRaD structure does not apply directly to the bachelor’s thesis that you are writing, which, in particular, should contain a theory section/chapter. But most of the general advice applies.

Projects

Predicting company default

Data for this project comes from the BAC 2017 case competition and was collected by the team of students from MATH, who won the competition.

Winning report

The main objective of this project is to build a predictive model of whether a company will default in the future given data on the company. A simple model could be a linear model as known from Statistics 1 and 2. One alternative is a logistic regression model, but it’s also possible to explore models that are not in the standard curriculum of the statistics courses, such as

  • linear discriminant analysis
  • trees and random forests
  • support vector machines

It’s possible to dive deep into the theory of one of these methods and corresponding algorithms, but it’s also possible to focus on building a good prediction model using any method available. In addition, it’s possible to go deeper into the question of how the performance of such a predictive model is evaluated.

Data

A data set for this project is available as a csv file:

DefaultDKTotal.csv

Once you have downloaded the file (right click on the filename and choose download) you can read it into R using e.g. the readr package (part of Tidyverse).

readr::read_csv("DefaultDKTotal.csv")

It’s a fairly large data set with 636,855 rows and 76 columns. Each row represents a company in one fiscal year. The columns contain general information about the company, values extracted from the annual report, some data related to the geographical location of the company, and some key figures computed from the annual report. Many columns have self-explanatory names. The column default is binary and encodes whether the company defaults during the period that the data collection covers (one means default). The variable Normal_slut gives the actual default date.

To begin with you may ignore that the same company is represented in several rows corresponding to different fiscal years, or better yet, you may restrict attention to a subset of companies from a single year. Later in the project it’s quite important that you take into account the fact that the data spans a period of several years.
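As a sketch of how you could get started in R, the following reads the data, restricts it to a single fiscal year and fits a logistic regression. The column default is in the data, but year, solvency and roa are hypothetical placeholders that you must replace with actual column names:

library(tidyverse)

default_data <- read_csv("DefaultDKTotal.csv")

## Restrict attention to a single fiscal year to begin with
## (replace `year` with the actual name of the fiscal year column)
default_2015 <- filter(default_data, year == 2015)

## A simple logistic regression of default on two (hypothetical) key figures
logit_model <- glm(default ~ solvency + roa, family = binomial, data = default_2015)
summary(logit_model)

## Predicted default probabilities
head(predict(logit_model, type = "response"))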

Literature

ISL: An Introduction to Statistical Learning with R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

caret: The caret Package by Max Kuhn

ESL: The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman.

Start with ISL. It covers several good models and methods and contains a lot of material on how to actually work with them in R. For a systematic way to use and evaluate predictive models in R you can consider the caret R package.

More in-depth theoretical treatments can be found in ESL, which is a modern classic.
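If you want to try caret, a minimal sketch of cross-validating the logistic regression from the sketch above could look as follows (the covariates are still hypothetical placeholders, and you should consult the caret documentation for details):

library(caret)

## caret expects a factor outcome for classification
train_data <- transform(default_2015, default = factor(default, labels = c("no", "yes")))

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit <- train(default ~ solvency + roa, data = train_data,
             method = "glm", metric = "ROC", trControl = ctrl)
fit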

Matrix completion and recommender systems

The purpose of a recommender system is to recommend to an individual one or more items that they might be interested in, based on information about that individual. It’s possible to build such recommender systems using features of the individuals as well as the items. That is, if John is looking for a vacation and he informs us that he likes sailing, that he likes Danish islands, and that he doesn’t like flying, we can recommend that he takes the ferry to Bornholm for vacation. For a feature-based recommendation to work we need feature information on items as well as individuals so that we can match up the individual with items of interest.

An alternative to feature-based recommendations is known as collaborative filtering (here “filtering” means prediction of unobserved variables). Instead of explicit feature knowledge we need to know a (small) set of item preferences for all individuals, and based on this we attempt to predict all item preferences for all individuals and then recommend items with a high predicted preference to any given individual. In collaborative filtering there are no explicit features; they are implicitly learned (collaboratively) from the preferences of many individuals.

Let the (numerical) preference for the \(i\)-th individual of the \(j\)-th item be denoted \(x_{ij}\). With \(m\) individuals and \(n\) items we can regard the entire collection of preferences as an \(m \times n\) matrix \[X = (x_{ij})_{i,j}.\] We will only have observations of a very small subset of the \(x_{ij}\)-entries in this matrix, and the objective is to fill in all the unobserved entries. For this reason we often call the problem a matrix completion problem.
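As a small illustration in R, the observed preferences can be arranged as such a matrix with NA for the unobserved entries; in practice the vast majority of entries will be NA:

## Toy example: 4 individuals, 5 items, only a few observed preferences
X <- matrix(NA_real_, nrow = 4, ncol = 5,
            dimnames = list(paste0("user", 1:4), paste0("item", 1:5)))
X["user1", "item2"] <- 5
X["user1", "item4"] <- 1
X["user2", "item1"] <- 3
X["user3", "item2"] <- 4
X["user4", "item5"] <- 2
X

## The matrix completion problem is to predict the NA entries from the observed ones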

The matrix completion problem has been intensively studied, not least due to the Netflix Prize, a competition launched by Netflix in 2006 on improving their movie recommender system based on data on the users’ movie preferences.

This project will deal with collaborative filtering and matrix completion methods using preference data from MovieLens. It’s possible to focus on different things in the project. For instance,

  • building a high performance recommender system on the MovieLens data
  • diving into the theory and algorithms behind one or more of the matrix completion methods
  • working with performance assessment and benchmarking of algorithms and implementations.

Literature

SMRS: Statistical Methods for Recommender Systems

RLIB: R libraries for recommender systems

MCSVD: Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares by Trevor Hastie, Rahul Mazumder, Jason D. Lee, Reza Zadeh

LIBMF: LIBMF: A Matrix-factorization Library for Recommender Systems by the Machine Learning Group at National Taiwan University (note that you can use this library from R via the recosystem package, see RLIB)

SMRS can be accessed when you are on campus. It’s a general introduction to recommender systems. Consult RLIB for relevant R packages; in particular, recommenderlab may be relevant for general methods, while recosystem (an R wrapper of LIBMF) is useful for exploring matrix factorization based methods. The LIBMF page and the MCSVD paper may be consulted for more technical details.
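As a minimal sketch of how a matrix factorization model can be fitted with recosystem, the following uses a few toy (user, item, rating) triplets; for the project these would come from the MovieLens data, and you should consult the package documentation for the tuning options:

library(recosystem)

## Toy (user, item, rating) triplets
ratings <- data.frame(user = c(1, 1, 2, 3, 4),
                      item = c(2, 4, 1, 2, 5),
                      rating = c(5, 1, 3, 4, 2))

train_set <- data_memory(user_index = ratings$user,
                         item_index = ratings$item,
                         rating = ratings$rating,
                         index1 = TRUE)

rec <- Reco()
rec$train(train_set, opts = list(dim = 2, niter = 20, verbose = FALSE))

## Predict the preference of user 1 for item 5
test_set <- data_memory(user_index = 1, item_index = 5, index1 = TRUE)
rec$predict(test_set, out_memory())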

Topic modeling of parliament documents

The data consists of document metadata available via open data access from the Danish Parliament. The purpose of this project is to build a topic model of the document titles.

This project is on the modeling of natural language data at the word occurrence level. That is, we regard words in a document title as sampled independently from a probabilistic model. This will not produce meaningful sentences, but it’s surprising how much information there is about the document in the choice of the words used. In particular, the word usage can be used to cluster documents in terms of topics.
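To make the word occurrence viewpoint concrete, the following small sketch reduces a few hypothetical titles to their word counts using the tidytext package (as in TMwR):

library(tibble)
library(dplyr)
library(tidytext)

## A few hypothetical document titles
titles <- tibble(doc = 1:3,
                 title = c("Forslag til lov om skat",
                           "Spørgsmål om miljø og klima",
                           "Forslag til lov om klima"))

## Under the word occurrence model each title is reduced to its word counts
titles %>%
  unnest_tokens(word, title) %>%
  count(doc, word)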

To the best of my knowledge the data has not been analyzed before, so there are plenty of novel observations that can be made from rather basic descriptive explorations of the data, in particular because the documents come with various additional metadata (time stamps, type and category ids, etc.) that can be related to the characteristics of the document titles.

The main purpose of the project is, however, to build a topic model using the Latent Dirichlet Allocation model. There are different possibilities for the project such as

  • using standard implementations available via R packages, focusing on the modeling and analysis of the data, and relating the results to the other available metadata
  • focusing on the theory behind LDA and the algorithms used for practical computations
  • diving into variational inference as an approximation technique used in LDA and comparing it to exact computations and other approximations.

Data

A data set for this project is available as a csv file:

folketing_open_data_documents.csv

Once you have downloaded the file (right click on the filename and choose download) you can read it into R using e.g. the readr package (part of Tidyverse).

readr::read_csv("folketing_open_data_documents.csv")

The data was extracted from Folketingets Åbne Data using the httr package in R. The data table contains 132,317 rows, each representing a document. It has four columns of different id numbers, a title column and a date column. The title column is the most important one for the project, as this is the text data you will model.
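A sketch of how the titles can be turned into a document-term matrix and used to fit an LDA model with the topicmodels package is given below. The column names id and titel are hypothetical placeholders for a document id column and the title column; replace them with the actual names in the csv file:

library(tidyverse)
library(tidytext)
library(topicmodels)

documents <- read_csv("folketing_open_data_documents.csv")

## Word counts per document (replace `id` and `titel` with actual column names)
dtm <- documents %>%
  unnest_tokens(word, titel) %>%
  count(id, word) %>%
  cast_dtm(document = id, term = word, value = n)

## Fit an LDA model with, say, 10 topics
lda_fit <- LDA(dtm, k = 10, control = list(seed = 1234))

## The most probable words within each topic
terms(lda_fit, 5)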

There are more documents in the database, and it’s also possible to extract additional data on e.g. what the numerical id columns represent, or even to link the document ids to individuals and their roles.

Literature

TMwR: Text Mining with R: A Tidy Approach by Julia Silge and David Robinson

LDA: Latent Dirichlet Allocation by Blei, Ng and Jordan

PRML: Pattern Recognition and Machine Learning by Christopher M. Bishop

TMwR is a relatively accessible introduction to text analysis in R, including a chapter on topic modeling. LDA is the original paper on the Latent Dirichlet Allocation model and should be quite accessible, though you may need to do some background reading. I recommend PRML in general; Chapters 10 and 11 are specifically relevant.

Causal effect estimation

This project is on estimation of causal effects from observational data. That is, data collected without carrying out a randomized allocation of “treatments” to individuals. This is a major challenge of contemporary statistics, and it’s conceptually completely different from what you have learned in previous courses.

One data set is provided where effects are actually simulated (though covariates are real data). This will make it possible to investigate properties of different methods as we know the ground truth. Another data set on survival times is also provided. Prior experience with survival analysis is recommended for considering this data set.

The main purpose of this project is to work with the rigorous framework of causal models and causal inference. This can be seen as an extension of linear models as taught in Statistics 1 and 2. It’s not so much a technical extension as a conceptual extension that clarifies how we actually want to use linear models (and other methods) in practice to estimate causal effects.
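As a small conceptual sketch in R, the following simulates a confounded data set and contrasts the naive difference in means between treated and untreated with an estimate that adjusts for the confounder via a linear model (a simple version of standardization as treated in CI, Part II):

set.seed(1)
n <- 10000

## A confounder affecting both treatment assignment and outcome
x <- rnorm(n)
z <- rbinom(n, 1, plogis(x))        # treatment depends on x
y <- 1 + 2 * z + 3 * x + rnorm(n)   # true causal effect of z is 2

## The naive comparison of treated and untreated is biased
mean(y[z == 1]) - mean(y[z == 0])

## Standardization: fit an outcome model and average predictions under z = 1 and z = 0
fit <- lm(y ~ z + x)
mean(predict(fit, newdata = data.frame(z = 1, x = x)) -
     predict(fit, newdata = data.frame(z = 0, x = x)))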

Data

The data set using simulated responses is available as two csv files:

causal.csv
counterfactuals.csv

The causal.csv file has 50,000 rows and 180 columns. Disregarding the first column, which is a sample id, there are 177 covariates, one column named z representing the binary treatment and one column named y representing the continuous response. As mentioned, the covariates are actually real data from the Linked Births and Infant Deaths Database. If you want to find out what the column names mean, you can read the user guide. The meaning of the variables is not linked to the simulated responses, and understanding the meaning might not be particularly important for the project. But it could help you understand the correlation patterns between the covariates.

The other data set for survival analysis is also available as a csv file:

VitaminD.csv

This data set contains the columns age (age at baseline), filaggrin (binary indicator of whether the subject has mutations in the filaggrin gene), vitd (vitamin D level at baseline, measured as serum 25-OH-D (nmol/L)), time (follow-up time) and death (indicator of whether the subject died during follow-up).

Once you have downloaded the file you need (right click on the filename and choose download) you can read it into R using e.g. the readr package (part of Tidyverse).

readr::read_csv("causal.csv") ## Or
readr::read_csv("VitaminD.csv")  

Literature

CI: Causal Inference by Hernán MA and Robins JM

ECI: Explanation in Causal Inference: Methods for Mediation and Interaction

BCIA: Benchmarking Framework for Performance-Evaluation of Causal Inference Analysis

FB: A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook

IVCOX: Instrumental variables estimation under a structural Cox model

CI will be the main textbook for this project, in particular Part II of the book. Note that R code is available for the examples. BCIA describes a setup for assessing performance of statistical methods for causal inference, and we will return to this during the project, though it may not be central to the project in the beginning. FB is an interesting recent paper that attempts to benchmark methods for causal inference from observational data against randomized controlled trials in a marketing setting. ECI is an interesting further reading on mediation and interaction. IVCOX analyzes the survival data and is only relevant for the project using this data set.