This document outlines four proposed thesis projects for the bachelor’s degree in mathematics or mathematics-economy at the University of Copenhagen.
The thesis is written during block 3 and block 4, 2019. The start date is February 4 and the thesis is handed in on June 7. There is a subsequent oral defense.
Use the project descriptions below and take a look at the suggested literature to come up with a proposed title and description. I will then read and comment on your proposal and approve it afterwards. I suggest that you email me a draft of the contract, and when I have commented on it, you can drop by my office for a signature. Here follows some information that needs to go into the contract.
The meeting frequency will be once every second week for two hours during block 3 and once every second week for one hour during block 4. The block 3 meetings will be in groups. The block 4 meetings will be individual meetings by default. There will be four group meetings and three individual meetings in total. The first group meeting will be in week 1 or 2 of block 3, and the last individual meeting will be in week 5 of block 4.
As a student you are expected to be prepared for the meetings by having worked on the material that was agreed upon. If you have questions, you are expected to have prepared them and to be able to explain what you have done yourself to solve the problems. In particular, if you have questions regarding R code, you are expected to have prepared a minimal, reproducible example (see the specifics for R examples).
As a supervisor I will be prepared to help you with technical questions as well as more general questions that you may have prepared. For the group meetings we can discuss general background knowledge, and we can also discuss ad hoc exercises if that is relevant. For the individual meetings you are welcome to send questions or samples of text for me to read and provide feedback on before the meeting. Note that I will generally not be able to find bugs in your R code.
Responsibilities and project contract
Studieordning (study regulations), see Bilag 1.
Studieordning, matematik (study regulations for mathematics), see Bilag 3 for the formal thesis objectives.
The formal prerequisites are Measure Theory and the Statistics 1 and 2 courses, but you are also expected to be interested in the following:
The overall objective of all four projects is to train you in working on your own with a larger data modeling problem. This includes narrowing down the statistical theory that you want to focus on and the corresponding analysis of data.
You are encouraged to use R Markdown (and perhaps also Tidyverse as described in R4DS) to organize data analysis, simulations and other practical computations. But you should not hand in the raw result of such a document. That document should serve as a log of your activities and help you carry out reproducible analysis. The final report should be written as an independent document. Guidance on how to write the report will be provided later.
R4DS: R for Data Science by Garrett Grolemund and Hadley Wickham
RMD: R Markdown: The Definitive Guide by Yihui Xie, J. J. Allaire, Garrett Grolemund
Advice on writing a report. Please note that the linked document was prepared specifically for the 7.5 ECTS project course “Project in Statistics”, and it has a focus on writing an applied project. The advice on using the IMRaD structure does not apply directly to the bachelor’s thesis that you are writing, which, in particular, should contain a theory section/chapter. But most of the general advice applies.
Data for this project comes from the BAC 2017 case competition and was collected by the team of students from MATH, who won the competition.
The main objective of this project is to build a predictive model of whether a company will default in the future given data on the company. A simple model can be a linear model as known from Statistics 1 and 2. One alternative is a logistic regression model, but it’s also possible to explore some models that are not in the standard curriculum of other statistics courses such as
It’s possible to dive deep into the theory of one of these methods and corresponding algorithms, but it’s also possible to focus on building a good prediction model using any method available. In addition, it’s possible to go deeper into the question of how the performance of such a predictive model is evaluated.
A data set for this project is available as a csv file:
DefaultDKTotal.csv
Once you have downloaded the file (right click on the filename and choose download) you can read it into R using e.g. the readr package (part of Tidyverse).
readr::read_csv("DefaultDKTotal.csv")
It’s a fairly large data set with 636,855 rows and 76 columns. Each row represents a company in one fiscal year. The columns represent general information about the company, values extracted from the annual report, some data related to the geographical location of the company, and some key figures computed from the annual report. Many columns have self-explanatory names. The column default is binary and encodes whether the company defaults during the period that the data collection covers (one means default), and the variable Normal_slut gives the actual default date.
To begin with you may ignore that the same company is represented in several rows corresponding to different fiscal years, or, better yet, you may restrict attention to a subset of companies from a single year. Later in the project it’s quite important that you take into account that the data spans a period of several years.
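As a concrete starting point, here is a minimal sketch of reading the data and fitting a logistic regression model of default for companies from a single fiscal year. The column names fiscal_year, equity_ratio and roa are hypothetical placeholders used only for illustration; replace them with columns that actually appear in the data.

library(readr)
library(dplyr)

default_data <- read_csv("DefaultDKTotal.csv")

## Restrict attention to a single fiscal year so each company appears once.
## NB: 'fiscal_year', 'equity_ratio' and 'roa' are hypothetical column names.
default_2015 <- filter(default_data, fiscal_year == 2015)

## A simple logistic regression model of the default indicator
fit <- glm(default ~ equity_ratio + roa, data = default_2015, family = binomial)
summary(fit)

## Predicted default probabilities for the companies in the data
pred <- predict(fit, type = "response")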
ISL: An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
caret: The caret Package by Max Kuhn
ESL: The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman.
Start with ISL. It covers several relevant models and methods and contains a lot of material on how to actually work with them in R. For a systematic way to use and evaluate predictive models in R you can consider using the caret R package.
More in-depth theoretical treatments can be found in ESL, which is a modern classic.
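Continuing the sketch above, the following shows one way caret could be used to evaluate the logistic regression model by five-fold cross-validation. The covariate names are again hypothetical placeholders, and the response is assumed to be coded 0/1.

library(caret)

## caret expects a factor response with valid level names for classification
default_2015$default <- factor(default_2015$default, labels = c("no", "yes"))

## Five-fold cross-validation, evaluated by the area under the ROC curve
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
cv_fit <- train(default ~ equity_ratio + roa, data = default_2015,
                method = "glm", family = binomial,
                trControl = ctrl, metric = "ROC")
cv_fit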
The purpose of a recommender system is to recommend to an individual one or more items that he/she might be interested in, based on information about that individual. It’s possible to build such recommender systems using features of the individuals as well as the items. That is, if John is looking for a vacation and he informs us that he likes sailing, that he likes Danish islands, and that he doesn’t like flying, we can recommend that he takes the ferry to Bornholm for vacation. For a feature based recommendation to work we need feature information on items as well as individuals, so that we can match up the individual with items of interest.
An alternative to feature based recommendations is known as collaborative filtering (here “filtering” means prediction of unobserved variables). Instead of explicit feature knowledge we need to know a (small) set of item preferences for all individuals, and based on this we attempt to predict all item preferences for all individuals and then recommend items with a high predicted preference to any given individual. In collaborative filtering there are no explicit features; they are implicitly learned (collaboratively) from the preferences of many individuals.
Let the (numerical) preference of the \(i\)-th individual for the \(j\)-th item be denoted \(x_{ij}\). With \(m\) individuals and \(n\) items we can regard the entire collection of preferences as an \(m \times n\) matrix \[X = (x_{ij})_{i,j}.\] We will only have observations of a very small subset of the \(x_{ij}\)-entries in this matrix, and the objective is to fill in all the unobserved entries. For this reason the problem is often called a matrix completion problem.
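A common way to make the completion problem well posed, used for instance in the matrix factorization methods referenced in the literature below (MCSVD and LIBMF), is to assume that \(X\) is approximately of low rank. With \(\Omega\) denoting the set of observed entries, one then roughly solves \[\min_{U \in \mathbb{R}^{m \times k},\ V \in \mathbb{R}^{n \times k}} \sum_{(i,j) \in \Omega} (x_{ij} - u_i^T v_j)^2 + \lambda \left(\|U\|_F^2 + \|V\|_F^2\right),\] where \(u_i\) and \(v_j\) denote the rows of \(U\) and \(V\), \(k\) is the chosen rank, and \(\lambda \geq 0\) is a regularization parameter. The completed matrix is then \(\hat{X} = U V^T\).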
The matrix completion problem has been intensively studied, not least due to the Netflix Prize, which was a competition launched by Netflix in 2006 on improving their movie recommender system based on data on the users’ movie preferences.
This project will deal with collaborative filtering and matrix completion methods using preference data from MovieLens. It’s possible to focus on different things in the project. For instance,
SMRS: Statistical Methods for Recommender Systems
RLIB: R libraries for recommender systems
MCSVD: Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares by Trevor Hastie, Rahul Mazumder, Jason D. Lee, Reza Zadeh
LIBMF: LIBMF: A Matrix-factorization Library for Recommender Systems by the Machine Learning Group at National Taiwan University (note that you can use this library from R via the recosystem package, see RLIB)
SMRS can be accessed when you are on campus. It’s a general introduction to recommender systems. Consult RLIB for relevant R packages, in particular recommenderlab may be relevant for general methods, while recosystem (R wrapper of LIBMF) is for exploring matrix factorization based methods. The LIBMF page and the MCSVD paper may be consulted for more technical details.
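As a sketch of how the recosystem package (the R wrapper of LIBMF) could be used on MovieLens data: the file name ratings.csv and the column names userId, movieId and rating follow the common MovieLens csv layout and may need to be adjusted to the data you actually download.

library(readr)
library(recosystem)

ratings <- read_csv("ratings.csv")

## recosystem works with (user index, item index, rating) triplets
train_data <- data_memory(ratings$userId, ratings$movieId, ratings$rating,
                          index1 = TRUE)

## Fit a rank-10 matrix factorization model
r <- Reco()
r$train(train_data, opts = list(dim = 10, niter = 20, verbose = FALSE))

## Predict preferences for given (user, item) pairs
pred <- r$predict(data_memory(ratings$userId, ratings$movieId, index1 = TRUE),
                  out_memory())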
Data consists of document metadata available via open data access from the Danish Parliament. The purpose of this project is to build a topic model of the document titles.
This project is on the modeling of natural language data at the word occurrence level. That is, we regard words in a document title as sampled independently from a probabilistic model. This will not produce meaningful sentences, but it’s surprising how much information there is about the document in the choice of the words used. In particular, the word usage can be used to cluster documents in terms of topics.
To the best of my knowledge the data has not been analyzed before, so there are plenty of novel observations that can be made from rather basic descriptive explorations of the data, in particular because the documents come with various additional metadata (time stamps, type and category ids, etc.) that can be related to the characteristics of the document titles.
The main purpose of the project is, however, to build a topic model using the Latent Dirichlet Allocation model. There are different possibilities for the project such as
A data set for this project is available as a csv file:
folketing_open_data_documents.csv
Once you have downloaded the file (right click on the filename and choose download) you can read it into R using e.g. the readr package (part of Tidyverse).
readr::read_csv("folketing_open_data_documents.csv")
The data was extracted from Folketingets Åbne Data using the httr package in R. The data table contains 132,317 rows representing the same number of documents. The table contains four columns of different id-numbers, a title column and a date. The title column is the most important one for the project as this is the text data you will model.
There are more documents in the database, and it’s also possible to extract additional data on e.g. what the numerical id-columns represent, or even to link the document id to individuals and their roles.
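As a sketch of how the titles could be tokenized and a topic model fitted in R (following the tidy approach in TMwR): the column names id and titel below are guesses, so check the actual header of the csv file, and the Danish stop word list from the stopwords package is just one possible choice.

library(readr)
library(dplyr)
library(tidytext)
library(topicmodels)

docs <- read_csv("folketing_open_data_documents.csv")

## Tokenize the titles into single words and remove Danish stop words.
## NB: 'id' and 'titel' are assumed column names.
word_counts <- docs %>%
  unnest_tokens(word, titel) %>%
  filter(!word %in% stopwords::stopwords("da")) %>%
  count(id, word)

## Cast to a document-term matrix and fit an LDA model with, say, 10 topics
dtm <- cast_dtm(word_counts, id, word, n)
lda_fit <- LDA(dtm, k = 10, control = list(seed = 1))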
TMwR: Text Mining with R: A Tidy Approach by Julia Silge and David Robinson
LDA: Latent Dirichlet Allocation by Blei, Ng and Jordan
PRML: Pattern Recognition and Machine Learning by Christopher M. Bishop
TMwR is a relatively accessible introduction to text analysis in R, including a chapter on topic modeling. LDA is the original paper on the Latent Dirichlet Allocation model and should be quite accessible, though you may need to do some background reading. I recommend PRML in general; Chapters 10 and 11 are specifically relevant.
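For orientation, the generative model in LDA can be summarized as follows. With \(K\) topics, each topic \(k\) has a word distribution \(\beta_k\), and each document \(d\) has a topic distribution \(\theta_d\); each word in document \(d\) is then generated as \[\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad z \mid \theta_d \sim \mathrm{Multinomial}(\theta_d), \qquad w \mid z \sim \mathrm{Multinomial}(\beta_z).\] Fitting the model amounts to estimating the topic-word distributions and the document-topic weights, for which the LDA paper uses variational inference (PRML, Chapter 10), while other implementations use Gibbs sampling (PRML, Chapter 11).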
This project is on estimation of causal effects from observational data. That is, data collected without carrying out a randomized allocation of “treatments” to individuals. This is a major challenge of contemporary statistics, and it’s conceptually completely different from what you have learned in previous courses.
One data set is provided where effects are actually simulated (though covariates are real data). This will make it possible to investigate properties of different methods as we know the ground truth. Another data set on survival times is also provided. Prior experience with survival analysis is recommended for considering this data set.
The main purpose of this project is to work with the rigorous framework of causal models and causal inference. This can be seen as an extension of linear models as taught in Statistics 1 and 2. It’s not so much a technical extension but rather a conceptual extension that clarifies how we actually want to use linear models (and other methods) in practice to estimate causal effects.
The data set with simulated responses is available as csv files:
causal.csv
counterfactuals.csv
It has 50,000 rows and 180 columns. Disregarding the first column, which is a sample id, there are 177 covariates, one column named z representing the binary treatment, and one column named y representing the continuous response. As mentioned, the covariates are actually real data from the Linked Births and Infant Deaths Database. If you want to find out what the column names mean, you can read the user guide. The meaning of the variables is not linked to the simulated responses, and understanding the meaning might not be particularly important for the project, but it could help you understand the correlation patterns between the covariates.
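As a sketch of one standard estimator from CI, Part II, the following computes an inverse probability weighted estimate of the average treatment effect on the simulated data, using a logistic regression propensity score model with all covariates. This is only meant as a simple starting point, not the analysis you should end up with.

library(readr)

causal_data <- read_csv("causal.csv")
causal_data <- causal_data[, -1]   # drop the sample id (assumed first column)

## Propensity score: probability of treatment given the covariates
ps_fit <- glm(z ~ . - y, data = causal_data, family = binomial)
ps <- predict(ps_fit, type = "response")

z <- causal_data$z
y <- causal_data$y

## Inverse probability weighted estimate of the average treatment effect
ate_ipw <- mean(z * y / ps) - mean((1 - z) * y / (1 - ps))

## Naive difference in means for comparison (potentially confounded)
ate_naive <- mean(y[z == 1]) - mean(y[z == 0])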
The other data set for survival analysis is also available as a csv file:
VitaminD.csv
This data set contains the columns age (age at baseline), filaggrin (binary indicator of whether the subject has mutations in the filaggrin gene), vitd (vitamin D level at baseline, measured as serum 25-OH-D (nmol/L)), time (follow-up time) and death (indicator of whether the subject died during follow-up).
Once you have downloaded the file you need (right click on the filename and choose download) you can read it into R using e.g. the readr package (part of Tidyverse).
readr::read_csv("causal.csv") ## Or
readr::read_csv("VitaminD.csv")
CI: Causal Inference by Miguel A. Hernán and James M. Robins
ECI: Explanation in Causal Inference: Methods for Mediation and Interaction
BCIA: Benchmarking Framework for Performance-Evaluation of Causal Inference Analysis
IVCOX: Instrumental variables estimation under a structural Cox model
CI will be the main textbook for this project. In particular Part II of the book. Note that R code is available for the examples. BCIA describes a setup for assessing performance of statistical methods for causal inference, and we will return to this during the project, though it may not be central to the project in the beginning. FB is an interesting recent paper that attempts to benchmark methods for causal inference from observational data against randomized controlled trials in a marketing setting. ECI is an interesting further reading on mediation and interaction. IVCOX analyzes the survival data and is only relevant for the project using this data set.