class: center, middle, inverse, title-slide # Computational Statistics,
Functions and Environments ### Niels Richard Hansen ### September 2, 2019 --- ## Computational Statistics ** Problems ** * Compute nonparametric estimates. * Compute integrals (and probabilities) `\(\ \ \ \ \ \ E f(X) = \int f(X) \ dP \ \ \ \left(P(X \in A) = \int_{(X \in A)} \ dP\right)\)`. * Optimize the likelihood. ** Methods ** * Numerical linear algebra. * Monte Carlo integration. * EM-algorithm, stochastic gradient. --- ## Example, amino acid angles <img src="PhiPsi_creative.jpg" width="400" height="400" style="display: block; margin: auto;" /> --- ## Ramachandran plot .two-column-left[ ```r qplot(phi, psi, data = phipsi) ``` <!-- --> ] .two-column-right[ ```r qplot(phi, psi, data = phipsi2) ``` <!-- --> ] --- ## Example, amino acid angles .two-column-left[ ```r hist(phipsi$phi, prob = TRUE) rug(phipsi$phi) ``` <!-- --> ] .two-column-right[ ```r hist(phipsi$psi, prob = TRUE) rug(phipsi$psi) ``` <!-- --> ] --- ## Example, amino acid angles .two-column-left[ ```r lines(density(phipsi$phi), col = "red", lwd = 2) ``` <!-- --> ] .two-column-right[ ```r lines(density(phipsi$psi), col = "red", lwd = 2) ``` <!-- --> ] --- ## Statistical topics of the course * **Smoothing:** what does `density` do? + How do we compute nonparametric estimators? + How do we choose tuning parameters? * **Simulation:** how do we efficiently simulate from a target distribution? + How do we assess results from Monte Carlo methods? + What if we cannot compute the density? * **Optimization:** how do we compute the MLE? + What if we cannot compute the likelihood? + How to deal with very large data sets? --- ## Computational topics of the course * **Implementation**: writing statistical software. + R data structures and functions + S3 object oriented programming * **Correctness**: does the implementation do the right thing? + testing + debugging + accuracy of numerical computations * **Efficiency**: minimize memory and time usage. + benchmark code for comparison + profile code for identifying bottlenecks + optimize code (Rcpp) --- ## Prerequisites in R Good working knowledge of: * Data structures (vectors, lists, data frames). * Control structures (loops, if-then-else). * Function calling. * Interactive and script usages (`source`) of R. * You don't need to be an experienced programmer. --- ## Assignments The 8 assignments covering 4 topics will form the backbone for the course. Many lectures and practicals will be build around these assignments. You all need to register (in Absalon) for the presentation of one assignment solution. * Presentations are done in groups of two-three persons. * On four Wednesdays there will be presentations with discussion and feedback. * For the exam you need to prepare four *individual* presentations, one for each topic assignment. --- ## Exam For each of the four topics you choose one out of two assignments to prepare for the exam. * The exam assessment is based on your presentation *on the basis of the entire content of the course*. * Get started immediately and work continuously on the assignments as the course progresses. --- ## R programming R functions are fundamental. They don't do anything before they are called and the call is evaluated. An R function takes a number of *arguments*, and when a function call is evaluated it computes a *return value*. An R program consists of a hierarchy of function calls. When the program is executed, function calls are evaluated and replaced by their return values. Implementations of R functions are collected into source files, which can be organized into R packages. An R script (or R Markdown document) is a collection of R function calls, which, when evaluated, compute a desired result. R programming includes activities at many different levels of sophistication and abstraction. --- ## R programming R programming can be * writing R scripts (for data analysis), interactively running scripts and building reports or other output. * writing R functions to ease repetitive tasks (avoid copy-paste), to abstract computations, to improve overview by modularization etc. * developing R packages to ease distribution, to improve usability and documentation, to clarify dependencies etc. --- ## Writing an R function A function that counts the number of zeros in a vector. ```r countZero <- function(x) { nr <- 0 for(i in seq_along(x)) ## 1:length(x) wrong if length(x) is 0 nr <- nr + (x[i] == 0) ## logical coerced to numeric nr } ``` The implementation works in most programming languages: * Initialize a counter to be 0. * Loop through the elements of the vector. * Increment the counter whenever an element equals zero. * Return the value of the counter. --- ## Testing an R function ```r tests <- list( c(0, 0, 0, 0, 25), c(0, 1, 10, 0, 5, -4, 2.5), c(2, 1, 10), c()) ## Extreme case sapply(tests, countZero) ``` ``` ## [1] 4 2 0 0 ``` We can manually inspect that the results are correct, or we can do it programmatically. ```r all(sapply(tests, countZero) == c(4, 2, 0, 0)) ``` ``` ## [1] TRUE ``` --- ## Writing an R function, version 2 ```r countZero2 <- function(x) sum(x == 0) ``` Testing: ```r all(sapply(tests, countZero) == sapply(tests, countZero2)) ``` ``` ## [1] TRUE ``` ```r identical( sapply(tests, countZero), sapply(tests, countZero2) ) ``` ``` ## [1] FALSE ``` --- ## Results are equal but not identical!? ```r typeof(sapply(tests, countZero)) ``` ``` ## [1] "double" ``` ```r countZero <- function(x) { ## counter now of type integer! * nr <- 0L for(i in seq_along(x)) nr <- nr + (x[i] == 0) nr } ``` ```r typeof(sapply(tests, countZero)) ``` ``` ## [1] "integer" ``` --- ## Results are now identical ```r identical( sapply(tests, countZero), sapply(tests, countZero2) ) ``` ``` ## [1] TRUE ``` --- ## Vectorized R code ```r countZero2 ``` ``` ## function(x) ## sum(x == 0) ## <bytecode: 0x7fcc29b8d158> ``` * The expression `x == 0` tests each entry of the vector `x` for being equal to 0 and returns a vector of logicals. * The `sum` function computes and returns the sum of all elements in a vector. Logicals are coerced to integers. * The vectorized implementation is short but expressive and easy to understand. * The vectorized computations are performed by compiled code. They are faster to evaluate than R code. * Writing vectorized code requires a larger knowledge of R functions. --- # Exercises 1.1, 1.2, 1.3 <img src="PracIcon.png" width="400" style="display: block; margin: auto;" /> --- ## Functions R functions consist of three components. ```r foo <- function(x) { x^2 + x + 1 } ``` * The *body* of the function, which is the code in the function. * The *argument list* that the function takes. * The *enclosing environment* where the function was created. --- ## Functions ```r body(foo) ``` ``` ## { ## x^2 + x + 1 ## } ``` ```r formals(foo) ## formal argument list ``` ``` ## $x ``` ```r environment(foo) ## enclosing environment ``` ``` ## <environment: R_GlobalEnv> ``` --- ## Function evaluation When functions are called the body is evaluated with formal arguments replaced by actual arguments (values). ```r foo(10) ## body evaluated with x = 10 ``` ``` ## [1] 111 ``` The result is the same as: ```r {x <- 10; x^2 + x + 1} ``` ``` ## [1] 111 ``` --- ## Function body The function components can be manipulated like other R components. ```r body(foo) <- quote({y*x^2 + x + 1}) foo ## new body ``` ``` ## function (x) ## { ## y * x^2 + x + 1 ## } ``` The function `quote` prevents R from evaluating the expression `{y*x^2 + x + 1}` and simply returns the unevaluated R expression, which then replaces the body of the function `foo`. --- ## Function evaluation What happens when we try to evaluate the new function? ```r foo(10) ``` ``` ## Error in foo(10): object 'y' not found ``` ```r y <- 2 ## y is in the global environment foo(10) ``` ``` ## [1] 211 ``` Variables that are not assigned in the body or are in the list of formal arguments are searched for in the functions *enclosing environment*. --- ## Changing the enclosing environment The enclosing environment can be changed, and values can be assigned to variables in environments. ```r environment(foo) <- new.env() foo(10) ``` ``` ## [1] 211 ``` ```r assign("y", 1, environment(foo)) ``` ```r foo(10) ## What is the result and why? ``` ``` ## [1] 111 ``` --- ## Changing the argument list The argument list can be changed as well. ```r formals(foo) <- alist(x = , y = ) ``` ```r foo(10) ## What will happen? ``` ``` ## Error in foo(10): argument "y" is missing, with no default ``` ```r foo(10, 2) ## Better? ``` ``` ## [1] 211 ``` --- ## How functions work in principle * When a function is called, the actual arguments are matched against the *formal arguments*. * Expressions in the actual argument list are evaluated in the *calling environment*. * Then the *body* of the function is evaluated. * Variables that are not arguments or defined inside the body are searched for in the *enclosing environment*. * The value of the last expression in the body is returned by the function. ```r foo(10, y^2) ``` ``` ## [1] 411 ``` --- ## Scoping rules For a variable we distinguish between the *symbol* (the `y`) and its value (`2`, say). The rules for finding the symbol are called *scoping* rules, and R implements *lexical scoping*. Symbols reside in environments, and when a function call is evaluated R searches for symbols in the environment where the function is *defined* and *not* where it is called. In practice, symbols are recursively searched for by starting in the functions *evaluation environment*, then its *enclosing enviroment* and so on until the empty environment. For interactive usage lexical scoping can appear strange. Setting "global variables" might not always have the "expected effect". The "expectation" is typically a *dynamic scoping* rule, which is difficult to predict and reason about as a programmer. --- ## Functions as first-class objects A function is an R language object that can be treated like other objects. It is a so-called *first-class object*. The language can create and manipulate functions during execution. This means that functions can be arguments. ```r head(optim, 6) ``` ``` ## ## 1 function (par, fn, gr = NULL, ..., method = c("Nelder-Mead", ## 2 "BFGS", "CG", "L-BFGS-B", "SANN", "Brent"), lower = -Inf, ## 3 upper = Inf, control = list(), hessian = FALSE) ## 4 { ## 5 fn1 <- function(par) fn(par, ...) ## 6 gr1 <- if (!is.null(gr)) ``` --- ## Functions as first-class objects Functions can be return values, like in the *function operator* `Vectorize`. ``` ## ## 1 function (FUN, vectorize.args = arg.names, SIMPLIFY = TRUE, ## 2 ... ## 17 FUNV <- function() { ## 18 args <- lapply(as.list(match.call())[-1L], ## 19 eval, parent.frame()) ## 20 ... ## 24 do.call("mapply", c(FUN = FUN, args[dovec], ## 25 MoreArgs = list(args[!dovec]), SIMPLIFY = SIMPLIFY, ## 26 USE.NAMES = USE.NAMES)) ## 27 } ## 28 formals(FUNV) <- formals(FUN) ## 29 FUNV ``` Chapters 10-12 in AvdR cover functional programming in R, but will not be dealt with systematically in this course. --- ## Computing on the language R can extract and compute with its own language constructs. This can be used for *metaprogramming*; writing programs that write programs. Some concrete examples of *computing on the language* are in plot functions for label construction, and in `subset` and `lm` for implementing *non-standard evaluation*, which circumvents the usual scoping rules. The formula interface is an example of a *domain specific language* implemented in R for model specification that uses various techniques for computing on the language. Chapters 13-15 in AvdR cover these techniques in greater detail, but will not be dealt with systematically in this course. --- ## Return values A function can return any kind of R object either explicitly by a `return` statement or as the value of the last evaluated expression in the body. Many functions return basic data structures such as a number, a vector, a matrix or a data frame. Some functions return other functions. Lists can be given a *class label* (`"integrate"` below). This is part of the S3 object system in R, which will be covered next week. --- ## Return values Multiple return values that don't fit into the basic data structures can be returned as a list. ```r integrate(sin, 0, 1) ``` ``` ## 0.4596977 with absolute error < 5.1e-15 ``` ```r str(integrate(sin, 0, 1)) ``` ``` ## List of 5 ## $ value : num 0.46 ## $ abs.error : num 5.1e-15 ## $ subdivisions: int 1 ## $ message : chr "OK" ## $ call : language integrate(f = sin, lower = 0, upper = 1) ## - attr(*, "class")= chr "integrate" ``` --- # Exercises 1.4, 1.5, 1.6 <img src="PracIcon.png" width="400" style="display: block; margin: auto;" /> --- ## Something about environments Recall that several environments are associated with a function. <img src="binding-2.png" width="600" style="display: block; margin: auto;" /> The *binding environment* of `g` is an environment, whose parent is the global environment. --- ## Calling and enclosing environments <img src="binding-2.png" width="400" style="display: block; margin: auto;" /> Suppose that `g <- function(z) z + y`. * What is `g(1)`? * Does it depend on the *calling environment* of `g`? * What is the *enclosing environment* of `g`? --- ## Evaluation environment .two-column-left[ <img src="execution.png" width="350" style="display: block; margin: auto;" /> ] .two-column-right[ ```r h <- function(x) { a <- 2 x + a } y <- h(1) ``` ```r y ``` ``` ## [1] 3 ``` ```r a ``` ``` ## Error in eval(expr, envir, enclos): object 'a' not found ``` ] --- ## Namespaces <img src="namespace.png" width="700" style="display: block; margin: auto;" /> --- ## Function factories ```r fooFactory <- function(y) { function(x) y * x^2 + x + 1 } ``` ```r foo <- fooFactory(2) body(foo) ``` ``` ## y * x^2 + x + 1 ``` ```r foo(10) ``` ``` ## [1] 211 ``` ```r foo2 <- fooFactory(3) ``` --- ## Function factories ```r foo2(10) ``` ``` ## [1] 311 ``` ```r y <- 1 foo(10) ``` ``` ## [1] 211 ``` ```r foo2(10) ``` ``` ## [1] 311 ``` Setting a value of `y` in the calling environment of `foo` or `foo2` doesn't change the behavior of the functions. --- ## Function factories ```r environment(foo) ``` ``` ## <environment: 0x7f95463f8fc0> ``` ```r ls(environment(foo)) ``` ``` ## [1] "y" ``` ```r get("y", environment(foo)) ``` ``` ## [1] 2 ``` The function `fooFactory` is called a function factory because it returns (produces) functions. It's a way to provide a function with a local storage (its environment) for "configuration values". --- ## Function factories ```r environment(foo2) ``` ``` ## <environment: 0x7f9546c5d7f8> ``` Two different functions produced by the function factory have different enclosing environments. ```r parent.env(environment(foo2)) ``` ``` ## <environment: R_GlobalEnv> ``` ```r parent.env(environment(foo)) ``` ``` ## <environment: R_GlobalEnv> ``` The parent environment is the enclosing environment of the factory, in this case the global environment. --- # Exercises 1.7, 1.8, 1.9, 1.10 <img src="PracIcon.png" width="400" style="display: block; margin: auto;" />