--- title: "An introduction to plasso" author: - Michael Knaus - Stefan Glaisner date: "`r format(Sys.time(), '%B %d, %Y')`" bibliography: assets/plasso_refs.bib link-citations: true output: rmarkdown::html_vignette: fig_width: 8 fig_height: 6 toc: true code_folding: show vignette: > %\VignetteIndexEntry{An introduction to plasso} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, include=FALSE} rm(list=ls()) gc() hook_output <- knitr::knit_hooks$get("output") knitr::knit_hooks$set(output = function(x, options) { if (!is.null(n <- options$out.lines)) { x <- xfun::split_lines(x) if (length(x) > n) { x <- c(head(x, n), "....\n") } x <- paste(x, collapse = "\n") } hook_output(x, options) }) ``` This notebook provides a detailed overview over the `plasso` package and its two main functions `plasso` and `cv.plasso` which were developed in the course of @knaus. This package is strongly oriented around the `glmnet` package and rests on its standard function `glmnet` in its very basis. Related theory and algorithms are described in @glmnet. # Getting started The very latest version of the package can be installed from its [Github page](https://github.com/stefan-1997/plasso). For the installation you will need the `devtools` package. The latest 'official' version can be installed from CRAN using `install.packages()'. We recommend the latter. General dependencies are: `glmnet`, `Matrix`, `methods`, `parallel`, `doParallel`, `foreach` and `iterators`. ```{r startup, eval=FALSE, error=FALSE, warning=FALSE, message=FALSE} library(devtools) devtools::install_github("stefan-1997/plasso") install.packages("plasso") ``` Load `plasso` using `library()`. ```{r load} library(plasso) ``` The package generally provides two functions `plasso` and `cv.plasso` which are both built on top of the `glmnet` functionality. Specifically, a `glmnet` object lives within both functions and also in their outputs (list item `lasso_full`). The term `plasso` refers to a Post-Lasso model which estimates a least squares algorithm only for the active (i.e. non-zero) coefficients of a previously estimated Lasso models. This follows the idea that we want to do selection but without shrinkage. The package comes with some simulated data representing the following DGP: The covariates matrix $X$ consists of 10 variables whose effect size one the target $Y$ is defined by the vector $\boldsymbol{\pi} = [1, -0.83, 0.67, -0.5, 0.33, -0.17, 0, ..., 0]'$ where the first six effect sizes decrease in absolute terms continuously from 1 to 0 and alternate in their sign. The true causal effect of all other covariates is 0. The variables in $X$ follow a normal distribution with mean zero while the covariance matrix follows a Toeplitz matrix, which is characterized by having constant diagonals: $$ \boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0.7 & 0.7^2 & ... & 0.7^{9} \\ 0.7 & 1 & 0.7 & ... & 0.7^{8} \\ 0.7^2 & 0.7 & 1 & ... & 0.7^{7} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0.7^{9} & 0.7^{8} & 0.7^{7} & ... & 1 \end{bmatrix} $$ The target $\boldsymbol{y}$ is then a linear transformation of $\boldsymbol{X}$ plus a vector of standard normal random variables. Each element of $\boldsymbol{y}$ is given by: $$ y_i = \boldsymbol{X}_i \boldsymbol{\pi} + \varepsilon_i $$ where $\varepsilon_i \sim \mathcal{N}(0,4)$. 
```{r data}
data(toeplitz)
y = as.matrix(toeplitz[,1])
X = toeplitz[,-1]
```

# plasso

`plasso` returns least squares estimates for all lambda values of a standard `glmnet` object, for both a simple Lasso and a Post-Lasso model.

```{r fitplasso, out.lines=10}
p = plasso::plasso(X,y)
```

You can plot the coefficient paths for both the Post-Lasso model and the underlying 'original' Lasso model. This nicely illustrates the difference between the two models: the Post-Lasso paths are characterized by jumps every time a new variable enters the active set.

```{r plotplasso, error=FALSE, warning=FALSE, message=FALSE}
plot(p, lasso=FALSE, xvar="lambda")
plot(p, lasso=TRUE, xvar="lambda")
```

We can also have a look at which coefficients are active for a chosen lambda value. Here, the difference between Post-Lasso and Lasso becomes clearly visible: the Lasso model applies not only feature selection but also shrinkage, so its active coefficients are smaller than those of the Post-Lasso model:

```{r coefplasso, error=FALSE, warning=FALSE, message=FALSE}
coef_p = coef(p, s=0.01)
as.vector(coef_p$plasso)
as.vector(coef_p$lasso)
```

# cv.plasso

The `cv.plasso` function uses cross-validation to assess the performance of different values of the `lambda` penalty term for both models (Post-Lasso and Lasso). The returned object of class `cv.plasso` includes the mean squared errors. Applying the `summary` method with the `default` parameter set to `FALSE` yields informative output on the optimal choice of lambda.

```{r fitcvplasso}
p.cv = plasso::cv.plasso(X,y,kf=5)
summary(p.cv, default=FALSE)
```

Using the `plot` method extends the basic `glmnet` visualization with the cross-validated MSEs for the Post-Lasso model.

```{r plotcvplasso, fig.width=7, fig.height=3}
plot(p.cv, legend_pos="left", legend_size=0.5)
```

We can use the following code to get the optimal lambda value (here for the Post-Lasso model) and the associated coefficients at that value of $\lambda$.

```{r index_min_plasso}
p.cv$lambda_min_pl
```

```{r coef_min_plasso, out.lines=10}
coef_pcv = coef(p.cv, s="optimal")
as.vector(coef_pcv$plasso)
```
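As a quick plausibility check, the coefficients at the cross-validated optimum can be used to compute in-sample fitted values by hand. The sketch below rests on two assumptions that you should verify against the package documentation: that the returned coefficient vectors store the intercept as their first element (the usual `glmnet` convention), and that the `coef` output for a `cv.plasso` object also contains a `lasso` element, as it does for `plasso` objects.

```{r manual_fit, eval=FALSE}
# Coefficient vectors at the optimal lambda (intercept assumed to come first)
b_plasso <- as.vector(coef_pcv$plasso)
b_lasso  <- as.vector(coef_pcv$lasso)

# Design matrix with a leading column of ones for the intercept
Xm <- cbind(1, as.matrix(X))

# In-sample mean squared errors of both models
c(mse_plasso = mean((y - Xm %*% b_plasso)^2),
  mse_lasso  = mean((y - Xm %*% b_lasso)^2))
```

Note that these are in-sample errors, so they will typically be lower than the cross-validated MSEs reported by `summary`.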