---
title: "An introduction to plasso"
author:
- Michael Knaus
- Stefan Glaisner
date: "`r format(Sys.time(), '%B %d, %Y')`"
bibliography: assets/plasso_refs.bib
link-citations: true
output:
rmarkdown::html_vignette:
fig_width: 8
fig_height: 6
toc: true
code_folding: show
vignette: >
%\VignetteIndexEntry{An introduction to plasso}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
```{r, include=FALSE}
rm(list=ls())
gc()
hook_output <- knitr::knit_hooks$get("output")
knitr::knit_hooks$set(output = function(x, options) {
  # truncate printed chunk output to `out.lines` lines if that option is set
  if (!is.null(n <- options$out.lines)) {
    x <- xfun::split_lines(x)
    if (length(x) > n) {
      x <- c(head(x, n), "....\n")
    }
    x <- paste(x, collapse = "\n")
  }
  hook_output(x, options)
})
```
This notebook provides a detailed overview of the `plasso` package and its two main functions `plasso` and `cv.plasso`, which were developed in the course of @knaus. The package is strongly oriented towards the `glmnet` package and builds on its core function `glmnet`. The related theory and algorithms are described in @glmnet.
# Getting started
The latest development version of the package can be installed from its [Github page](https://github.com/stefan-1997/plasso); the installation requires the `devtools` package. The latest 'official' version can be installed from CRAN using `install.packages()`. We recommend the latter.
General dependencies are: `glmnet`, `Matrix`, `methods`, `parallel`, `doParallel`, `foreach` and `iterators`.
```{r startup, eval=FALSE, error=FALSE, warning=FALSE, message=FALSE}
# development version from GitHub (requires devtools)
library(devtools)
devtools::install_github("stefan-1997/plasso")
# or the current CRAN release (recommended)
install.packages("plasso")
```
Load `plasso` using `library()`.
```{r load}
library(plasso)
```
The package provides two functions, `plasso` and `cv.plasso`, which are both built on top of the `glmnet` functionality. Specifically, a `glmnet` object lives within both functions and is also returned in their outputs (list item `lasso_full`).
The term `plasso` refers to a Post-Lasso model, which estimates a least squares regression using only the active (i.e. non-zero) coefficients of a previously estimated Lasso model. This follows the idea that we want selection, but without shrinkage.
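To build intuition, here is a minimal, self-contained sketch of the Post-Lasso idea using `glmnet` directly on toy data. This is an illustration under simplified assumptions (the names `Xs`, `ys`, `fit` and the choice of lambda are ours), not the `plasso` implementation itself:
```{r postlasso_idea, eval=FALSE}
library(glmnet)
set.seed(1)
Xs <- matrix(rnorm(100 * 5), 100, 5)           # toy design with 5 covariates
ys <- Xs %*% c(1, -0.5, 0, 0, 0) + rnorm(100)  # only the first two have an effect
fit <- glmnet(Xs, ys)                          # Lasso path
b <- as.vector(coef(fit, s = 0.1))             # Lasso coefficients at lambda = 0.1
active <- which(b[-1] != 0)                    # active set (intercept dropped)
post <- lm(ys ~ Xs[, active, drop = FALSE])    # OLS refit: selection without shrinkage
```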
The package comes with simulated data representing the following data generating process (DGP):
The covariate matrix $X$ consists of 10 variables whose effect sizes on the target $Y$ are defined by the vector $\boldsymbol{\pi} = [1, -0.83, 0.67, -0.5, 0.33, -0.17, 0, ..., 0]'$, where the first six effect sizes decrease continuously in absolute terms from 1 to 0 and alternate in sign. The true causal effect of all other covariates is 0. The variables in $X$ follow a normal distribution with mean zero, and their covariance matrix is a Toeplitz matrix, which is characterized by constant diagonals:
$$
\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0.7 & 0.7^2 & ... & 0.7^{9} \\ 0.7 & 1 & 0.7 & ... & 0.7^{8} \\ 0.7^2 & 0.7 & 1 & ... & 0.7^{7} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0.7^{9} & 0.7^{8} & 0.7^{7} & ... & 1 \end{bmatrix}
$$
The target $\boldsymbol{y}$ is then a linear transformation of $\boldsymbol{X}$ plus a normally distributed error term. Each element of $\boldsymbol{y}$ is given by:
$$
y_i = \boldsymbol{X}_i \boldsymbol{\pi} + \varepsilon_i
$$
where $\varepsilon_i \sim \mathcal{N}(0,4)$.
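For illustration, the following sketch shows how data from this DGP could be simulated by hand (the names `Sigma`, `Xs`, `pi_vec` and `ys` are ours; the package already ships the data as `toeplitz`, so running this is not required):
```{r dgp_sketch, eval=FALSE}
library(MASS)
set.seed(42)
n <- 1000
Sigma <- stats::toeplitz(0.7^(0:9))               # Toeplitz covariance matrix
Xs <- mvrnorm(n, mu = rep(0, 10), Sigma = Sigma)  # normal covariates with mean zero
pi_vec <- c(1, -0.83, 0.67, -0.5, 0.33, -0.17, rep(0, 4))
ys <- Xs %*% pi_vec + rnorm(n, sd = 2)            # Var(eps) = 4, i.e. sd = 2
```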
```{r data}
data(toeplitz)
y = as.matrix(toeplitz[,1])  # first column is the outcome
X = toeplitz[,-1]            # remaining columns are the covariates
```
# plasso
`plasso` returns coefficient estimates for all lambda values of a standard `glmnet` object, both for a simple Lasso and for the associated Post-Lasso model.
```{r fitplasso, out.lines=10}
p = plasso::plasso(X,y)
```
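The returned object can be inspected like any R list; as noted above, the underlying `glmnet` fit is stored as the list element `lasso_full` (a quick, optional check):
```{r inspectplasso, eval=FALSE}
names(p)             # list elements of the plasso object
class(p$lasso_full)  # the underlying glmnet fit
```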
You can plot the coefficient paths for both the Post-Lasso model and the underlying 'original' Lasso model. This nicely illustrates the difference between the two: the Post-Lasso paths are characterized by jumps every time a new variable enters the active set.
```{r plotplasso, error=FALSE, warning=FALSE, message=FALSE}
plot(p, lasso=FALSE, xvar="lambda")
plot(p, lasso=TRUE, xvar="lambda")
```
We can also have a look at which coefficients are active for a chosen lambda value. Here the difference between Post-Lasso and Lasso becomes clearly visible: the Lasso model not only performs feature selection but also shrinkage, which results in its active coefficients being smaller (in absolute terms) than those of the Post-Lasso model:
```{r coefplasso, error=FALSE, warning=FALSE, message=FALSE}
coef_p = coef(p, s=0.01)
as.vector(coef_p$plasso)
as.vector(coef_p$lasso)
```
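A convenient way to see the shrinkage at a glance is to put both coefficient vectors side by side (a simple usage example based on the objects created above):
```{r compareplasso, eval=FALSE}
# columns: Post-Lasso (unshrunk OLS refit) vs. Lasso (shrunk) at the same lambda
round(cbind(plasso = as.vector(coef_p$plasso), lasso = as.vector(coef_p$lasso)), 3)
```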
# cv.plasso
The `cv.plasso` function uses cross-validation to assess the performance of different values of the `lambda` penalty term for both models (Post-Lasso and Lasso). The returned object of class `cv.plasso` includes the cross-validated mean squared errors.
Applying the `summary` method with the `default` parameter set to `FALSE` provides some informative output regarding the optimal choice of lambda.
```{r fitcvplasso}
p.cv = plasso::cv.plasso(X,y,kf=5)
summary(p.cv, default=FALSE)
```
The `plot` method extends the basic `glmnet` visualization with the cross-validated MSEs for the Post-Lasso model.
```{r plotcvplasso, fig.width=7, fig.height=3}
plot(p.cv, legend_pos="left", legend_size=0.5)
```
We can use the following code to get the optimal lambda value (here for the Post-Lasso model) and the associated coefficients at that value of $\lambda$.
```{r index_min_plasso}
p.cv$lambda_min_pl
```
```{r coef_min_plasso, out.lines=10}
coef_pcv = coef(p.cv, s="optimal")
as.vector(coef_pcv$plasso)
```