--- title: "An introduction to plasso" author: - Michael Knaus - Stefan Glaisner date: "`r format(Sys.time(), '%B %d, %Y')`" bibliography: assets/plasso_refs.bib link-citations: true output: rmarkdown::html_vignette: fig_width: 8 fig_height: 6 toc: true code_folding: show vignette: > %\VignetteIndexEntry{An introduction to plasso} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, include=FALSE} rm(list=ls()) gc() hook_output <- knitr::knit_hooks$get("output") knitr::knit_hooks$set(output = function(x, options) { if (!is.null(n <- options$out.lines)) { x <- xfun::split_lines(x) if (length(x) > n) { x <- c(head(x, n), "....\n") } x <- paste(x, collapse = "\n") } hook_output(x, options) }) ``` This notebook provides a detailed overview over the `plasso` package and its two main functions `plasso` and `cv.plasso` which were developed in the course of @knaus. This package is strongly oriented around the `glmnet` package and rests on its standard function `glmnet` in its very basis. Related theory and algorithms are described in @glmnet. # Getting started The very latest version of the package can be installed from its [Github page](https://github.com/stefan-1997/plasso). For the installation you will need the `devtools` package. The latest 'official' version can be installed from CRAN using `install.packages()'. We recommend the latter. General dependencies are: `glmnet`, `Matrix`, `methods`, `parallel`, `doParallel`, `foreach` and `iterators`. ```{r startup, eval=FALSE, error=FALSE, warning=FALSE, message=FALSE} library(devtools) devtools::install_github("stefan-1997/plasso") install.packages("plasso") ``` Load `plasso` using `library()`. ```{r load} library(plasso) ``` The package generally provides two functions `plasso` and `cv.plasso` which are both built on top of the `glmnet` functionality. Specifically, a `glmnet` object lives within both functions and also in their outputs (list item `lasso_full`). The term `plasso` refers to a Post-Lasso model which estimates a least squares algorithm only for the active (i.e. non-zero) coefficients of a previously estimated Lasso models. This follows the idea that we want to do selection but without shrinkage. The package comes with some simulated data representing the following DGP: The covariates matrix $X$ consists of 10 variables whose effect size one the target $Y$ is defined by the vector $\boldsymbol{\pi} = [1, -0.83, 0.67, -0.5, 0.33, -0.17, 0, ..., 0]'$ where the first six effect sizes decrease in absolute terms continuously from 1 to 0 and alternate in their sign. The true causal effect of all other covariates is 0. The variables in $X$ follow a normal distribution with mean zero while the covariance matrix follows a Toeplitz matrix, which is characterized by having constant diagonals: $$ \boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0.7 & 0.7^2 & ... & 0.7^{9} \\ 0.7 & 1 & 0.7 & ... & 0.7^{8} \\ 0.7^2 & 0.7 & 1 & ... & 0.7^{7} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0.7^{9} & 0.7^{8} & 0.7^{7} & ... & 1 \end{bmatrix} $$ The target $\boldsymbol{y}$ is then a linear transformation of $\boldsymbol{X}$ plus a vector of standard normal random variables. Each element of $\boldsymbol{y}$ is given by: $$ y_i = \boldsymbol{X}_i \boldsymbol{\pi} + \varepsilon_i $$ where $\varepsilon_i \sim \mathcal{N}(0,4)$. 
```{r data}
data(toeplitz)
y = as.matrix(toeplitz[,1])
X = toeplitz[,-1]
```

# plasso

`plasso` returns least squares estimates for all lambda values of a standard `glmnet` object, for both a simple Lasso and a Post-Lasso model.

```{r fitplasso, out.lines=10}
p = plasso::plasso(X,y)
```

You can plot the coefficient paths for both the Post-Lasso model and the underlying 'original' Lasso model. This nicely illustrates the difference between the two models: the Post-Lasso paths are characterized by jumps every time a new variable enters the active set.

```{r plotplasso, error=FALSE, warning=FALSE, message=FALSE}
plot(p, lasso=FALSE, xvar="lambda")
plot(p, lasso=TRUE, xvar="lambda")
```

We can also have a look at which coefficients are active for a chosen lambda value. Here, the difference between Post-Lasso and Lasso becomes clearly visible: the Lasso model applies not only feature selection but also shrinkage, so its active coefficients are smaller than those of the Post-Lasso model:

```{r coefplasso, error=FALSE, warning=FALSE, message=FALSE}
coef_p = coef(p, s=0.01)
as.vector(coef_p$plasso)
as.vector(coef_p$lasso)
```

# cv.plasso

The `cv.plasso` function uses cross-validation to assess the performance of different values of the `lambda` penalty term for both models (Post-Lasso and Lasso). The returned object of class `cv.plasso` includes the mean squared errors. Applying the `summary` method with the `default` parameter set to `FALSE` yields informative output on the optimal choice of lambda.

```{r fitcvplasso}
p.cv = plasso::cv.plasso(X,y,kf=5)
summary(p.cv, default=FALSE)
```

Using the `plot` method extends the basic `glmnet` visualization with the cross-validated MSEs for the Post-Lasso model.

```{r plotcvplasso, fig.width=7, fig.height=3}
plot(p.cv, legend_pos="left", legend_size=0.5)
```

We can use the following code to get the optimal lambda value (here for the Post-Lasso model) and the associated coefficients at that value of $\lambda$.

```{r index_min_plasso}
p.cv$lambda_min_pl
```

```{r coef_min_plasso, out.lines=10}
coef_pcv = coef(p.cv, s="optimal")
as.vector(coef_pcv$plasso)
```
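As a quick plausibility check, the coefficients at the cross-validated optimum can be used to compute in-sample fitted values by hand. The sketch below rests on two assumptions that you should verify against the package documentation: that the returned coefficient vectors store the intercept as their first element (the usual `glmnet` convention), and that the `coef` output for a `cv.plasso` object also contains a `lasso` element, as it does for `plasso` objects.

```{r manual_fit, eval=FALSE}
# Coefficient vectors at the optimal lambda (intercept assumed to come first)
b_plasso <- as.vector(coef_pcv$plasso)
b_lasso  <- as.vector(coef_pcv$lasso)

# Design matrix with a leading column of ones for the intercept
Xm <- cbind(1, as.matrix(X))

# In-sample mean squared errors of both models
c(mse_plasso = mean((y - Xm %*% b_plasso)^2),
  mse_lasso  = mean((y - Xm %*% b_lasso)^2))
```

Note that these are in-sample errors, so they will typically be lower than the cross-validated MSEs reported by `summary`.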