Fit a GLM with Exclusive Lasso Regularization — exclusive_lasso

Fit a generalized linear model via maximum penalized likelihood using the exclusive lasso penalty. The regularization path is computed along a grid of values for the regularization parameter (lambda). The interface is intentionally similar to that of glmnet in the package of the same name.

exclusive_lasso(
  X,
  y,
  groups,
  family = c("gaussian", "binomial", "poisson"),
  weights,
  offset,
  nlambda = 100,
  lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04),
  lambda,
  standardize = TRUE,
  intercept = TRUE,
  lower.limits = rep(-Inf, nvars),
  upper.limits = rep(Inf, nvars),
  thresh = 1e-07,
  thresh_prox = thresh,
  skip_df = FALSE,
  algorithm = c("cd", "pg")
)

Arguments

X	The matrix of predictors ($X \in \R^{n \times p}$)
y	The response vector ($y$)
groups	An integer vector of length $p$ indicating group membership. (Cf. the `group` argument of `grpreg`)
family	The GLM response type. (Cf. the `family` argument of `glm`)
weights	Weights applied to individual observations. If not supplied, all observations are weighted equally. Weights are re-scaled to sum to $n$ if necessary. (Cf. the `weights` argument of `lm`)
offset	A vector of length $n$ included in the linear predictor.
nlambda	The number of lambda values to use in computing the regularization path. Note that the time to run is typically sublinear in the grid size due to the use of warm starts.
lambda.min.ratio	The smallest value of lambda to be used, as a fraction of the largest value of lambda used. Unlike the lasso, the exclusive lasso has no value of lambda at which the solution is wholly sparse, so the lasso's lambda_max is used as the largest value of the grid.
lambda	A user-specified sequence of lambdas to use.
standardize	Should `X` be centered and scaled before fitting?
intercept	Should the fitted model have an (unpenalized) intercept term?
lower.limits	A vector of lower bounds for each coefficient (default `-Inf`). Can either be a scalar (applied to each coefficient) or a vector of length `p` (number of coefficients).
upper.limits	A vector of upper bounds for each coefficient (default `Inf`). Can either be a scalar (applied to each coefficient) or a vector of length `p` (number of coefficients).
thresh	The convergence threshold for the outer algorithm (proximal gradient or coordinate descent) that solves the penalized regression problem.
thresh_prox	The convergence threshold for the inner coordinate-descent algorithm that evaluates the proximal operator.
skip_df	Should the DF calculations be skipped? They are often slower than the actual model fitting; if calling `exclusive_lasso` repeatedly it may be useful to skip these calculations.
algorithm	Which algorithm to use, proximal gradient (`"pg"`) or coordinate descent (`"cd"`)? Empirically, coordinate descent appears to be faster for most problems (consistent with Campbell and Allen), but proximal gradient may be faster for certain problems with many small groups where the proximal operator may be evaluated quickly and to high precision.
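
The default lambda grid implied by `nlambda` and `lambda.min.ratio` can be sketched in base R as follows. This is an illustration, not the package's code: it assumes the glmnet-style lasso value of lambda_max for a Gaussian response, a log-spaced decreasing grid, and no observation weights or offset. The names `lambda_max`, `lambda_min_ratio`, and `lambda_grid` are hypothetical.

```r
set.seed(1)
n <- 50
p <- 20

# Standardized predictors and a centered response (assumed preprocessing;
# scale() uses the n - 1 variance convention, which may differ from the package)
X <- scale(matrix(rnorm(n * p), ncol = p))
y <- rnorm(n)
y_centered <- y - mean(y)

# Lasso lambda_max under the "1/n" loss scaling: the smallest lambda at which
# the lasso solution is entirely zero
lambda_max <- max(abs(crossprod(X, y_centered))) / n

# Same default rule as in the usage above
lambda_min_ratio <- ifelse(n < p, 0.01, 1e-04)
nlambda <- 100

# Log-spaced grid from lambda_max down to lambda_min_ratio * lambda_max
lambda_grid <- exp(seq(log(lambda_max),
                       log(lambda_min_ratio * lambda_max),
                       length.out = nlambda))
```

A grid built this way could be passed through the `lambda` argument in place of the default.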

Value

An object of class ExclusiveLassoFit containing

coef - A matrix of estimated coefficients
intercept - A vector of estimated intercepts if intercept=TRUE
X, y, groups, weights, offset - The data used to fit the model
lambda - The vector of $\lambda$ used
df - An unbiased estimate of the degrees of freedom (see Theorem 5 in Campbell and Allen (2017))
nnz - The number of non-zero coefficients at each value of $\lambda$

Details

Note that unlike Campbell and Allen (2017), we use the "1/n"-scaling of the loss function.

For the Gaussian case: $$\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda P(\beta, G)$$

For other GLMs: $$-\frac{1}{n}\ell(y, X\beta)+ \lambda P(\beta, G)$$
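
The Gaussian objective above can be evaluated directly in base R. This page does not define $P(\beta, G)$, so the sketch assumes the usual exclusive lasso penalty, $P(\beta, G) = \frac{1}{2}\sum_{g \in G} \|\beta_g\|_1^2$; the factor of 1/2 in particular is an assumption to check against the package vignette. The helper names `exclusive_penalty` and `gaussian_objective` are hypothetical and are not part of the package.

```r
# Assumed penalty: half the sum over groups of the squared within-group L1 norm
exclusive_penalty <- function(beta, groups) {
  group_l1 <- tapply(abs(beta), groups, sum)  # L1 norm of each group
  sum(group_l1^2) / 2
}

# Gaussian objective with the "1/n" scaling of the loss (no intercept, no weights)
gaussian_objective <- function(X, y, beta, groups, lambda) {
  n <- nrow(X)
  resid <- y - X %*% beta
  sum(resid^2) / (2 * n) + lambda * exclusive_penalty(beta, groups)
}

# Two groups of two coefficients: the group L1 norms are 3 and 3
toy_penalty <- exclusive_penalty(c(1, -2, 3, 0), c(1, 1, 2, 2))
```

Because the L1 norm is squared within each group, the penalty shrinks every group but does not zero out a whole group, which matches the note above that no value of lambda gives a wholly sparse solution.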

By default, an optimized implementation is used for family="gaussian" which is approximately 2x faster for most problems. If you wish to disable this code path and use the standard GLM implementation with Gaussian response, set options(ExclusiveLasso.gaussian_fast_path=FALSE).

References

Campbell, Frederick and Genevera I. Allen. "Within Group Variable Selection with the Exclusive Lasso". Electronic Journal of Statistics 11(2), pp.4220-4257. 2017. doi: 10.1214/17-EJS1317

Examples

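library(ExclusiveLasso)
set.seed(1)  # for reproducibility

n <- 200  # number of observations
p <- 500  # number of predictors
# Group labels cycle 1, 2, ..., 10, giving 10 groups of 50 predictors each
groups <- rep(1:10, times=50)
beta <- numeric(p)
beta[1:10] <- 3  # one true signal in each group

X <- matrix(rnorm(n * p), ncol=p)
y <- X %*% beta + rnorm(n)

exfit <- exclusive_lasso(X, y, groups)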
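
The components listed under Value can then be inspected on the fitted object. The sketch below is hedged: it assumes the `ExclusiveLassoFit` object is list-like so that its components are reachable with `$`, and it makes no claim about the orientation of the `coef` matrix. It passes `skip_df = TRUE` to avoid the slower DF calculations, so `df` is not examined, and the whole block is skipped when the package is not installed.

```r
if (requireNamespace("ExclusiveLasso", quietly = TRUE)) {
  set.seed(1)
  n <- 200
  p <- 500
  groups <- rep(1:10, times = 50)
  beta <- numeric(p)
  beta[1:10] <- 3

  X <- matrix(rnorm(n * p), ncol = p)
  y <- X %*% beta + rnorm(n)

  # skip_df = TRUE skips the degrees-of-freedom estimate to save time
  exfit <- ExclusiveLasso::exclusive_lasso(X, y, groups, skip_df = TRUE)

  print(length(exfit$lambda))  # number of lambda values on the path
  print(dim(exfit$coef))       # shape of the coefficient matrix
  print(range(exfit$nnz))      # fewest and most non-zero coefficients
}
```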
