convex_clustering calculates the convex clustering solution path at a user-specified grid of lambda values (or just a single value). It is, in general, difficult to know a useful set of lambda values a priori, so this function is more useful for timing comparisons and methodological research than applied work.

convex_clustering(
  X,
  ...,
  lambda_grid,
  weights = sparse_rbf_kernel_weights(k = "auto", phi = "auto", dist.method =
    "euclidean", p = 2),
  X.center = TRUE,
  X.scale = FALSE,
  norm = 2,
  impute_func = function(X) {
    if (anyNA(X)) missForest(X)$ximp else X
  },
  status = (interactive() && (clustRviz_logger_level() %in% c("MESSAGE", "WARNING",
    "ERROR")))
)

Arguments

X

The data matrix (\(X \in R^{n \times p}\)): rows correspond to the observations (to be clustered) and columns to the variables (which will not be clustered). If X has missing values (NA or NaN), they will be automatically imputed using impute_func.

...

Unused arguments. An error will be thrown if any unrecognized arguments are given. All arguments other than X must be given by name.

lambda_grid

A user-supplied set of \(\lambda\) values at which to solve the convex clustering problem. These must be strictly positive values and will be automatically sorted internally.
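Because useful \(\lambda\) values are hard to guess in advance, one common approach is to span several orders of magnitude with a log-spaced grid. The sketch below uses only base R. The helper name make_lambda_grid and its default endpoints are hypothetical choices for illustration, not part of the package.

```r
# Hypothetical helper (not part of clustRviz): build a log-spaced grid of
# strictly positive lambda values between lambda_min and lambda_max.
make_lambda_grid <- function(lambda_min = 0.01, lambda_max = 100, n_lambda = 25) {
  stopifnot(lambda_min > 0, lambda_max > lambda_min, n_lambda >= 2)
  exp(seq(log(lambda_min), log(lambda_max), length.out = n_lambda))
}

lambda_grid <- make_lambda_grid()
range(lambda_grid)

# The grid could then be passed by name (requires clustRviz and a data matrix X):
# fit <- convex_clustering(X, lambda_grid = lambda_grid)
```

Whether the chosen endpoints are appropriate depends on the scale of X and on the weights, so the defaults above should be treated as placeholders.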

weights

One of the following:

  • A function which, when called with argument X, returns an n-by-n matrix of fusion weights.

  • A matrix of size n-by-n containing fusion weights.
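As an illustration of the function form, the following base R sketch builds a dense Gaussian (RBF) kernel weight matrix. The name my_rbf_weights, the value of phi, and the choice of a zero diagonal are assumptions made for this example. It is not the package's sparse_rbf_kernel_weights implementation.

```r
# Hypothetical weight function (illustration only, not the package default):
# dense Gaussian kernel weights, w_ij = exp(-phi * ||x_i - x_j||^2).
my_rbf_weights <- function(X, phi = 0.5) {
  D2 <- as.matrix(dist(X, method = "euclidean"))^2  # n-by-n squared distances
  W <- exp(-phi * D2)
  diag(W) <- 0  # assumption: no weight between an observation and itself
  W
}

set.seed(1)
X_demo <- matrix(rnorm(20), nrow = 5, ncol = 4)  # 5 observations, 4 variables
W_demo <- my_rbf_weights(X_demo)
dim(W_demo)  # one row and one column per observation

# Either form could then be supplied by name (requires clustRviz):
# fit <- convex_clustering(X_demo, lambda_grid = 1:10, weights = my_rbf_weights)
# fit <- convex_clustering(X_demo, lambda_grid = 1:10, weights = W_demo)
```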

X.center

A logical: Should X be centered columnwise?

X.scale

A logical: Should X be scaled columnwise?

norm

Which norm to use in the fusion penalty? Currently only 1 and 2 (default) are supported.
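For orientation, the convex clustering problem is commonly written as below, where norm selects \(q\). This is the standard formulation from the literature. The exact scaling of the terms used internally is an assumption here, as are the symbols \(w_{ij}\) (the fusion weight between observations \(i\) and \(j\)) and \(U_{i\cdot}\) (the \(i\)-th row of \(U\)).

```latex
\[
\hat{U}_{\lambda}
  = \operatorname*{arg\,min}_{U \in R^{n \times p}}
    \frac{1}{2} \| X - U \|_F^2
    + \lambda \sum_{i < j} w_{ij} \, \| U_{i\cdot} - U_{j\cdot} \|_q ,
\qquad q \in \{1, 2\}
\]
```

As \(\lambda\) grows, the penalty pulls rows of \(U\) together, and observations whose rows coincide are treated as belonging to the same cluster.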

impute_func

A function used to impute missing data in X. By default, the missForest function from the package of the same name is used. This provides a flexible, potentially non-linear imputation function. The function must return a data matrix with no NA values. Note that, consistent with base R, both NaN and NA are treated as "missing values" for imputation.
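Where missForest is unavailable or too slow, a simpler imputation function can be supplied instead. The sketch below replaces each missing entry with its column mean. The name mean_impute is hypothetical, and column-mean imputation is a deliberately crude stand-in, not a recommendation.

```r
# Hypothetical replacement for the default imputation (illustration only):
# fill each missing entry with the mean of the observed values in its column.
mean_impute <- function(X) {
  X <- as.matrix(X)
  for (j in seq_len(ncol(X))) {
    missing <- is.na(X[, j])  # is.na() is TRUE for both NA and NaN
    if (any(missing)) {
      X[missing, j] <- mean(X[!missing, j])
    }
  }
  X
}

X_na <- matrix(c(1, NA, 3, 4, NaN, 6), nrow = 3)
X_filled <- mean_impute(X_na)
anyNA(X_filled)

# It could then be passed by name (requires clustRviz):
# fit <- convex_clustering(X_na, lambda_grid = 1:10, impute_func = mean_impute)
```

Note that a column with no observed values would still contain NaN after this function runs, so it would not satisfy the "no NA values" requirement in that case.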

status

Should a status message be printed to the console?

Value

An object of class convex_clustering containing the following elements (among others):

  • X: the original data matrix

  • n: the number of observations (rows of X)

  • p: the number of variables (columns of X)

  • X.center: a logical indicating whether X was centered column-wise before clustering

  • X.scale: a logical indicating whether X was scaled column-wise before clustering

  • weight_type: a record of the scheme used to create fusion weights

  • U: a tensor (3-array) of clustering solutions

Details

Compared to the CARP function, the returned object is much more "bare-bones," containing only the estimated \(U\) matrices, and no information used for dendrogram or path visualizations.

Examples

clustering_fit <- convex_clustering(presidential_speech[1:10,1:4], lambda_grid = 1:100)
#> Pre-computing weights and edge sets
#> Computing Convex Clustering Solutions
#> Post-processing
print(clustering_fit)
#> Convex Clustering Fit Summary
#> =============================
#>
#> Algorithm: ADMM [L2]
#> Grid: 101 values of lambda.
#> Fit Time: 0.004 secs
#> Total Time: 0.009 secs
#>
#> Number of Observations: 10
#> Number of Variables: 4
#>
#> Pre-processing options:
#>  - Columnwise centering: TRUE
#>  - Columnwise scaling: FALSE
#>
#> Weights:
#>  - Source: Radial Basis Function Kernel Weights
#>  - Distance Metric: Euclidean
#>  - Scale parameter (phi): 0.1 [Data-Driven]
#>  - Sparsified: 2 Nearest Neighbors [Data-Driven]
#>