CARP returns a fast approximation to the Convex Clustering solution path along with visualizations such as dendrograms and cluster paths. CARP solves the Convex Clustering problem via an efficient Algorithmic Regularization scheme.

CARP(
  X,
  ...,
  weights = sparse_rbf_kernel_weights(k = "auto", phi = "auto", dist.method =
    "euclidean", p = 2),
  labels = rownames(X),
  X.center = TRUE,
  X.scale = FALSE,
  back_track = FALSE,
  exact = FALSE,
  norm = 2,
  t = 1.05,
  npcs = min(4L, NCOL(X), NROW(X)),
  dendrogram.scale = NULL,
  impute_func = function(X) {     if (anyNA(X))          missForest(X)$ximp     else X
    },
  status = (interactive() && (clustRviz_logger_level() %in% c("MESSAGE", "WARNING",
    "ERROR")))
)

Arguments

X

The data matrix (\(X \in R^{n \times p}\)): rows correspond to the observations (to be clustered) and columns to the variables (which will not be clustered). If X has missing values - NA or NaN values - they will be automatically imputed.

...

Unused arguements. An error will be thrown if any unrecognized arguments as given. All arguments other than X must be given by name.

weights

One of the following:

  • A function which, when called with argument X, returns an b-by-n matrix of fusion weights.

  • A matrix of size n-by-n containing fusion weights

labels

A character vector of length \(n\): observations (row) labels

X.center

A logical: Should X be centered columnwise?

X.scale

A logical: Should X be scaled columnwise?

back_track

A logical: Should back-tracking be used to exactly identify fusions? By default, back-tracking is not used.

exact

A logical: Should the exact solution be computed using an iterative algorithm? By default, algorithmic regularization is applied and the exact solution is not computed. Setting exact = TRUE often significantly increases computation time.

norm

Which norm to use in the fusion penalty? Currently only 1 and 2 (default) are supported.

t

A number greater than 1: the size of the multiplicative update to the cluster fusion regularization parameter (not used by back-tracking variants). Typically on the scale of 1.005 to 1.1.

npcs

An integer >= 2. The number of principal components to compute for path visualization.

dendrogram.scale

A character string denoting how the scale of dendrogram regularization proportions should be visualized. Choices are 'original' or 'log'; if not provided, a data-driven heuristic choice is used.

impute_func

A function used to impute missing data in X. By default, the missForest function from the package of the same name is used. This provides a flexible potentially non-linear imputation function. This function has to return a data matrix with no NA values. Note that, consistent with base R, both NaN and NA are treaded as "missing values" for imputation.

status

Should a status message be printed to the console?

Value

An object of class CARP containing the following elements (among others):

  • X: the original data matrix

  • n: the number of observations (rows of X)

  • p: the number of variables (columns of X)

  • alg.type: the CARP variant used

  • X.center: a logical indicating whether X was centered column-wise before clustering

  • X.scale: a logical indicating whether X was scaled column-wise before centering

  • weight_type: a record of the scheme used to create fusion weights

Examples

carp_fit <- CARP(presidential_speech[1:10,1:4])
#> Pre-computing weights and edge sets
#> Computing Convex Clustering [CARP] Path
#> Post-processing
print(carp_fit)
#> CARP Fit Summary #> ==================== #> #> Algorithm: CARP (t = 1.05) #> Fit Time: 0.008 secs #> Total Time: 1.025 secs #> #> Number of Observations: 10 #> Number of Variables: 4 #> #> Pre-processing options: #> - Columnwise centering: TRUE #> - Columnwise scaling: FALSE #> #> Weights: #> - Source: Radial Basis Function Kernel Weights #> - Distance Metric: Euclidean #> - Scale parameter (phi): 0.1 [Data-Driven] #> - Sparsified: 2 Nearest Neighbors [Data-Driven] #>
plot(carp_fit)