Finding Common Origins of Milky Way Stars

Author

Andersen Chang, Tiffany M. Tang, Tarek M. Zikry, Genevera I. Allen

Published

June 4, 2025

Dimension Reduction

We next explore several dimension reduction techniques, both for visualization and as a possible preprocessing step to reduce the data dimension prior to clustering.

Dimension Reduction Methods Under Study:

  • Principal Component Analysis (PCA)
  • tSNE with 2 dimensions and perplexity = 10, 30, 60, 100, or 300
  • UMAP with 2 dimensions and number of neighbors = 10, 30, 60, 100, or 300

Each of these dimension reduction methods was applied either to (i) the small set of 7 chemical abundance features (\(FE_{H}, MG_{FE}, O_{FE}, SI_{FE}, CA_{FE}, NI_{FE}, AL_{FE}\)), (ii) the medium set of 11 chemical abundance features ((i) plus \(C_{FE}, MN_{FE}, N_{FE}, K_{FE}\)), or (iii) the full set of 19 chemical abundance features ((ii) plus \(CI_{FE}, NA_{FE}, S_{FE}, TI_{FE},\) \(TIII_{FE}, V_{FE}, CR_{FE}, CO_{FE}\)).
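For concreteness, the three nested feature sets can be written as character vectors of column names. The exact identifiers below are assumptions about the underlying catalog and may differ from the actual data:

```r
# assumed column names for the three nested feature sets (hypothetical;
# the actual data may use different identifiers)
small_features <- c("FE_H", "MG_FE", "O_FE", "SI_FE", "CA_FE", "NI_FE", "AL_FE")
medium_features <- c(small_features, "C_FE", "MN_FE", "N_FE", "K_FE")
full_features <- c(
  medium_features,
  "CI_FE", "NA_FE", "S_FE", "TI_FE", "TIII_FE", "V_FE", "CR_FE", "CO_FE"
)
```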

Show Code to Fit Dimension Reduction Methods
## this code chunk fits the dimension reduction methods

# select dimension reduction hyperparameter grids
TSNE_PERPLEXITIES <- c(10, 30, 60, 100, 300)
UMAP_N_NEIGHBORS <- c(10, 30, 60, 100, 300)

# select dimension reduction methods
dr_fun_ls <- c(
  list("PCA" = fit_pca),
  purrr::map(
    TSNE_PERPLEXITIES,
    ~ purrr::partial(fit_tsne, dims = 2, perplexity = .x)
  ) |> 
    setNames(sprintf("tSNE (perplexity = %d)", TSNE_PERPLEXITIES)),
  purrr::map(
    UMAP_N_NEIGHBORS,
    ~ purrr::partial(fit_umap, dims = 2, n_neighbors = .x)
  ) |> 
    setNames(sprintf("UMAP (n_neighbors = %d)", UMAP_N_NEIGHBORS))
)

fit_results_fname <- file.path(RESULTS_PATH, "dimension_reduction_fits.rds")
if (!file.exists(fit_results_fname)) {
  # fit dimension reduction methods (if not already cached)
  dr_fit_ls <- purrr::map(
    train_data_ls,
    function(train_data) {
      purrr::map(dr_fun_ls, function(dr_fun) dr_fun(train_data))
    }
  )
  # save dimension reduction fits
  saveRDS(dr_fit_ls, file = fit_results_fname)
} else {
  # read in dimension reduction fits (if already cached)
  dr_fit_ls <- readRDS(fit_results_fname)
}

# aggregate all dimension reduction results into one df
plt_df <- purrr::list_flatten(dr_fit_ls, name_spec = "{inner} [{outer}]") |> 
  purrr::map(
    ~ .x$scores[, 1:2] |> 
      setNames(sprintf("Component %d", 1:2)) |> 
      dplyr::bind_cols(
        metadata$train |> dplyr::select(GC_NAME, GLAT, GLON)
      ) |> 
      dplyr::mutate(
        id = 1:dplyr::n()
      )
  ) |> 
  dr_results_to_df()
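The helper fitters (`fit_pca`, `fit_tsne`, `fit_umap`) are defined elsewhere; the chunk above only assumes that each returns a list with a `scores` component indexed like the training rows. A minimal sketch of what `fit_pca` might look like under that assumed interface:

```r
# minimal sketch of the assumed fitter interface: each fit_* function
# returns a list with a `scores` data frame (rows = observations),
# which downstream code accesses via .x$scores
fit_pca <- function(train_data) {
  pca_out <- stats::prcomp(train_data, center = TRUE, scale. = TRUE)
  list(
    scores = as.data.frame(pca_out$x),  # principal component scores
    loadings = pca_out$rotation         # loadings, used later for heatmaps
  )
}
```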

Hyperparameter Tuning: To tune the hyperparameters of tSNE and UMAP, we use the neighborhood retention metric: for each point, we compute the fraction of its \(k\) nearest neighbors in the original space that remain among its \(k\) nearest neighbors in the dimension-reduced space, and then average this fraction across points. A higher neighborhood retention rate indicates that the embedding better preserves the neighborhood structure of the data.
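The idea behind this metric can be sketched directly. The following is an assumed implementation of the computation underlying `eval_neighborhood_retention()` (whose actual definition lives elsewhere), for a single neighborhood size \(k\):

```r
# assumed implementation of the neighborhood retention idea behind
# eval_neighborhood_retention(), for a single neighborhood size k
neighborhood_retention <- function(orig_data, dr_data, k) {
  orig_d <- as.matrix(stats::dist(orig_data))
  dr_d <- as.matrix(stats::dist(dr_data))
  diag(orig_d) <- Inf  # exclude each point from its own neighborhood
  diag(dr_d) <- Inf
  n <- nrow(orig_d)
  retention <- vapply(
    seq_len(n),
    function(i) {
      orig_nn <- order(orig_d[i, ])[seq_len(k)]  # k-NN in original space
      dr_nn <- order(dr_d[i, ])[seq_len(k)]      # k-NN in embedding
      length(intersect(orig_nn, dr_nn)) / k      # fraction retained
    },
    numeric(1)
  )
  mean(retention)
}
```

A perfect embedding of the same points yields a retention of 1, while an embedding that scrambles all neighborhoods approaches 0.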

Show Code to Tune/Evaluate Dimension Reduction Methods
## this code chunk evaluates the dimension reduction methods

# evaluate neighborhood retention metric
Ks <- c(1, 5, 10, 25, 50, 100, 200, 300)

eval_results_fname <- file.path(RESULTS_PATH, "dimension_reduction_eval.rds")
if (!file.exists(eval_results_fname)) {
  # evaluate neighborhood retention (if not already cached)
  dr_eval_ls <- purrr::imap(
    dr_fit_ls,
    function(dr_out, key) {
      purrr::map(
        dr_out, 
        function(.x) {
          eval_neighborhood_retention(
            orig_data = train_data_ls[[key]],
            dr_data = .x$scores[, 1:min(ncol(.x$scores), 4)],
            ks = Ks
          )
        }
      )
    }
  )
  # save dimension reduction evaluation results
  saveRDS(dr_eval_ls, file = eval_results_fname)
} else {
  # read in dimension reduction evaluation results (if already cached)
  dr_eval_ls <- readRDS(eval_results_fname)
}

eval_plt_df <- purrr::list_flatten(
  dr_eval_ls, name_spec = "{inner} [{outer}]"
) |> 
  dr_results_to_df()

Hyperparameter Tuning via Neighborhood Retention

Below, we show an interactive plot (using plotly) of the neighborhood retention as the number of neighbors \(k\) varies, for each dimension reduction technique applied to each set of abundance features (small, medium, and full).
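A minimal version of such a plot might look like the following. The toy data frame stands in for `eval_plt_df`, and its column names (`k`, `retention`, `name`) are guesses; the real data frame produced by `dr_results_to_df()` may differ:

```r
library(plotly)

# toy stand-in for eval_plt_df (the real data frame is built from
# dr_eval_ls; these column names are assumptions)
eval_plt_df <- data.frame(
  k = rep(c(1, 5, 10, 25, 50), times = 2),
  retention = c(0.92, 0.85, 0.80, 0.74, 0.70, 0.65, 0.68, 0.71, 0.73, 0.76),
  name = rep(c("PCA", "tSNE (perplexity = 30)"), each = 5)
)

# one line per dimension reduction method: retention vs. neighborhood size
plt <- plot_ly(
  eval_plt_df,
  x = ~k, y = ~retention, color = ~name,
  type = "scatter", mode = "lines+markers"
) |>
  layout(
    xaxis = list(title = "Neighborhood size k"),
    yaxis = list(title = "Neighborhood retention")
  )
plt
```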

Main Takeaways:

  • Regardless of the neighborhood size \(k\) or the choice of UMAP's number-of-neighbors hyperparameter, UMAP almost always yields worse neighborhood retention than the better tSNE fits (e.g., tSNE with perplexity = 100).
  • Of the tSNE fits, there is no single tSNE hyperparameter that uniformly outperforms the others. However, tSNE with perplexity = 30 and 100 appear to strike a healthy balance between preserving local (i.e., small neighborhood sizes) and global (i.e., large neighborhood sizes) structure in the data, with perplexity = 30 performing slightly better at preserving local structure and perplexity = 100 performing slightly better at preserving global structure.
  • As expected, PCA yields the best neighborhood retention at large neighborhood sizes and is thus the best at preserving global structure in the data.

Note: try clicking on the legend to toggle the visibility of different dimension reduction methods.

Dimension Reduction Plots

We provide various visualizations of the dimension reduction results below, including:

  • Scatter plots of the first two components from each dimension reduction method, colored by the star’s GC
  • (Jittered) galactic coordinate plots of the stars, colored by the first and second components from each dimension reduction method
  • Heatmaps of the principal component loadings from PCA
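As one example, the first visualization type could be produced with a sketch like the following. The toy data frame stands in for `plt_df`, whose real columns come from the aggregation code above; the column names here are assumptions:

```r
library(ggplot2)

# toy stand-in for plt_df (the real one is assembled from the DR fits)
set.seed(3)
plt_df <- data.frame(
  `Component 1` = rnorm(100),
  `Component 2` = rnorm(100),
  GC_NAME = sample(c("NGC 104", "NGC 5139"), 100, replace = TRUE),
  check.names = FALSE
)

# scatter plot of the first two components, colored by the star's GC
p <- ggplot(plt_df, aes(x = `Component 1`, y = `Component 2`, color = GC_NAME)) +
  geom_point(alpha = 0.6, size = 0.8) +
  labs(color = "Globular cluster") +
  theme_minimal()
p
```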

Dimension reduction visualizations, colored by GC.

Galactic coordinates plot (jittered), colored by value of the first component from dimension reduction method.

Galactic coordinates plot (jittered), colored by value of the second component from dimension reduction method.

Principal Component Loadings
