Finding Common Origins of Milky Way Stars

Author

Andersen Chang\(^*\), Tiffany M. Tang\(^*\), Tarek M. Zikry\(^*\), Genevera I. Allen

Published

June 4, 2025

Loading, Cleaning, and Splitting Data

Let us begin by loading in the APOGEE DR17 data, which was originally downloaded from SDSS (or direct download link here). A detailed data dictionary can be found here.

Show Code
# Load data
data_orig <- load_astro_data(data_dir = DATA_PATH)

This data contains the chemical abundances from 19 different elements as well as metadata about the quality control and star properties. Here, we have also supplemented the data with the APOGEE DR17 value-added catalog of globular clusters (GCs) (Schiavon et al. 2024). GCs are sites of dense star formation, comprised of \(10^5 - 10^7\) stars that were roughly formed around the same time. GCs hence provide a convenient starting point for understanding and assessing the chemical origins of stars in the Milky Way. Nonetheless, this is only a starting point — we note that many stars in the APOGEE survey have been assigned to GCs based on proximity while there is also the broader open question of discovering deeper shared origins of possibly distant bodies.

We next perform some basic quality control filtering. Specifically, following previous literature (Pagnini et al. 2025), we subset the stars to those that meet the criteria below:

  • Signal-to-noise (S/N) ratio threshold greater than 70
  • Effective temperature \(T_{eff}\) ranges between 3500 and 5500
  • Logarithm of the surface gravity \(\log g\) less than 3.6
  • GC membership probability greater than 0.9
  • STARFLAG parameter (indicating poor quality measurements) is set to 0

This quality control filtering resulted in a sample of 3,286 observations (or stars). We will investigate alternative, but equally-reasonable quality control filtering choices in a future section.

Show Code
# QC filtering choices
SNR_THRESHOLD <- 70
MIN_TEFF_THRESHOLD <- 3500
MAX_TEFF_THRESHOLD <- 5500
LOGG_THRESHOLD <- 3.6
STARFLAG <- 0
VB_THRESHOLD <- 0.9

# Apply QC filter
data_qc_filtered <- apply_qc_filtering(
  data_orig,
  snr_threshold = SNR_THRESHOLD,
  teff_thresholds = c(MIN_TEFF_THRESHOLD, MAX_TEFF_THRESHOLD),
  logg_threshold = LOGG_THRESHOLD,
  starflag = STARFLAG,
  vb_threshold = VB_THRESHOLD
)

Below, we provide a quick overview of the data after quality control filtering.

Show Code
abundance_df <- data_qc_filtered$abundance
skimr::skim(abundance_df)
Data summary
Name abundance_df
Number of rows 3286
Number of columns 19
_______________________
Column type frequency:
numeric 19
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
C_FE 4 1.00 -0.17 0.31 -1.48 -0.39 -0.16 0.04 0.85 ▁▂▇▆▁
CI_FE 50 0.98 -0.07 0.35 -1.44 -0.29 -0.07 0.14 0.97 ▁▂▇▆▁
N_FE 3 1.00 0.61 0.47 -0.55 0.27 0.67 0.94 1.96 ▂▅▇▅▁
O_FE 18 0.99 0.28 0.20 -0.57 0.18 0.30 0.40 0.86 ▁▂▆▇▁
NA_FE 729 0.78 0.10 0.51 -1.61 -0.19 0.14 0.43 2.33 ▁▅▇▁▁
MG_FE 2 1.00 0.22 0.17 -0.50 0.15 0.25 0.33 0.81 ▁▂▇▇▁
AL_FE 20 0.99 0.25 0.42 -0.67 -0.12 0.24 0.53 1.38 ▃▆▇▃▂
SI_FE 2 1.00 0.26 0.09 -0.30 0.21 0.26 0.31 0.58 ▁▁▅▇▁
S_FE 106 0.97 0.40 0.23 -0.74 0.30 0.41 0.53 1.12 ▁▁▆▇▁
K_FE 189 0.94 0.24 0.28 -1.02 0.14 0.27 0.38 1.31 ▁▁▇▃▁
CA_FE 68 0.98 0.23 0.18 -0.76 0.16 0.24 0.32 0.90 ▁▁▇▇▁
TI_FE 302 0.91 0.03 0.20 -0.75 -0.08 0.03 0.14 0.96 ▁▃▇▁▁
TIII_FE 208 0.94 0.19 0.25 -0.78 0.08 0.22 0.33 1.02 ▁▁▇▅▁
V_FE 467 0.86 0.09 0.41 -1.32 -0.15 0.10 0.34 1.62 ▁▃▇▂▁
CR_FE 192 0.94 0.02 0.37 -1.07 -0.20 0.01 0.22 1.45 ▁▆▇▂▁
MN_FE 508 0.85 -0.28 0.21 -1.01 -0.41 -0.27 -0.17 0.81 ▁▇▇▁▁
FE_H 2 1.00 -1.44 0.39 -2.42 -1.72 -1.50 -1.16 0.33 ▁▇▃▁▁
CO_FE 286 0.91 -0.08 0.44 -1.42 -0.42 -0.07 0.17 1.52 ▁▅▇▂▁
NI_FE 60 0.98 0.00 0.11 -0.70 -0.06 0.00 0.06 0.62 ▁▁▇▁▁
Show Code
metadata_df <- data_qc_filtered$meta |> 
  dplyr::mutate(
    dplyr::across(where(is.character), as.factor)
  )
skimr::skim(metadata_df)
Data summary
Name metadata_df
Number of rows 3286
Number of columns 15
_______________________
Column type frequency:
factor 6
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
X_H 0 1.00 FALSE 3286 (-0: 1, (-0: 1, (-0: 1, (-0: 1
GC_NAME 0 1.00 FALSE 57 NGC: 1221, NGC: 233, NGC: 228, NGC: 171
APSTAR_ID 0 1.00 FALSE 3286 apo: 1, apo: 1, apo: 1, apo: 1
TELESCOPE 0 1.00 FALSE 2 lco: 2829, apo: 457
TARGFLAGS 146 0.96 FALSE 68 APO: 1141, APO: 959, APO: 224, APO: 171
APOGEE_ID 0 1.00 FALSE 3286 2M0: 1, 2M0: 1, 2M0: 1, 2M0: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
GLON 0 1 231.41 130.85 0.00 59.20 308.90 309.17 357.63 ▃▁▁▁▇
GLAT 0 1 4.10 28.34 -89.47 -11.25 14.84 15.09 79.90 ▁▃▅▇▁
LOCATION_ID 0 1 5438.53 766.57 2011.00 5293.00 5298.00 5644.00 7254.00 ▁▁▁▇▁
STARFLAG 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
VHELIO_AVG 0 1 104.38 158.94 -257.40 -23.62 110.43 231.11 501.77 ▃▅▃▇▁
TEFF_SPEC 0 1 4571.09 303.00 3502.70 4382.40 4621.60 4783.35 5445.70 ▁▃▇▇▁
LOGG_SPEC 0 1 1.56 0.56 -0.19 1.16 1.59 1.98 3.20 ▁▅▇▆▁
VB_PROB 0 1 1.00 0.00 0.90 1.00 1.00 1.00 1.00 ▁▁▁▁▇
SNREV 0 1 209.99 143.20 70.03 119.26 162.09 249.67 1310.13 ▇▁▁▁▁
Show Code
data_qc_filtered$meta |> 
  dplyr::group_by(GC_NAME) |> 
  dplyr::summarise(
    n = dplyr::n(),
    .groups = "drop"
  ) |> 
  dplyr::arrange(dplyr::desc(n)) |>
  vthemes::pretty_DT()

Before proceeding any further, we randomly split the data into training and test using an 80-20% split. This led to a training set of 2,628 observations and a test set of 658 observations.

Show Code
# Split data into training and test sets
TRAIN_PROP <- 0.8

train_idxs <- sample(
  1:nrow(data_qc_filtered$abundance), 
  size = round(TRAIN_PROP * nrow(data_qc_filtered$abundance)),
  replace = FALSE
)
train_data <- data_qc_filtered$abundance[train_idxs, ]
test_data <- data_qc_filtered$abundance[-train_idxs, ]
train_metadata <- data_qc_filtered$meta[train_idxs, ]
test_metadata <- data_qc_filtered$meta[-train_idxs, ]

We next impute missing abundance values in the data and standardize each elemental abundance feature to have mean 0 and standard deviation 1. Given that there are various ways to perform missing data imputation, we implement two different imputation methods: (1) mean imputation and (2) random forest imputation using MissForest (Stekhoven and Bühlmann 2012).

Show Code
# impute missing values and standardize data
data_fpath <- file.path(DATA_PATH, "astro_cleaned_data.RData")
if (!file.exists(data_fpath)) {  # if not already cached
  metadata <- list(
    train = train_metadata,
    test = test_metadata
  )
  
  data_mean_imputed <- clean_astro_data(
    train_data = train_data,
    test_data = test_data,
    impute_mode = "mean"
  )
  
  data_rf_imputed <- clean_astro_data(
    train_data = train_data,
    test_data = test_data,
    impute_mode = "rf"
  )
  
  save(
    metadata, data_mean_imputed, data_rf_imputed, 
    file = file.path(DATA_PATH, "astro_cleaned_data.RData")
  )
} else {  # read from cache
  load(data_fpath)
}

References

Pagnini, Giulia, Paola Di Matteo, Misha Haywood, Alessandra Mastrobuono-Battisti, Florent Renaud, Maëlie Mondelin, Oscar Agertz, et al. 2025. “Abundance Ties: Nephele and the Globular Cluster Population Accreted with \(\omega\) Cen-Based on APOGEE DR17 and Gaia EDR3.” Astronomy & Astrophysics 693: A155.
Schiavon, Ricardo P, Siân G Phillips, Natalie Myers, Danny Horta, Dante Minniti, Carlos Allende Prieto, Borja Anguiano, et al. 2024. “The APOGEE Value-Added Catalogue of Galactic Globular Cluster Stars.” Monthly Notices of the Royal Astronomical Society 528 (2): 1393–1407.
Stekhoven, Daniel J, and Peter Bühlmann. 2012. “MissForest—Non-Parametric Missing Value Imputation for Mixed-Type Data.” Bioinformatics 28 (1): 112–18.