Finding Common Origins of Milky Way Stars

Authors

Andersen Chang\(^*\), Tiffany M. Tang\(^*\), Tarek M. Zikry\(^*\), Genevera I. Allen

Published

June 4, 2025

Introduction

The emergence of large spectroscopic surveys of the Milky Way has led to significant interest in studying the chemical history of the Galaxy’s formation. In particular, researchers are often interested in understanding how stars formed and evolved chemodynamically over long timescales. Previous work has shown that parent molecular clouds can produce hundreds of stars in a single burst (Krumholz et al. 2014), but because galactic dynamics disperse these stars over time, their shared origins are challenging to find.

In this case study, we leverage the Apache Point Observatory Galactic Evolution Experiment (APOGEE) DR17 (Prieto et al. 2008), a large high-resolution spectroscopic survey of stars in the Galactic disk (i.e., the region containing most of the Milky Way’s stellar mass), to identify groups of stars with similar chemical properties and to gain insight into the shared origins of stars that formed together.

Outline

In what follows, we walk through an unsupervised machine learning workflow for scientific discovery, with the goal of discovering common origins of stars in the Milky Way. We proceed through the following steps:

  • Data Preparation and Cleaning: We begin by loading in the data, performing some basic quality control filtering and cleaning, and splitting the data into training and test sets.
  • Exploratory Data Analysis: We then conduct a brief exploratory data analysis to better understand various characteristics of the data.
  • Dimension Reduction: We further implement various dimension reduction techniques (and tune their hyperparameters) to both visualize the data in a lower-dimensional space and to prepare for clustering.
  • Clustering (training): Next, we fit various clustering techniques on multiple versions of the training data (e.g., using different ways of preparing the data) and perform model selection and hyperparameter tuning based upon the stability of the resulting clusters.
  • Clustering (validation): We then validate the clustering results on the test set (e.g., via generalizability metrics and the stability of the clusters across alternative data preprocessing pipelines).
  • Interpretation of Clustering Results: Finally, we re-fit the best clustering pipeline on the full data and interpret the final clusters in the context of the scientific question at hand.

References

Krumholz, Mark R, Matthew R Bate, Hector G Arce, James E Dale, Robert Gutermuth, Richard I Klein, Zhi-Yun Li, Fumitaka Nakamura, and Qizhou Zhang. 2014. “Star Cluster Formation and Feedback.” arXiv Preprint arXiv:1401.2473.
Prieto, C Allende, SR Majewski, R Schiavon, K Cunha, P Frinchaboy, J Holtzman, K Johnston, et al. 2008. “APOGEE: The Apache Point Observatory Galactic Evolution Experiment.” Astronomische Nachrichten: Astronomical Notes 329 (9-10): 1018–21.