Obtaining Probabilistic GC Catalogs

Last updated: 2024-07-06

Checks: 7 0

Knit directory: MATHPOP/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20240702)

The command set.seed(20240702) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: eda1cf3

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version eda1cf3. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    analysis/.DS_Store
    Ignored:    data/.DS_Store
    Ignored:    data/GC_prob/.DS_Store
    Ignored:    data/point_source_data/.DS_Store
    Ignored:    data/prob_GC_data/.DS_Store
    Ignored:    data/sim/.DS_Store
    Ignored:    data/v10acs/.DS_Store
    Ignored:    data/v10wfc3/.DS_Store
    Ignored:    data/v11acs/.DS_Store
    Ignored:    data/v11wfc3/.DS_Store
    Ignored:    data/v12acs/.DS_Store
    Ignored:    data/v12wfc3/.DS_Store
    Ignored:    data/v13acs/.DS_Store
    Ignored:    data/v13wfc3/.DS_Store
    Ignored:    data/v14acs/.DS_Store
    Ignored:    data/v14wfc3/.DS_Store
    Ignored:    data/v15acs/.DS_Store
    Ignored:    data/v15wfc3/.DS_Store
    Ignored:    data/v6acs/.DS_Store
    Ignored:    data/v6wfc3/.DS_Store
    Ignored:    data/v7acs/.DS_Store
    Ignored:    data/v8acs/.DS_Store
    Ignored:    data/v9acs/.DS_Store
    Ignored:    data/v9wfc3/.DS_Store

Untracked files:
    Untracked:  GC_prob-tikzDictionary
    Untracked:  analysis/GC_prob-tikzDictionary
    Untracked:  analysis/data_process-tikzDictionary
    Untracked:  analysis/method_compare-tikzDictionary
    Untracked:  analysis/posterior-tikzDictionary
    Untracked:  analysis/vignette-tikzDictionary
    Untracked:  data_process-tikzDictionary
    Untracked:  method_compare-tikzDictionary
    Untracked:  posterior-tikzDictionary
    Untracked:  vignette-tikzDictionary

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/data_process.Rmd) and HTML (docs/data_process.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	0b7fd29	david.li	2024-07-06	wflow_git_commit(all = T)
html	0a9d403	david.li	2024-07-05	Build site.
html	c1fd386	david.li	2024-07-04	Build site.
Rmd	33fc566	david.li	2024-07-04	wflow_git_commit(all = T)
html	33fc566	david.li	2024-07-04	wflow_git_commit(all = T)
html	6ff1ccc	david.li	2024-07-04	Build site.
html	65a7d8c	david.li	2024-07-04	Build site.
Rmd	98834e5	david.li	2024-07-04	wflow_publish(c("analysis/_site.yml", "analysis/posterior.Rmd",

Introduction

This vignette illustrates the methods used to obtain the probabilistic catalogs obtained using point sources from DOLPHOT and SExtractor. It also reproduces Figure 1 and Figure 8 in the original paper of Li et al. (2024)

Read in and Pre-Process Point Source Data

Read in the point source data obtained by DOLPHOT and SExtractor:

# Load packages
library(tidyverse)
library(tikzDevice)
library(mixtools)
library(Rcpp)

sourceCpp('code/cpp_help_func.cpp')
source('code/help_functions.R')

# Read in DOLPHOT point source data
fnames <- list.files('data/point_source_data/')
ps_harris_fnames <- fnames[!grepl('Jans', fnames)]
dat_harris <- data.frame()
for (i in ps_harris_fnames) {
  df <- read_csv(paste0('data/point_source_data/',i))
  df <- bind_cols(df, field = rep(i, nrow(df)))
  dat_harris <- bind_rows(dat_harris, df)
}

# Read in SExtractor point source data
ps_Jans_fnames <- fnames[grepl('Jans', fnames)]
dat_Jans <- read_csv(paste0('data/point_source_data/', ps_Jans_fnames))

We then carry out pre-processing of the point source data. For the DOLPHOT data, we remove point sources with color \(\mathrm{F475W} - \mathrm{F814W} < 0.0~\mathrm{mag}\) and \(\mathrm{F814W} < 22.0~\mathrm{mag}\). Pre-processing of SExtractor data is not needed as the method to obtain the probabilistic GC catalog under SExtractor point source is different.

CMD_harris <- mutate(dat_harris, C = F475W - F814W, M = F814W) %>%
  dplyr::select(x, y, RA, DEC, C, M, F814W, F475W, field) %>%
  filter(M > 22 & C > 0)

Obtaining the Probability of a Source being GC

DOLPHOT

For the DOLPHOT point source, the finite-mixture method to obtain the probabilistic GC catalog is illustrated here. We have carried out this method 500 times where each time the color and magnitude are jittered with their measurement uncertainties. As it is impossible to run and save such a big computational task remotely, we have saved the locally-run results in the file DOL_GC_prob_mvnpEM_500_iter.RDS under the data/GC_prob/ folder of the Github repo. Read in this data and process it:

# read in results from 500 iteration of the finite-mixture method
GC_prob_ls_Harris <- readRDS('data/GC_prob/DOL_GC_prob_mvnpEM_500_iter.RDS')

# extract the probability a source is a GC
GC_probability_Harris <- do.call(cbind,lapply(GC_prob_ls_Harris, function(x) x$posterior[, which.max(x$lambdahat)]))

# name the column
colnames(GC_probability_Harris) <- c(paste0('p', 1:500))

# add it to the previous point source data.frame
CMD_harris <- bind_cols(CMD_harris, GC_probability_Harris)

# obtain summary statistics of the probability from the 500 iterations
CMD_harris$p <- rowMeans(GC_probability_Harris)
CMD_harris$p_sd <- apply(GC_probability_Harris, 1, sd)

After obtaining the point source data with all 500 probabilities, we save them into individual files based on their imaging field ID for later inference.

# Only run once
for (i in ps_harris_fnames) {
  write_csv(filter(CMD_harris, field == i), paste0('data/prob_GC_data/',unlist(strsplit(i, split = '.', fixed = T))[1],'_pGC.csv'))
}

SExtractor

For the SExtractor data, a different parametric finite-mixture model was used. For a detailed explanation of the method, see Appendix B.2 of the original Li et al. (2024) paper. We have written a function called mix_func() in the code/help_functions.R to fit the parametric finite-mixture model.

Similar to DOLPHOT data, we apply our proposed parametric finite-mixture model to the measurement uncertainty jittered color-magnitude data for 500 times, this is carried out by the function meas_uncertain_mix_func():

# This may take a while, but still much faster than the previous non-parametric finite mixture model used for DOLPHOT
Jans_res <- meas_uncertain_mix_func(dat = dat_Jans, n_iter = 500)

Get the results from the previous run, and attach it to the SExtractor point source data.frame.

# get the probabilities
p_mat <- Jans_res$prob

# summary statistics for the probability a point source is a GC
dat_Jans$p <- rowMeans(Jans_res$prob)
dat_Jans$p_sd <- apply(Jans_res$prob, 1, sd)

# restructure the data frame for later inference
dat_Jans_save <- dat_Jans %>%
  dplyr::select(x, y, RA, DEC, C, F814W, field) %>%
  mutate(M = F814W) %>%
  dplyr::select(x, y, RA, DEC, C, M, field)

colnames(Jans_res$prob) <- c(paste0('p', 1:500))
Jans_res$prob <- data.frame(Jans_res$prob)

dat_Jans_save <- bind_cols(dat_Jans_save, p_mat)

Again, we save the probabilistic GC catalog from SExtractor into individual files based on their imaging field ID for later inference.

# save the probabilistic GC catalog into individual files (only run once)
for (i in unique(dat_Jans_save$field)) {
  write_csv(filter(dat_Jans_save, field == i), paste0('data/prob_GC_data/', i,'_pGC_Jans.csv'))
}

Probabilities of Sources being GCs

Now we can plot the obtained data and regenerate Figure 1 in the original Li et al. (2024) paper. Note that due to Latex memory constraints, not all point sources can be plotted in the figure, so we only plotted half of the sources.

# combine the two point source data in one data frame
frac_CMD_harris <- CMD_harris %>%
  dplyr::select(C, M, p, p_sd) %>%
  mutate(data = 'DOLPHOT')

frac_CMD_Jans <- dat_Jans %>%
  dplyr::select(C, F814W, p, p_sd) %>%
  mutate(data = 'SExtractor (J24)', M = F814W) %>%
  dplyr::select(C, M, p, p_sd, data)

# select half of the sources to be plotted
set.seed(123456)
frac_CMD <- bind_rows(frac_CMD_harris[sample(1:nrow(frac_CMD_harris), 0.5*nrow(frac_CMD_harris), replace = F),],
                      frac_CMD_Jans[sample(1:nrow(frac_CMD_Jans), 0.5*nrow(frac_CMD_Jans), replace = F),])

ggplot(frac_CMD, aes(C, M, color = p)) + geom_point(aes(size = p_sd*2.5)) +
  scale_size_identity() +
  scale_y_reverse() + theme_minimal() +
  xlab('F475W - F814W') + ylab('F814W') +
  scale_colour_viridis_c(name = 'p(GC)') + 
  geom_segment(data = data.frame(xmin = c(0.8, 0.8, 2.4, 0.8), 
                                               xmax = c(2.4, 0.8, 2.4, 2.4), 
                                               ymin = c(26.3, 22, 22, 22), 
                                               ymax = c(26.3, 26.3, 26.3, 22),
                                 data = 'SExtractor (J24)'), 
           aes(x = xmin, xend = xmax, y = ymin, yend = ymax), colour = 'red') +
  geom_segment(data = data.frame(xmin = c(1, 1, 2.4, 1), 
                                 xmax = c(2.4, 1, 2.4, 2.4), 
                                 ymin = c(26.5, 22, 22, 22), 
                                 ymax = c(26.5, 26.5, 26.5, 22),
                                 data = 'DOLPHOT'), 
               aes(x = xmin, xend = xmax, y = ymin, yend = ymax), colour = 'red') +
  facet_wrap(.~data)

Version	Author	Date
c1fd386	david.li	2024-07-04
33fc566	david.li	2024-07-04
65a7d8c	david.li	2024-07-04

Fitted Results of Parametric Finite-Mixture Model for SExtractor Data

We can now also plot the results for the fitted parametric finite-mixture model for SExtractor point source data. These are effectively Figure 8(a) and 8(b) in the paper. First, get the estimated parameters from the fitted results:

# average of the fitted mixture model model parameters
w <- mean(Jans_res$par[,1])
wr <- mean(Jans_res$par[,2])
GCLF_TO <- 26.3 # canonical GCLF TO
GCLF_sig <- mean(Jans_res$par[,3])
GC_color_mu_r <- mean(Jans_res$par[,4])
GC_color_sig_r <- mean(Jans_res$par[,5])
GC_color_mu_b <- mean(Jans_res$par[,6])
GC_color_sig_b <- mean(Jans_res$par[,7])
mu <- mean(Jans_res$par[,8])
sigma <- mean(Jans_res$par[,9])
Phi <- Phi_f_cpp(26.69, GCLF_TO, GCLF_sig, a = 6.56)

Generate Figure 8a in the original paper of Li et al. (2024):

ggplot(Jans_res$sim, aes(F814W)) + geom_histogram(aes(y = after_stat(count)/sum(count)/.05253/2),breaks = seq(22, 27.2, length = 50), alpha = 0.5) +
  geom_function(fun = function(x) w*dnorm(x, GCLF_TO, GCLF_sig)*f_cpp(x, 26.69, 6.56)/Phi + (1-w)*dnorm(x, mu, sigma)) +
  geom_function(fun = function(x) w*dnorm(x, GCLF_TO, GCLF_sig)*f_cpp(x, 26.69, 6.56)/Phi, color = "#EBCC2A") +
  geom_function(fun = function(x) (1-w)*dnorm(x, mu, sigma), color = 'purple') + ylab('Density') + theme_minimal()

Version	Author	Date
33fc566	david.li	2024-07-04
65a7d8c	david.li	2024-07-04

Generate Figure 8b in the original paper of Li et al. (2024):

ggplot(Jans_res$sim, aes(C)) + geom_histogram(aes(y = after_stat(count)/sum(count)/0.032),breaks = seq(0.8, 2.4, length = 50), alpha = 0.5) +
  geom_function(fun = function(x) w*(wr*dnorm(x, GC_color_mu_r, GC_color_sig_r) + (1-wr)*dnorm(x, GC_color_mu_b, GC_color_sig_b)) + (1-w)*dunif(x, 0.8, 2.4)) +
  geom_function(fun = function(x) w*(wr*dnorm(x, GC_color_mu_r, GC_color_sig_r) + (1-wr)*dnorm(x, GC_color_mu_b, GC_color_sig_b)), color = "#EBCC2A") +
  geom_function(fun = function(x) w*wr*dnorm(x, GC_color_mu_r, GC_color_sig_r), color = '#3B9AB2') +
  geom_function(fun = function(x) w*(1-wr)*dnorm(x, GC_color_mu_b, GC_color_sig_b), color = "#FF0000") +
  geom_function(fun = function(x) (1-w)*dunif(x, 0.8, 2.4), color = 'purple') + ylab('Density') + theme_minimal() + xlab('F475W - F814W')

Version	Author	Date
33fc566	david.li	2024-07-04
65a7d8c	david.li	2024-07-04

sessionInfo()

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Toronto
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Rcpp_1.0.11       mixtools_2.0.0    tikzDevice_0.12.5 lubridate_1.9.3  
 [5] forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4       purrr_1.0.2      
 [9] readr_2.1.4       tidyr_1.3.0       tibble_3.2.1      ggplot2_3.4.4    
[13] tidyverse_2.0.0   workflowr_1.7.1  

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.0         filehash_2.4-5           viridisLite_0.4.2       
 [4] farver_2.1.1             fastmap_1.2.0            lazyeval_0.2.2          
 [7] promises_1.2.1           digest_0.6.36            timechange_0.2.0        
[10] lifecycle_1.0.4          qpdf_1.3.3               survival_3.5-7          
[13] processx_3.8.2           magrittr_2.0.3           kernlab_0.9-32          
[16] compiler_4.3.2           rlang_1.1.4              sass_0.4.7              
[19] tools_4.3.2              utf8_1.2.4               yaml_2.3.7              
[22] data.table_1.14.8        knitr_1.45               askpass_1.2.0           
[25] labeling_0.4.3           htmlwidgets_1.6.2        bit_4.0.5               
[28] klippy_0.0.0.9500        withr_2.5.2              grid_4.3.2              
[31] fansi_1.0.6              git2r_0.33.0             colorspace_2.1-0        
[34] scales_1.3.0             MASS_7.3-60              tinytex_0.48            
[37] cli_3.6.1                rmarkdown_2.25           crayon_1.5.2            
[40] generics_0.1.3           rstudioapi_0.15.0        httr_1.4.7              
[43] tzdb_0.4.0               RcppArmadillo_0.12.6.6.0 cachem_1.0.8            
[46] splines_4.3.2            assertthat_0.2.1         parallel_4.3.2          
[49] vctrs_0.6.5              Matrix_1.6-3             jsonlite_1.8.7          
[52] callr_3.7.3              hms_1.1.3                bit64_4.0.5             
[55] magick_2.8.3             plotly_4.10.3            jquerylib_0.1.4         
[58] glue_1.6.2               ps_1.7.5                 stringi_1.8.4           
[61] gtable_0.3.4             later_1.3.1              munsell_0.5.0           
[64] pillar_1.9.0             htmltools_0.5.8.1        R6_2.5.1                
[67] rprojroot_2.0.4          vroom_1.6.4              evaluate_0.23           
[70] lattice_0.22-5           highr_0.10               segmented_2.0-3         
[73] httpuv_1.6.12            bslib_0.5.1              nlme_3.1-163            
[76] whisker_0.4.1            xfun_0.41                fs_1.6.3                
[79] getPass_0.2-4            pkgconfig_2.0.3          pdftools_3.4.0