Last updated: 2021-12-28

Checks: 7 0

Knit directory: ChromatinSplicingQTLs/analysis/

This reproducible R Markdown analysis was created with workflowr (version 1.6.2). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20191126) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 89517ff. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    ._.DS_Store
    Ignored:    analysis/.Rhistory
    Ignored:    code/.DS_Store
    Ignored:    code/.Rhistory
    Ignored:    code/._.DS_Store
    Ignored:    code/._README.md
    Ignored:    code/._dag.pdf
    Ignored:    code/._report.html
    Ignored:    code/.ipynb_checkpoints/
    Ignored:    code/.snakemake/
    Ignored:    code/Alignments/
    Ignored:    code/ENCODE/
    Ignored:    code/ExpressionAnalysis/
    Ignored:    code/Fastq/
    Ignored:    code/FastqFastp/
    Ignored:    code/FastqFastpSE/
    Ignored:    code/FastqSE/
    Ignored:    code/Gather_gwas_summary_stats/
    Ignored:    code/Genotypes/
    Ignored:    code/Multiqc/
    Ignored:    code/Multiqc_chRNA/
    Ignored:    code/PeakCalling/
    Ignored:    code/Phenotypes/
    Ignored:    code/PlotGruberQTLs/
    Ignored:    code/ProCapAnalysis/
    Ignored:    code/QC/
    Ignored:    code/QTLs/
    Ignored:    code/ReferenceGenome/
    Ignored:    code/Session.vim
    Ignored:    code/SplicingAnalysis/
    Ignored:    code/TODO
    Ignored:    code/bigwigs/
    Ignored:    code/bigwigs_FromNonWASPFilteredReads/
    Ignored:    code/config/.DS_Store
    Ignored:    code/config/._.DS_Store
    Ignored:    code/config/ExternalFastqDataAccessions/
    Ignored:    code/config/OldSamplesConfig/
    Ignored:    code/dag.pdf
    Ignored:    code/featureCounts/
    Ignored:    code/gwas_summary_stats/
    Ignored:    code/hyprcoloc/
    Ignored:    code/logs/
    Ignored:    code/notebooks/.ipynb_checkpoints/
    Ignored:    code/out.hap.ld
    Ignored:    code/out.log
    Ignored:    code/report.html
    Ignored:    code/rules/OldRules/
    Ignored:    code/scratch/
    Ignored:    code/scripts/GTFtools_0.8.0/
    Ignored:    code/scripts/__pycache__/
    Ignored:    code/scripts/liftOverBedpe/liftOverBedpe.py
    Ignored:    code/snakemake.log
    Ignored:    code/snakemake.sbatch.log
    Ignored:    data/._PRJEB1350_RunTable.Ding_etal_CTCF.txt
    Ignored:    data/._igsr_samples.tsv
    Ignored:    data/._list_gwas_summary_statistics_PMID27863252.csv
    Ignored:    data/GrowthNotes/._20210830_GrowthNotes_chRNA.ConcentrationsToReplate.txt

Untracked files:
    Untracked:  code/snakemake_profiles/slurm/__pycache__/

Unstaged changes:
    Modified:   code/scripts/GenometracksByGenotype
    Modified:   output/hyprcoloc_results/ForColoc/hyprcoloc.results.txt.gz

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/20200210_PickSamplesToOrder.Rmd) and HTML (docs/20200210_PickSamplesToOrder.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 89517ff Benjmain Fair 2021-12-22 added genewise coloc rmd
Rmd d781076 Benjmain Fair 2021-06-15 added to strains picking rmd
Rmd cf3d158 Benjmain Fair 2021-04-12 added samples to order
Rmd 304e9b7 Benjmain Fair 2021-02-11 added data about picking samples

motivation

So far, Austin and I have performed cell fractionation on ~50 YRI LCL lines. We have had dissappointing success at waking up some of the other YRI LCL lines that Gilad lab has frozen stocks of. So we will order most of the remaining YRI LCL lines that are available from Coriell cell repository. Coriell has 31 unrelated individuals, 60 trios, and 9 duos. For this project, we want all of our samples to be unrelated, so I can pick one sample from each trio or duo. Ideally, I will order the child for each duo or trio since it probably has the most accurate phasing information in the published HapMap genotypes. I also want to exclude ordering samples from families that we have already done from Gilad lab stocks.

Here I will read in data and use some R set functions to help me identify the cell lines that I should order for this experiment…

library(tidyverse)
library(readxl)
library(knitr)

#Downloaded from Coriell
CoriellLines <- read_delim("../data/GiladStrains/Coriell-Catalog-Export-02-10-2021.csv", delim = ",") %>%
  mutate(RowNumber = as.character(1:nrow(.))) %>% #When family is blank, replace it with new identifier
  mutate(Family=case_when(
    is.na(Family) ~ RowNumber,
    TRUE ~ Family
  ))

#Downloaded from 1000G
FamilyRelationships <- read_delim("../data/GiladStrains/1000G_Pedigree.txt", delim='\t')

LCLs.1000Genomes.vec <- read.table("../data/GiladStrains/YRI.list.from1000Genomes.txt", sep='\t') %>% pull(V1) %>%
  gsub("^NA", "", ., perl=T) %>% as.numeric()

GiladLabYRI.Rack <- read_delim("../data/GiladStrains/LCL_YRI.Gilad.Positions2.txt", delim = '\t') %>%
  mutate(Line=as.character(Line))

GiladLabYRIRack.Lines <- GiladLabYRI.Rack %>% pull(Line) %>% unique() %>% as.numeric()

GrowthNotes <- read_delim("../data/GrowthNotes/20200210_GrowthNotes.tsv", delim = '\t')

SamplesAlreadyDone <- GrowthNotes %>%
  filter(!is.na(`Date of chRNA-seq library Prep`)) %>%
  pull(`Cell Line`) %>% unique()

What is the sample that we have already done twice?

GrowthNotes %>%
  filter(!is.na(`Date of chRNA-seq library Prep`)) %>%
  pull(`Cell Line`) %>% table()
.
18486 18497 18499 18505 18508 18511 18519 18520 18522 18852 18855 18856 18858 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
18870 18907 18909 18912 18913 19092 19093 19099 19101 19102 19114 19119 19127 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
19128 19130 19131 19137 19138 19140 19141 19152 19153 19160 19171 19190 19200 
    2     1     1     1     1     1     1     1     1     1     1     1     1 
19201 19207 19209 19210 19223 19225 19238 19239 19257 
    1     1     1     1     1     1     1     1     1 

19128 has two “good” samples. I should let Austin now. That would bring the total to 48 samples, which is more convenient for multichannelling anyway.

Find these lines in the hapmap project. Make sure not to order additional lines from those families.

This block of code is now irrelevant, now that I see Coriell has family info as well.

#Check that all 48 samples we have already done are represented in this family info. How many are the children?
FamilyRelationships %>%
  filter(Population == "YRI") %>%
  pull(Relationship) %>% table()
.
               child               father               mother 
                  58                   61                   64 
          not father paternal grandmother 
                   2                    1 
FamilyRelationships %>%
  filter(Population == "YRI") %>%
  count(`Family ID`)
# A tibble: 68 x 2
   `Family ID`     n
   <chr>       <int>
 1 Y001            3
 2 Y002            3
 3 Y003            3
 4 Y004            3
 5 Y005            3
 6 Y006            3
 7 Y007            3
 8 Y009            3
 9 Y010            2
10 Y010a           1
# … with 58 more rows
FamilyRelationships %>%
  filter(Population == "YRI") %>%
  mutate(CellLine = str_replace(`Individual ID`, "NA(\\d{5})", "\\1")) %>%
  mutate(IsAlreadyDone = CellLine %in% SamplesAlreadyDone) %>%
  filter(IsAlreadyDone) %>%
  pull(Relationship) %>% table()
.
 child father mother 
     1     20     27 

Let’s check that we get the same results from Coriell’s family info

CoriellLines %>%
  mutate(CellLine = str_replace(ID, "GM(\\d{5})", "\\1")) %>%
  mutate(IsAlreadyDone = CellLine %in% SamplesAlreadyDone) %>%
  filter(IsAlreadyDone) %>%
  pull(`Relationship to Proband`) %>% table()
.
father mother 
    20     27 

Ok. Close enough.

I wonder why all the ones we have done so far are mother or father. Was Gilad lab deliberately choosing not the proband samples?

CoriellLines %>% nrow()
[1] 120
CoriellLines %>%
  pull(`Relationship to Proband`) %>% table()
.
  child  father  mother proband 
      2      56      57       5 
#How many singles, duos, and trios do I count?
CoriellLines %>%
  count(Family) %>%
  pull(n) %>% table()
.
 1  2  3 
10 52  2 

Ok, it seems that Coriell’s claim of 60 trios, 9 duos, and 31 unrelated individuals isn’t adding up.

Let’s grab the list of all lines available from Coriell that found when I click the trios link.

CoriellLines_trios <- read_delim("../data/GiladStrains/Coriell-Catalog-Export-02-10-2021-Trios.csv", delim = ",") %>%
  mutate(RowNumber = as.character(1:nrow(.))) %>% #When family is blank, replace it with new identifier
  mutate(Family=case_when(
    is.na(Family) ~ RowNumber,
    TRUE ~ Family
  ))

CoriellLines_trios %>% nrow()
[1] 180
CoriellLines_trios %>% count(Family) %>% 
  pull(n) %>% table()
.
 3 
60 
CoriellLines_trios %>%
  filter(ID %in% CoriellLines$ID) %>% nrow()
[1] 113

Ok, it seems to me that the original link I clicked with 120 samples is a subset of the total available samples. I can probably see the full list of available samples by combining the trios, duos, and unrelated individuals.

CoriellLines_trios <- read_delim("../data/GiladStrains/Coriell-Catalog-Export-02-10-2021-Trios.csv", delim = ",")
CoriellLines_duos <- read_delim("../data/GiladStrains/Coriell-Catalog-Export-02-10-2021-Duos.csv", delim = ",")
CoriellLines_singles <- read_delim("../data/GiladStrains/Coriell-Catalog-Export-02-10-2021-Singles.csv", delim = ",")

AllCoriellSamples <- bind_rows(CoriellLines_singles, CoriellLines_duos, CoriellLines_trios, .id = "Set") %>%
  mutate(CellLine = str_replace(ID, "GM(\\d{5})", "\\1")) %>%
  mutate(IsAlreadyDone = CellLine %in% SamplesAlreadyDone)

AllCoriellSamples %>% nrow()
[1] 229
AllCoriellSamples %>% count(Family) %>% 
  pull(n) %>% table()
.
 2  3 31 
 9 60  1 
AllCoriellSamples %>%
  filter(ID %in% CoriellLines$ID) %>% nrow()
[1] 120
AllCoriellSamples %>%
  pull(`Relationship to Proband`) %>% table()
.
  child  father  mother proband 
     69      63      66      31 

Ok, now it is clear that those 120 samples are just a subset of all the ones that Coriell has for sale. Let’s check again about the proband relationship status of the 48 samples we have already done.

AllCoriellSamples %>%
  pull(`Relationship to Proband`) %>% table()
.
  child  father  mother proband 
     69      63      66      31 

Ok, that is interesting that most all the strains we have done so far are father or mother. Maybe that is just because at the time, in the earlier days of HapMap, the parents data were collected first, then as 1000G expanded, the children were collected later. I dunno. Anyway, let’s just go ahead and pick out more unrelated samples. I guess to be consistent with the lines we have already done, maybe we should avoid the children when possible, in case age of donor has some effect on gene expression. For all I know the children are babies. Maybe 1000 genome’s project has this info.

Update: As per 1000G FAQ page, all donors are >=18 years of age.

First let’s check that the sample we have already done are unrelated. Or at least, mother and father so still unrelated.

AllCoriellSamples %>%
  filter(CellLine %in% c(SamplesAlreadyDone)) %>%
  pull(Family) %>% table() %>% table()
.
 1  2 
20 14 
AllCoriellSamples %>%
  filter(CellLine %in% c(SamplesAlreadyDone)) %>%
  dplyr::select(CellLine, `Relationship to Proband`, Family) %>%
  arrange(Family)
# A tibble: 48 x 3
   CellLine `Relationship to Proband` Family
   <chr>    <chr>                     <chr> 
 1 18486    father                    Y001  
 2 18497    child                     Y003  
 3 18499    mother                    Y003  
 4 18505    mother                    Y005  
 5 18508    mother                    Y009  
 6 18511    mother                    Y010  
 7 18858    mother                    Y012  
 8 18519    father                    Y014  
 9 18520    mother                    Y014  
10 18522    father                    Y016  
# … with 38 more rows
AllCoriellSamples %>%
  add_count(Family, name = "FamilySize") %>%
  count(FamilySize, `Relationship to Proband`)
# A tibble: 7 x 3
  FamilySize `Relationship to Proband`     n
       <int> <chr>                     <int>
1          2 child                         9
2          2 father                        3
3          2 mother                        6
4          3 child                        60
5          3 father                       60
6          3 mother                       60
7         31 proband                      31

Hmm.. Ok, so this is making more sense…

So firstly, it seems that the unrelated samples are labelled as ‘proband’ rather than ‘child’ under the ‘relationship to proband’, and those samples have NA as family. These are all probably safe to pick as unrelated individuals for this experiment.

Also, it seems that out of the 48 samples we have already done, 20 are in a unique family, a there are 14 duos that are in the same family. Luckily since it seems that we have mostly picked mothers and fathers, these samples are mostly still genetically unrelated. (Except for a single mother-child relationship in family Y003). This is probably why Gilad lab generally picked fathers and mothers…. That way if someone who wasn’t careful (like me) and just picked seemingly random lines from the freezer for an experiment, they would still probably mostly be genetically unrelated.

So let’s actually not pick children from the trios/duos, and pick parents in as many cases as possible. In the cases of trios, that will actually leave us more unrelated samples to pick from, as we can utilize both parents. Well, maybe that still isn’t the best idea in case shared family environment would violote the independence assumptions of linear modelling that QTL data. So perhaps still best to just pick parents, and only parents from different families. My guess is that the shared environment would have minimual effect (these are LCLs, quite removed from shared familial environments anyway), so it isn’t totally unreasonable to use both parents which would open up a larger potential sample size. But since we want to keep the experiment size manageable, lets still just prioritize one parent per family as the first cell lines to order.

Also, let’s prioritize samples that are 1000G… After emailing Coriell customer service, I now realize that Coriell sells all samples in the union of HapMap and 1000G. The 120 samples that I was confused about before are just the 1000G samples (many of which are also HapMap). Given the choice, I would prefer to prioritize the 1000G samples since they probably have more complete and accurate genotype information (1000G generally used high coverage WholeGenomeSequencing, HapMap generally used SNP arrays). Also important to check, is that we have roughly equal numbers of male and female.

#Fist let's add a column for if a sample is in 1000 Genome project, and a column for if it is a line in Gilad lab freezer
AllCoriellSamples <-
  AllCoriellSamples %>%
  mutate(IsInGiladLabYRI.Inventory = CellLine %in% GiladLabYRIRack.Lines) %>%
  mutate(Is1KGP = CellLine %in% LCLs.1000Genomes.vec)

#Get list of samples that are fair game to do in the experiment, meaning they are either unrelated to the samples we have already done and are not the children in the case of trios. In the case of duos, children are fine, but let's just pick the parent anyway.
AllUnrelatedSamplesLeftToDo <- AllCoriellSamples %>%
  filter(! IsAlreadyDone) %>%
  filter(!`Relationship to Proband` == "child")

nrow(AllUnrelatedSamplesLeftToDo)
[1] 113

Ok, there are a total of unrelated 113 samples we could potentially use.

Now let’s choose the ones to order first. Let’s first use all the Gilad lab lines that we already have of the unrelated samples left to do, then select from the list of other unrelated samples and randomly pick a samples that are in 1000G and from unique families until we have a total batch size of 30. I figure 28 is a nice number since our past batches have been roughly this size.

#Get the lines we can just pick from Gilad lab freezer
GiladLabLinesToPick <- AllCoriellSamples %>%
  filter(CellLine %in% AllUnrelatedSamplesLeftToDo$CellLine) %>%
  filter(IsInGiladLabYRI.Inventory) %>% pull(CellLine)

GiladLabLinesToPick
[1] "18510" "19147" "18853" "18502" "18517" "19193"
NumGiladLinesToPick <- length(GiladLabLinesToPick)

FamaliesAlreadyDoneOrInGiladLinesToPick <- AllCoriellSamples %>%
  filter(CellLine %in% c(SamplesAlreadyDone, GiladLabLinesToPick)) %>%
  pull(Family) %>% na.omit()

#Now get cell lines to order from Coriell
PotentialSamplesToOrder <- AllCoriellSamples %>%
  filter(CellLine %in% AllUnrelatedSamplesLeftToDo$CellLine) %>%
  filter(!CellLine %in% GiladLabLinesToPick) %>%
  filter(!Family %in% FamaliesAlreadyDoneOrInGiladLinesToPick)  %>%
  filter(Is1KGP) %>%
  distinct(Family, .keep_all = T) %>%
  pull(CellLine)

PotentialSamplesToOrder %>% length()
[1] 23

Ok, that critera of requiring samples that we order are - Not already done - Not in Gilad lab freezer - Are in 1000 Genome’s project - Not in a family that we have already done - Only one sample per family

results in a list of 23 samples. If we ordered those 23, we would have a total of 29 samples in this batch. Which is close to the 30 was shooting for… Let’s check that they are roughly sex-balanced.

AllCoriellSamples %>%
  filter(CellLine %in% PotentialSamplesToOrder) %>%
  pull(Sex) %>% table()
.
Female   Male 
    18      5 

Ok, not so much. Let’s relax the family requirements and then randomly pick some, hoping that we come up with a more sex balanced sex. But actually, let’s first check the sex of the samples we have already done:

AllCoriellSamples %>%
  filter(CellLine %in% c(SamplesAlreadyDone, GiladLabLinesToPick)) %>%
  pull(Sex) %>% table()
.
Female   Male 
    31     23 

Ok, not so balanced. let’s relax the family restriction, then randomly select 8 females and 16 males, leaving us with 39M + 39F at the conclusion of this next batch.

set.seed(0)
PotentialSamplesToOrder.Male <- AllCoriellSamples %>%
  filter(CellLine %in% AllUnrelatedSamplesLeftToDo$CellLine) %>%
  filter(!CellLine %in% GiladLabLinesToPick) %>%
  filter(Is1KGP) %>%
  filter(Sex == "Male") %>%
  sample_n(16) %>% pull(CellLine)

set.seed(0)
PotentialSamplesToOrder.Female <- AllCoriellSamples %>%
  filter(CellLine %in% AllUnrelatedSamplesLeftToDo$CellLine) %>%
  filter(!CellLine %in% GiladLabLinesToPick) %>%
  filter(Is1KGP) %>%
  filter(Sex == "Female") %>%
  sample_n(8) %>% pull(CellLine)

SamplesToOrder <- c(PotentialSamplesToOrder.Female, PotentialSamplesToOrder.Male)
SamplesToOrder
 [1] "18873" "19254" "18878" "19196" "19222" "19214" "18924" "19143" "18487"
[10] "19104" "18862" "18879" "18501" "19096" "18507" "18917" "18859" "19150"
[19] "19144" "18868" "18874" "18504" "19184" "19189"

Ok, let’s do a final sanity check before I purchase that these lines do not contain any children (implying no blood relation to other samples), that they do not overlap with any samples we have already done, and that they do not overlap with the 6 samples we can take from Gilad lab freezer…

AllCoriellSamples %>%
  filter(CellLine %in% SamplesToOrder) %>%
  pull(`Relationship to Proband`) %>% table()
.
 father  mother proband 
     14       5       5 
intersect(SamplesToOrder, as.character(SamplesAlreadyDone))
character(0)
intersect(SamplesToOrder, GiladLabLinesToPick)
character(0)
SamplesToOrder
 [1] "18873" "19254" "18878" "19196" "19222" "19214" "18924" "19143" "18487"
[10] "19104" "18862" "18879" "18501" "19096" "18507" "18917" "18859" "19150"
[19] "19144" "18868" "18874" "18504" "19184" "19189"

All looks good. I will put these on the order sheet 2020-02-11. Let’s write them out to an excel that we can email to Claudia.

AllCoriellSamples %>%
  filter(CellLine %in% SamplesToOrder) %>%
  dplyr::select(-c("Set")) %>%
  dplyr::select(ID:Family) %>%
  write_excel_csv("../data/GiladStrains/20200211_CoriellOrder.BF.csv" )

##Update 04/12/2021

Going to order more cell lines, by the same critera.

First out of curiosity, how many cell that we have already done or plan to do are not in 1KG?

AllCoriellSamples %>%
  filter(CellLine %in% c(SamplesAlreadyDone)) %>%
  mutate(In1KG = CellLine %in% LCLs.1000Genomes.vec) %>%
  pull(In1KG) %>% table()
.
TRUE 
  48 
# length(setdiff(as.character(LCLs.1000Genomes.vec), SamplesToOrder))
# length(LCLs.1000Genomes.vec)

Ok it seems ALL the samples we have already done or have recently ordered are 1000 genome’s project. Now let’s get all the samples that we haven’t already done that are in 1KG that we could consider ordereing:

set.seed(0)
AllCoriellSamples %>%
  filter(CellLine %in% AllUnrelatedSamplesLeftToDo$CellLine) %>%
  filter(!CellLine %in% GiladLabLinesToPick) %>%
  filter(Is1KGP) %>%
  filter(!CellLine %in% SamplesToOrder) %>%
  group_by(Sex) %>%
  sample_n(16) %>%
  ungroup() %>%
  dplyr::select(-c("Set")) %>%
  dplyr::select(ID:Family) %>%
  write_excel_csv("../data/GiladStrains/20210411_CoriellOrder2.BF.csv" )

set.seed(0)
SecondCoriellOrder <- AllCoriellSamples %>%
  filter(CellLine %in% AllUnrelatedSamplesLeftToDo$CellLine) %>%
  filter(!CellLine %in% GiladLabLinesToPick) %>%
  filter(Is1KGP) %>%
  filter(!CellLine %in% SamplesToOrder) %>%
  group_by(Sex) %>%
  sample_n(16) %>%
  ungroup() %>%
  pull(ID)

SecondCoriellOrder
 [1] "GM19176" "GM19204" "GM19166" "GM18916" "GM18864" "GM18881" "GM19118"
 [8] "GM19116" "GM19197" "GM19206" "GM19247" "GM18489" "GM19108" "GM19185"
[15] "GM19172" "GM18488" "GM19198" "GM19213" "GM18498" "GM19256" "GM19146"
[22] "GM18516" "GM19121" "GM18865" "GM19226" "GM18871" "GM19098" "GM18923"
[29] "GM19195" "GM18908" "GM18915" "GM18877"
SecondCorellOrderNoGM <- str_remove(SecondCoriellOrder, "GM")

Update 06/15/2021

Ok, the sequencing results from the first batch of ~40 samples are back. Some of them had sample swaps or too low read depth to confidently identify the sample, resulting a total of ~30 usable samples. Let’s plan on attempting to regrow all samples that we have in the gilad lab freezer that were not covered in the last two batches and were not among the 30 samples with already usable data.

#Lines for which qtltools mbv could confidently identify the individual from RNA-seq data
LinesWithQualityData <-
  read_tsv("../data/20210604_chRNA_SampleIDs_FromBamToFix.txt") %>%
  drop_na() %>%
  pull(BestMatch) %>%
  str_replace("NA", "")


SamplesToThawAgain <- AllCoriellSamples %>%
  # filter out related samples (ie not children, since parents are assumed unrelated)
  filter(!`Relationship to Proband` == "child") %>%
  # filter for samples in gilad freezer
  filter(IsInGiladLabYRI.Inventory) %>%
  # filter out first batch of coriell order
  filter(! CellLine %in% SamplesToOrder) %>%
  # filter out the 6 gilad lab lines not yet attempted library prep but done in last batch
  filter(! CellLine %in% GiladLabLinesToPick) %>%
  # filter out the second batch coriell order
  filter( ! CellLine %in%  str_replace(SecondCoriellOrder, "GM", "")) %>%
  # filter out the lines that we already have quality data for
  filter (! CellLine %in% LinesWithQualityData) %>%
  pull(CellLine)

SamplesToThawAgain
[1] "19223" "18870" "19128" "19141" "19207" "19239"

Finally, as per Genevieve’s request, I should identify which of the strains I ordered are completly new to the lab, so others can prioritize those lines for creating new ipsc lines.

#The lines Gilad lad previous had are either in this vector:
length(GiladLabYRIRack.Lines)
[1] 27
#... or in this vector:
OtherLines <- read_tsv("../data/GiladStrains/Human_LCLs.txt") %>%
  rename(CellLine=`Human LCL ID`) %>%
  distinct(CellLine, .keep_all=T) %>%
  pull(CellLine)


#Now get the lines that we ordered that are not in those previous lines
AllLinesOrderedThatAreNewToGiladLab <- data.frame(CellLine = c(str_remove(SecondCoriellOrder, "GM"), SamplesToOrder)) %>%
  filter(!CellLine %in% c(OtherLines, GiladLabYRIRack.Lines)) %>%
  mutate(CellLine = as.character(CellLine)) %>%
  pull(CellLine)

Write out list for Genevieve…

read_tsv("../data/GiladStrains/AustinBankedCellList.txt") %>%
  select(Line, Plate_In_LCL_Rack=Plate) %>%
  mutate(Line = str_replace(Line, "(\\d+?)[AB*]", "\\1")) %>%
  mutate(IsNewToGiladLab = Line %in% AllLinesOrderedThatAreNewToGiladLab) %>%
  write_tsv("~/Desktop/ListOfLinesThatAustinStockedInLCLRack.tsv")
  # mutate(IsOrderedFromCoriell = Line %in% c(str_remove(SecondCoriellOrder, "GM"), SamplesToOrder)) %>%
  # drop_na() %>%
  # pull(IsNewToGiladLab) %>% table()

sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS/LAPACK: /project2/yangili1/bjf79/ChromatinSplicingQTLs/code/.snakemake/conda/4480a43d58942cece9fb73087fc984b8/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] knitr_1.31      readxl_1.3.1    forcats_0.5.1   stringr_1.4.0  
 [5] dplyr_1.0.5     purrr_0.3.4     readr_1.4.0     tidyr_1.1.3    
 [9] tibble_3.1.0    ggplot2_3.3.3   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0  xfun_0.20         bslib_0.2.4       haven_2.3.1      
 [5] colorspace_2.0-0  vctrs_0.3.7       generics_0.1.0    htmltools_0.5.1.1
 [9] yaml_2.2.1        utf8_1.2.1        rlang_0.4.10      jquerylib_0.1.3  
[13] later_1.1.0.1     pillar_1.5.1      withr_2.4.1       glue_1.4.2       
[17] DBI_1.1.1         dbplyr_2.1.1      modelr_0.1.8      lifecycle_1.0.0  
[21] cellranger_1.1.0  munsell_0.5.0     gtable_0.3.0      workflowr_1.6.2  
[25] rvest_1.0.0       evaluate_0.14     ps_1.6.0          httpuv_1.5.5     
[29] fansi_0.4.2       broom_0.7.6       Rcpp_1.0.6        promises_1.2.0.1 
[33] backports_1.2.1   scales_1.1.1      jsonlite_1.7.2    fs_1.5.0         
[37] hms_1.0.0         digest_0.6.27     stringi_1.4.3     rprojroot_2.0.2  
[41] grid_3.6.1        cli_2.4.0         tools_3.6.1       magrittr_2.0.1   
[45] sass_0.3.1        crayon_1.4.1      whisker_0.4       pkgconfig_2.0.3  
[49] ellipsis_0.3.1    xml2_1.3.2        reprex_1.0.0      lubridate_1.7.10 
[53] rstudioapi_0.13   assertthat_0.2.1  rmarkdown_2.7     httr_1.4.2       
[57] R6_2.5.0          git2r_0.28.0      compiler_3.6.1