Last updated: 2022-09-01
Checks: 7 0
Knit directory: muse/ 
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200712) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version c8588b7. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.1.2/
    Ignored:    r_packages_4.2.0/
Untracked files:
    Untracked:  analysis/cell_ranger.Rmd
    Untracked:  data/ncrna_NONCODE[v3.0].fasta.tar.gz
    Untracked:  data/ncrna_noncode_v3.fa
Unstaged changes:
    Modified:   script/run_rstudio.sh
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/split_column.Rmd) and HTML
(docs/split_column.html) files. If you’ve configured a
remote Git repository (see ?wflow_git_remote), click on the
hyperlinks in the table below to view the files as they were in that
past version.
| File | Version | Author | Date | Message | 
|---|---|---|---|---|
| Rmd | c8588b7 | Dave Tang | 2022-09-01 | Split GTF group column | 
| html | d6644d2 | Dave Tang | 2021-06-22 | Build site. | 
| Rmd | 8ae8e21 | Dave Tang | 2021-06-22 | Post on splitting a single column of annotations | 
Two widely used file formats in bioinformatics, VCF and GTF, have
single columns that are packed with annotation information. This makes
them a bit inconvenient to work with in R when using data frames because
the values need to be unpacked, i.e. split. In addition, this violates
one of the conditions for tidy data, which is that every cell is a
single value. In this post, we will use tools from the
tidyverse to split the values into multiple columns to make
the data easier to work with in R.
To get started, install the tidyverse if you haven’t
already.
if(!require("tidyverse")){
  install.packages("tidyverse")
}
library(tidyverse)I have a small package called importbio that can be
used to load a VCF and GTF file. You can install it using the
remotes package.
if(!require("remotes")){
  install.packages("remotes")
}
if(!require("importbio")){
  remotes::install_github('davetang/importbio')
}We will load a small VCF file using importvcf.
library(importbio)
my_vcf <- importvcf("https://raw.githubusercontent.com/davetang/learning_vcf_file/master/eg/Pfeiffer.vcf")
my_vcf %>%
  select(info) %>%
  head()# A tibble: 6 × 1
  info                                                                          
  <chr>                                                                         
1 AC=2;AF=1.00;AN=2;DB;DP=11;FS=0.000;HRun=0;HaplotypeScore=41.3338;MQ0=0;MQ=61…
2 AC=1;AF=0.50;AN=2;BaseQRankSum=1.455;DB;DP=21;Dels=0.00;FS=1.984;HRun=0;Haplo…
3 AC=1;AF=0.50;AN=2;BaseQRankSum=1.934;DP=48;Dels=0.00;FS=4.452;HRun=0;Haplotyp…
4 AC=1;AF=0.50;AN=2;BaseQRankSum=-4.517;DB;DP=29;Dels=0.00;FS=1.485;HRun=0;Hapl…
5 AC=1;AF=0.50;AN=2;BaseQRankSum=0.199;DB;DP=33;Dels=0.00;FS=0.000;HRun=1;Haplo…
6 AC=1;AF=0.50;AN=2;BaseQRankSum=-0.259;DB;DP=12;Dels=0.00;FS=0.000;HRun=1;Hapl…Note that the info column is packed with all sorts of
information for each variant. Also note the consistent format of the
info column: each annotation is separated by a semi-colon
(;) and annotations are stored as key-value pairs with an
equal sign in between.
Firstly, we will use separate_rows to create a new row
for each annotation by using ; as the separator/delimiter;
note that I have included \\s* after ;, which
is a regex for specifying a single whitespace occurring 0 or more times.
By including the regex, whitespace after ; will be removed,
which is good because we do not want whitespace in our data. In
addition, a mutate call is used prior to calling
separate_rows and it is used to remove a trailing
semicolon, if it exists.
my_vcf %>%
  mutate(info = sub(pattern = ";$", replacement = "", x = .data$info)) %>%
  separate_rows(info, sep = ";\\s*") %>%
  head()# A tibble: 6 × 10
  vid              chrom    pos id         ref   alt   qual   filter info  type 
  <chr>            <fct>  <int> <chr>      <chr> <chr> <chr>  <chr>  <chr> <chr>
1 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   AC=2  ins  
2 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   AF=1… ins  
3 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   AN=2  ins  
4 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   DB    ins  
5 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   DP=11 ins  
6 1_866511_C_CCCCT 1     866511 rs60722469 C     CCCCT 258.62 PASS   FS=0… ins  The next step is to split the key-value pairs and we will use the
separate function to separate the pairs into two columns,
which we will name key and value, using the
equal sign as the separator/delimiter. Sometimes a key is missing a
value and in these cases, the value will be NA.
my_vcf %>%
  mutate(info = sub(pattern = ";$", replacement = "", x = .data$info)) %>%
  separate_rows(info, sep = ";\\s*") %>%
  separate(info, c('key', 'value'), sep = "=") %>%
  head(10)# A tibble: 10 × 11
   vid             chrom    pos id    ref   alt   qual  filter key   value type 
   <chr>           <fct>  <int> <chr> <chr> <chr> <chr> <chr>  <chr> <chr> <chr>
 1 1_866511_C_CCC… 1     866511 rs60… C     CCCCT 258.… PASS   AC    2     ins  
 2 1_866511_C_CCC… 1     866511 rs60… C     CCCCT 258.… PASS   AF    1.00  ins  
 3 1_866511_C_CCC… 1     866511 rs60… C     CCCCT 258.… PASS   AN    2     ins  
 4 1_866511_C_CCC… 1     866511 rs60… C     CCCCT 258.… PASS   DB    <NA>  ins  
 5 1_866511_C_CCC… 1     866511 rs60… C     CCCCT 258.… PASS   DP    11    ins  
 6 1_866511_C_CCC… 1     866511 rs60… C     CCCCT 258.… PASS   FS    0.000 ins  
 7 1_866511_C_CCC… 1     866511 rs60… C     CCCCT 258.… PASS   HRun  0     ins  
 8 1_866511_C_CCC… 1     866511 rs60… C     CCCCT 258.… PASS   Hapl… 41.3… ins  
 9 1_866511_C_CCC… 1     866511 rs60… C     CCCCT 258.… PASS   MQ0   0     ins  
10 1_866511_C_CCC… 1     866511 rs60… C     CCCCT 258.… PASS   MQ    61.94 ins  The current state of the transformation produces a new row for each
variant annotation and two columns containing the key and value. If we
want our data in wide format where each annotation is a column, we can
use the pivot_wider function.
In the code below, I have used the first eight columns
(id_cols = vid:filter) to specify the columns that uniquely
identifies each variant, i.e. the same variant will have the same values
in these columns. We specify our column names from the key
column and the values for these cells will come from the
value column.
my_vcf %>%
  mutate(info = sub(pattern = ";$", replacement = "", x = .data$info)) %>%
  separate_rows(info, sep = ";\\s*") %>%
  separate(info, c('key', 'value'), sep = "=") %>%
  distinct() %>% # remove duplicated annotations, if any
  pivot_wider(id_cols = vid:filter, names_from = key, values_from = value) %>%
  head(10)# A tibble: 10 × 28
   vid       chrom    pos id    ref   alt   qual  filter AC    AF    AN    DB   
   <chr>     <fct>  <int> <chr> <chr> <chr> <chr> <chr>  <chr> <chr> <chr> <chr>
 1 1_866511… 1     866511 rs60… C     CCCCT 258.… PASS   2     1.00  2     <NA> 
 2 1_879317… 1     879317 rs75… C     T     150.… PASS   1     0.50  2     <NA> 
 3 1_879482… 1     879482 .     G     C     484.… PASS   1     0.50  2     <NA> 
 4 1_880390… 1     880390 rs37… C     A     288.… PASS   1     0.50  2     <NA> 
 5 1_881627… 1     881627 rs22… G     A     486.… PASS   1     0.50  2     <NA> 
 6 1_884091… 1     884091 rs75… C     G     65.46 PASS   1     0.50  2     <NA> 
 7 1_884101… 1     884101 rs49… A     C     85.81 PASS   1     0.50  2     <NA> 
 8 1_892460… 1     892460 rs41… G     C     1736… PASS   1     0.50  2     <NA> 
 9 1_897730… 1     897730 rs75… C     T     225.… PASS   1     0.50  2     <NA> 
10 1_909238… 1     909238 rs38… G     C     247.… PASS   1     0.50  2     <NA> 
# … with 16 more variables: DP <chr>, FS <chr>, HRun <chr>,
#   HaplotypeScore <chr>, MQ0 <chr>, MQ <chr>, QD <chr>, set <chr>,
#   BaseQRankSum <chr>, Dels <chr>, MQRankSum <chr>, ReadPosRankSum <chr>,
#   DS <chr>, GENE <chr>, INHERITANCE <chr>, MIM <chr>Now each row is a single variant and each column is a variable.
The GTF also has a column packed with key-value pairs.
my_gtf <- read_tsv(
  file = "https://github.com/davetang/importbio/raw/master/inst/extdata/gencode.v38.annotation.sample.gtf.gz",
  comment = "#",
  show_col_types = FALSE,
  col_names = c('chr', 'src', 'feat', 'start', 'end', 'score', 'strand', 'frame', 'group'))
my_gtf %>%
  select(group) %>%
  head()# A tibble: 6 × 1
  group                                                                         
  <chr>                                                                         
1 "gene_id \"ENSG00000223972.5\"; gene_type \"transcribed_unprocessed_pseudogen…
2 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_typ…
3 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_typ…
4 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_typ…
5 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000456328.2\"; gene_typ…
6 "gene_id \"ENSG00000223972.5\"; transcript_id \"ENST00000450305.2\"; gene_typ…We can use the same strategy (but with some additional formatting steps) to split the column up.
my_gtf %>%
  mutate(group = sub(pattern = ";$", replacement = "", x = .data$group)) %>%
  mutate(group = gsub(pattern = '"', replacement = "", x = .data$group)) %>%
  separate_rows(group, sep = ";\\s*") %>%
  separate(group, c('key', 'value'), sep = "\\s") %>%
  distinct() %>% # remove duplicated annotations, if any
  pivot_wider(id_cols = chr:frame, names_from = key, values_from = value) -> my_gtf_splitWarning: Values from `value` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = {summary_fun}` to summarise duplicates.
* Use the following dplyr code to identify duplicates.
  {data} %>%
    dplyr::group_by(chr, src, feat, start, end, score, strand, frame, key) %>%
    dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
    dplyr::filter(n > 1L)head(my_gtf_split, 10)# A tibble: 10 × 24
   chr   src    feat      start   end score strand frame gene_id gene_…¹ gene_…²
   <chr> <chr>  <chr>     <dbl> <dbl> <chr> <chr>  <chr> <list>  <list>  <list> 
 1 chr1  HAVANA gene      11869 14409 .     +      .     <chr>   <chr>   <chr>  
 2 chr1  HAVANA transcri… 11869 14409 .     +      .     <chr>   <chr>   <chr>  
 3 chr1  HAVANA exon      11869 12227 .     +      .     <chr>   <chr>   <chr>  
 4 chr1  HAVANA exon      12613 12721 .     +      .     <chr>   <chr>   <chr>  
 5 chr1  HAVANA exon      13221 14409 .     +      .     <chr>   <chr>   <chr>  
 6 chr1  HAVANA transcri… 12010 13670 .     +      .     <chr>   <chr>   <chr>  
 7 chr1  HAVANA exon      12010 12057 .     +      .     <chr>   <chr>   <chr>  
 8 chr1  HAVANA exon      12179 12227 .     +      .     <chr>   <chr>   <chr>  
 9 chr1  HAVANA exon      12613 12697 .     +      .     <chr>   <chr>   <chr>  
10 chr1  HAVANA exon      12975 13052 .     +      .     <chr>   <chr>   <chr>  
# … with 13 more variables: level <list>, hgnc_id <list>, havana_gene <list>,
#   transcript_id <list>, transcript_type <list>, transcript_name <list>,
#   transcript_support_level <list>, tag <list>, havana_transcript <list>,
#   exon_number <list>, exon_id <list>, ont <list>, protein_id <list>, and
#   abbreviated variable names ¹gene_type, ²gene_nameHowever, the split columns are lists because there were some cases
where there were multiple annotations with the same key and a list is
needed to store multiple values (which was what the warning above was
about). For example the tag key was repeated more than once
with different unique values for some annotations.
map_lgl(my_gtf_split$tag, function(x) length(x) > 1) [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[25]  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[37] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
[49]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
[61]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
[73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUEWe can check which columns have multiple values.
check_column <- function(x){
  any(map_lgl(x, function(y) length(y) > 1))
}
my_check <- map_lgl(my_gtf_split, check_column)
my_check[my_check]    transcript_id   transcript_name               tag havana_transcript 
             TRUE              TRUE              TRUE              TRUE 
      exon_number               ont 
             TRUE              TRUE Therefore despite only a subset of the columns containing multiple
values, all the pivoted columns were turned into lists. However we can
turn the columns back into characters, which makes sense for the
gene_id column which only contained single unique character
values in the first place.
my_gtf_split %>%
  mutate(gene_id = as.character(gene_id)) %>%
  head()# A tibble: 6 × 24
  chr   src   feat  start   end score strand frame gene_id gene_…¹ gene_…² level
  <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>  <chr> <chr>   <list>  <list>  <lis>
1 chr1  HAVA… gene  11869 14409 .     +      .     ENSG00… <chr>   <chr>   <chr>
2 chr1  HAVA… tran… 11869 14409 .     +      .     ENSG00… <chr>   <chr>   <chr>
3 chr1  HAVA… exon  11869 12227 .     +      .     ENSG00… <chr>   <chr>   <chr>
4 chr1  HAVA… exon  12613 12721 .     +      .     ENSG00… <chr>   <chr>   <chr>
5 chr1  HAVA… exon  13221 14409 .     +      .     ENSG00… <chr>   <chr>   <chr>
6 chr1  HAVA… tran… 12010 13670 .     +      .     ENSG00… <chr>   <chr>   <chr>
# … with 12 more variables: hgnc_id <list>, havana_gene <list>,
#   transcript_id <list>, transcript_type <list>, transcript_name <list>,
#   transcript_support_level <list>, tag <list>, havana_transcript <list>,
#   exon_number <list>, exon_id <list>, ont <list>, protein_id <list>, and
#   abbreviated variable names ¹gene_type, ²gene_nameBut we can also do this to the tag column (even if it
needed a list to store the multiple values) and entries with multiple
values get turned into R (character) code!
my_gtf_split %>%
  mutate(tag = as.character(tag)) %>%
  select(tag) %>%
  head()# A tibble: 6 × 1
  tag                                  
  <chr>                                
1 "NULL"                               
2 "basic"                              
3 "basic"                              
4 "basic"                              
5 "basic"                              
6 "c(\"basic\", \"Ensembl_canonical\")"If you don’t mind having R (character) code in your data, you can perform this transformation across all pivoted columns.
my_gtf_split %>%
  mutate(across(gene_id:protein_id, as.character))# A tibble: 94 × 24
   chr   src    feat      start   end score strand frame gene_id gene_…¹ gene_…²
   <chr> <chr>  <chr>     <dbl> <dbl> <chr> <chr>  <chr> <chr>   <chr>   <chr>  
 1 chr1  HAVANA gene      11869 14409 .     +      .     ENSG00… transc… DDX11L1
 2 chr1  HAVANA transcri… 11869 14409 .     +      .     ENSG00… transc… DDX11L1
 3 chr1  HAVANA exon      11869 12227 .     +      .     ENSG00… transc… DDX11L1
 4 chr1  HAVANA exon      12613 12721 .     +      .     ENSG00… transc… DDX11L1
 5 chr1  HAVANA exon      13221 14409 .     +      .     ENSG00… transc… DDX11L1
 6 chr1  HAVANA transcri… 12010 13670 .     +      .     ENSG00… transc… DDX11L1
 7 chr1  HAVANA exon      12010 12057 .     +      .     ENSG00… transc… DDX11L1
 8 chr1  HAVANA exon      12179 12227 .     +      .     ENSG00… transc… DDX11L1
 9 chr1  HAVANA exon      12613 12697 .     +      .     ENSG00… transc… DDX11L1
10 chr1  HAVANA exon      12975 13052 .     +      .     ENSG00… transc… DDX11L1
# … with 84 more rows, 13 more variables: level <chr>, hgnc_id <chr>,
#   havana_gene <chr>, transcript_id <chr>, transcript_type <chr>,
#   transcript_name <chr>, transcript_support_level <chr>, tag <chr>,
#   havana_transcript <chr>, exon_number <chr>, exon_id <chr>, ont <chr>,
#   protein_id <chr>, and abbreviated variable names ¹gene_type, ²gene_nameThe following steps can be used to split a column containing key-value pairs into separate columns:
separate_rows to split a single column into
rowsseparate to split a key-value pair into two
columnspivot_wider to convert the long format table back
to wide formatHowever, sometimes data is packed into a single column because it cannot be nicely formatted in the first place.
sessionInfo()R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
 [1] importbio_0.0.0.9000 remotes_2.4.2        forcats_0.5.1       
 [4] stringr_1.4.0        dplyr_1.0.9          purrr_0.3.4         
 [7] readr_2.1.2          tidyr_1.2.0          tibble_3.1.8        
[10] ggplot2_3.3.6        tidyverse_1.3.1      workflowr_1.7.0     
loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3     lubridate_1.8.0  getPass_0.2-2    ps_1.7.0        
 [5] assertthat_0.2.1 rprojroot_2.0.3  digest_0.6.29    utf8_1.2.2      
 [9] R6_2.5.1         cellranger_1.1.0 backports_1.4.1  reprex_2.0.1    
[13] evaluate_0.15    httr_1.4.3       pillar_1.8.1     rlang_1.0.4     
[17] curl_4.3.2       readxl_1.4.0     rstudioapi_0.13  whisker_0.4     
[21] callr_3.7.0      jquerylib_0.1.4  rmarkdown_2.14   bit_4.0.4       
[25] munsell_0.5.0    broom_0.8.0      compiler_4.2.0   httpuv_1.6.5    
[29] modelr_0.1.8     xfun_0.31        pkgconfig_2.0.3  htmltools_0.5.2 
[33] tidyselect_1.1.2 fansi_1.0.3      crayon_1.5.1     tzdb_0.3.0      
[37] dbplyr_2.1.1     withr_2.5.0      later_1.3.0      grid_4.2.0      
[41] jsonlite_1.8.0   gtable_0.3.0     lifecycle_1.0.1  DBI_1.1.2       
[45] git2r_0.30.1     magrittr_2.0.3   scales_1.2.0     vroom_1.5.7     
[49] cli_3.3.0        stringi_1.7.6    fs_1.5.2         promises_1.2.0.1
[53] xml2_1.3.3       bslib_0.3.1      ellipsis_0.3.2   generics_0.1.3  
[57] vctrs_0.4.1      tools_4.2.0      bit64_4.0.5      glue_1.6.2      
[61] hms_1.1.2        parallel_4.2.0   processx_3.5.3   fastmap_1.1.0   
[65] yaml_2.3.5       colorspace_2.0-3 rvest_1.0.2      knitr_1.39      
[69] haven_2.5.0      sass_0.4.1