Last updated: 2023-07-26
Checks: 7 0
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200712)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version bd5654d. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: r_packages_4.3.0/
Untracked files:
Untracked: analysis/cell_ranger.Rmd
Untracked: analysis/tss_xgboost.Rmd
Untracked: code/multiz100way/
Untracked: data/HG00702_SH089_CHSTrio.chr1.vcf.gz
Untracked: data/HG00702_SH089_CHSTrio.chr1.vcf.gz.tbi
Untracked: data/ncrna_NONCODE[v3.0].fasta.tar.gz
Untracked: data/ncrna_noncode_v3.fa
Untracked: data/netmhciipan.out.gz
Untracked: data/test
Untracked: export/davetang039sblog.WordPress.2023-06-30.xml
Untracked: export/output/
Untracked: women.json
Unstaged changes:
Modified: analysis/graph.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/reshape.Rmd
) and HTML
(docs/reshape.html
) files. If you’ve configured a remote
Git repository (see ?wflow_git_remote
), click on the
hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | bd5654d | Dave Tang | 2023-07-26 | Convert wide and long data |
I use tidyverse packages a lot and I prefer them over base R functions, especially when it comes to plotting. However, sometimes I want to write an R script with no dependencies. This is known as using base R functions, i.e. using functions that come with R. Theoretically this means that anyone with R can run the script, if I don’t use any features introduced in later versions of R, like the base R pipe.
The pivot_longer
and pivot_wider
functions
in the tidyr package converts
data into long and wide format, respectively. You may already be
familiar with wide data; one example of wide data is a gene expression
data, where gene expression for a gene is measured and stored in
different tissues.
gene_exp <- read.delim(
file = "https://davetang.org/file/TagSeqExample.tab",
header = TRUE
)
head(gene_exp)
gene T1a T1b T2 T3 N1 N2
1 Gene_00001 0 0 2 0 0 1
2 Gene_00002 20 8 12 5 19 26
3 Gene_00003 3 0 2 0 0 0
4 Gene_00004 75 84 241 149 271 257
5 Gene_00005 10 16 4 0 4 10
6 Gene_00006 129 126 451 223 243 149
Long data looks like this.
tidyr::pivot_longer(
data = gene_exp,
cols = -gene,
names_to = "sample",
values_to = "count"
) -> gene_exp_long
head(gene_exp_long)
# A tibble: 6 × 3
gene sample count
<chr> <chr> <int>
1 Gene_00001 T1a 0
2 Gene_00001 T1b 0
3 Gene_00001 T2 2
4 Gene_00001 T3 0
5 Gene_00001 N1 0
6 Gene_00001 N2 1
There are advantages to using wide and long format but I typically convert my wide data to long format for use with ggplot2.
library(ggplot2)
ggplot(gene_exp_long[1:(6*20), ], aes(gene, count, fill = sample)) +
geom_col(position = position_dodge()) +
coord_flip() +
theme_minimal() +
theme(axis.title.y = element_blank()) +
NULL
Converting long data back to wide data can be done using
pivot_wider
.
tidyr::pivot_wider(
data = gene_exp_long,
id_cols = gene,
names_from = sample,
values_from = count
)
# A tibble: 18,760 × 7
gene T1a T1b T2 T3 N1 N2
<chr> <int> <int> <int> <int> <int> <int>
1 Gene_00001 0 0 2 0 0 1
2 Gene_00002 20 8 12 5 19 26
3 Gene_00003 3 0 2 0 0 0
4 Gene_00004 75 84 241 149 271 257
5 Gene_00005 10 16 4 0 4 10
6 Gene_00006 129 126 451 223 243 149
7 Gene_00007 13 4 21 19 31 4
8 Gene_00008 0 3 0 0 0 0
9 Gene_00009 202 122 256 43 287 357
10 Gene_00010 10 8 56 145 14 15
# ℹ 18,750 more rows
How do we do this using base R?
The documentation for reshape()
describes the function
as:
This function reshapes a data frame between “wide” format (with repeated measurements in separate columns of the same row) and “long” format (with the repeated measurements in separate rows).
The documentation also shows how reshape()
is typically
used:
# reshape(data, direction = "wide",
# idvar = "___", timevar = "___", # mandatory
# v.names = c(___), # time-varying variables
# varying = list(___)) # auto-generated if missing
# reshape(data, direction = "long",
# varying = c(___), # vector
# sep) # to help guess 'v.names' and 'times'
Wide to long using reshape
.
reshape(
data = gene_exp,
direction = "long",
varying = colnames(gene_exp)[-1],
v.names = "count",
times = colnames(gene_exp)[-1],
timevar = "sample"
) -> out
# order by gene like pivot_longer
out <- out[order(out$gene), ]
# remove row names
row.names(out) <- NULL
# remove id column
out$id <- NULL
head(out)
gene sample count
1 Gene_00001 T1a 0
2 Gene_00001 T1b 0
3 Gene_00001 T2 2
4 Gene_00001 T3 0
5 Gene_00001 N1 0
6 Gene_00001 N2 1
table(out$count == gene_exp_long$count)
TRUE
112560
We achieved the same* result using reshape
but with a
bit more typing. (*I simply compared the count values above instead of
using identical
or all.equal
because
reshape
adds attributes to the object that make it
different to the pivot_longer
object.)
The arguments for varying
and times
should
be the column names of the data frame minus the variable to keep
constant. v.names
corresponds to values_to
and
timevar
corresponds to names_to
in
pivot_longer
.
Long to wide using reshape
.
reshape(
data = out,
direction = "wide",
idvar = "gene",
timevar = "sample",
v.names = "count"
) -> out2
colnames(out2) <- sub("^count\\.", "", colnames(out2))
head(gene_exp)
gene T1a T1b T2 T3 N1 N2
1 Gene_00001 0 0 2 0 0 1
2 Gene_00002 20 8 12 5 19 26
3 Gene_00003 3 0 2 0 0 0
4 Gene_00004 75 84 241 149 271 257
5 Gene_00005 10 16 4 0 4 10
6 Gene_00006 129 126 451 223 243 149
head(out2)
gene T1a T1b T2 T3 N1 N2
1 Gene_00001 0 0 2 0 0 1
7 Gene_00002 20 8 12 5 19 26
13 Gene_00003 3 0 2 0 0 0
19 Gene_00004 75 84 241 149 271 257
25 Gene_00005 10 16 4 0 4 10
31 Gene_00006 129 126 451 223 243 149
R is a statistical language and the design/implementation of functions, their arguments, and documentation reflect this. I’m not a statistician and reading the documentation for base R functions is difficult for me. This is one reason why I prefer the Tidyverse.
However, as I mentioned in the introduction, there are times when I
want an R script to have little to no dependencies. In one of my
scripts, I need to convert data back to wide format and used
pivot_wider
. But now I can use the base R function
reshape
.
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.4.2 tidyr_1.3.0 workflowr_1.7.0
loaded via a namespace (and not attached):
[1] sass_0.4.6 utf8_1.2.3 generics_0.1.3 stringi_1.7.12
[5] digest_0.6.31 magrittr_2.0.3 evaluate_0.21 grid_4.3.0
[9] fastmap_1.1.1 rprojroot_2.0.3 jsonlite_1.8.5 processx_3.8.1
[13] whisker_0.4.1 ps_1.7.5 promises_1.2.0.1 httr_1.4.6
[17] purrr_1.0.1 fansi_1.0.4 scales_1.2.1 jquerylib_0.1.4
[21] cli_3.6.1 rlang_1.1.1 munsell_0.5.0 withr_2.5.0
[25] cachem_1.0.8 yaml_2.3.7 tools_4.3.0 dplyr_1.1.2
[29] colorspace_2.1-0 httpuv_1.6.11 vctrs_0.6.2 R6_2.5.1
[33] lifecycle_1.0.3 git2r_0.32.0 stringr_1.5.0 fs_1.6.2
[37] pkgconfig_2.0.3 callr_3.7.3 pillar_1.9.0 bslib_0.5.0
[41] later_1.3.1 gtable_0.3.3 glue_1.6.2 Rcpp_1.0.10
[45] highr_0.10 xfun_0.39 tibble_3.2.1 tidyselect_1.2.0
[49] rstudioapi_0.14 knitr_1.43 farver_2.1.1 htmltools_0.5.5
[53] rmarkdown_2.22 labeling_0.4.2 compiler_4.3.0 getPass_0.2-2