Data structures

Last updated: 2020-10-11

This reproducible R Markdown analysis was created with workflowr (version 1.6.2).

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/data_struct.Rmd) and HTML (docs/data_struct.html) files.

File	Version	Author	Date	Message
Rmd	048aa89	davetang	2020-10-11	More data structures
html	407b013	Dave Tang	2020-10-10	Build site.
Rmd	8520292	Dave Tang	2020-10-10	Start page on data structures

Introduction

Learn about different data structures to write more efficient programs.

Arrays

Arrays store multiple values, and each element is accessed through its index, a number that denotes its position in the array. Data is stored in consecutive memory locations, so the address of any element can be calculated directly from its index, which allows random access. However, adding or deleting data at a specific position carries a high cost compared to lists. To add an element, additional space is first secured at the end of the array, and then every element from the insertion point onwards is shifted one position towards the end, one element at a time, to free up the required slot. Deletion works the same way in reverse: each element after the deleted one is shifted one position forward to close the gap.
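
The shifting described above can be sketched as follows. This is a minimal illustration that uses a Python list as a stand-in for an array; the helper names insert_at and delete_at are hypothetical, and in practice the built-in list.insert and list.pop do the same work.

```python
def insert_at(values, index, item):
    """Insert item at index by shifting later elements one slot to the right."""
    values.append(None)  # secure one extra slot at the end of the array
    for i in range(len(values) - 1, index, -1):
        values[i] = values[i - 1]  # shift one element at a time
    values[index] = item


def delete_at(values, index):
    """Delete the element at index by shifting later elements one slot to the left."""
    for i in range(index, len(values) - 1):
        values[i] = values[i + 1]  # close the gap one element at a time
    values.pop()  # release the now-unused last slot


numbers = [10, 20, 40, 50]
third = numbers[2]         # random access by index: 40
insert_at(numbers, 2, 30)  # numbers is now [10, 20, 30, 40, 50]
delete_at(numbers, 0)      # numbers is now [20, 30, 40, 50]
```

Reading numbers[2] takes one step regardless of the array's length, whereas both helpers loop over every element after the affected position.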

Linked lists

Linked lists store multiple values in a linear data structure. Unlike arrays, each piece of data is paired with a “pointer” that indicates the memory location of the next piece of data. Data is stored in disjoint locations in memory, so each piece of data can only be reached through the pointer that precedes it, starting from the first element; there is no random access. Adding data is simple once the position is known: only the pointers on either side of the new element are replaced, and no existing data is shifted.
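
As a rough sketch of the pointer handling described above, the following builds a singly linked list and inserts a value in the middle. The Node class and the helper names insert_after and to_list are hypothetical names chosen for illustration, and Python object references stand in for pointers.

```python
class Node:
    """One piece of data paired with a reference to the next node."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def insert_after(node, value):
    """Insert value right after node by rewiring two references."""
    node.next = Node(value, node.next)


def to_list(head):
    """Walk the list from the head, following one reference at a time."""
    values = []
    node = head
    while node is not None:
        values.append(node.value)
        node = node.next
    return values


head = Node("a", Node("c"))   # a -> c
insert_after(head, "b")       # a -> b -> c
contents = to_list(head)      # ['a', 'b', 'c']
```

The insertion touches only the new node and its predecessor, but reaching a given position still requires walking from the head as to_list does.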

Stacks

The structure of a stack can be imagined as a pile of objects stacked vertically. Objects are extracted from the top of the pile down to the bottom. When adding data to a stack, the data is placed directly on top of the existing data, like stacking blocks on top of each other; this is called a “push”. When extracting from a stack, the most recently added data is removed first; this is called a “pop”. This method of extracting the most recently added data first is called “Last In First Out” (LIFO).
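
Push and pop can be illustrated with a plain Python list, where the end of the list plays the role of the top of the stack. This is a minimal sketch; the variable names are arbitrary.

```python
stack = []

# Push: each new item goes on top of the previous one.
stack.append("red")
stack.append("green")
stack.append("blue")

# Pop: the most recently added item comes off first (LIFO).
first_out = stack.pop()   # "blue"
second_out = stack.pop()  # "green"
remaining = list(stack)   # ["red"]
```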

Queues

Queues are a data structure that mimics a waiting line; the sooner a person lines up, the sooner they are served. When adding data to a queue, the data is placed at the end; this is known as “enqueuing”. When extracting data from a queue, the data that has been in the queue the longest is removed first; this is known as “dequeuing”. This method of extracting the earliest added data first is called “First In First Out” (FIFO).
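
Enqueuing and dequeuing can be sketched with collections.deque from the Python standard library, which supports efficient removal from the front. The names used for the queued items are made up for illustration.

```python
from collections import deque

queue = deque()

# Enqueue: new arrivals join the end of the line.
queue.append("first")
queue.append("second")
queue.append("third")

# Dequeue: whoever has waited longest leaves first (FIFO).
served = queue.popleft()       # "first"
served_next = queue.popleft()  # "second"
still_waiting = list(queue)    # ["third"]
```

A plain list would also work, but removing from the front of a list shifts every remaining element, as described in the section on arrays.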

Hash tables

Hash tables are a common way to implement associative arrays, which store data as pairs of keys and values. When storing a key, a hash function converts the key into a hash value of fixed length. To store data, an array of size n is first initialised. The hash value calculated for a key is then reduced to an array index, for example by taking the remainder of dividing it by n (a mod operation), and the value is stored at that index. If another key’s hash value produces the same array index (a collision), both entries are kept in a linked list at that index; this is known as the chain method. There are other ways of handling collisions, such as open addressing. When looking up a key, its hash value and the mod operation are calculated again to find its index in the array. If the array element is a list, a linear search is performed on the list to find the value that belongs to the key.
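
The chain method described above might look roughly like the sketch below. Several things here are assumptions made for illustration: the class name ChainedHashTable, the use of Python's built-in hash() as the hash function, and the use of Python lists for the chains instead of true linked lists. The example names and ages are invented.

```python
class ChainedHashTable:
    """A toy hash table that resolves collisions by chaining."""

    def __init__(self, n=8):
        # An array of n buckets; each bucket holds a chain of (key, value) pairs.
        self.buckets = [[] for _ in range(n)]

    def _index(self, key):
        # Reduce the hash value to an array index with a mod operation.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # key already stored: overwrite
                return
        bucket.append((key, value))  # new key: extend the chain

    def get(self, key):
        # Linear search through the chain at the key's index.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)


# Only two buckets, so three keys are guaranteed to collide somewhere.
table = ChainedHashTable(n=2)
table.put("Joe", 35)
table.put("Sue", 28)
table.put("Dan", 41)
sue_age = table.get("Sue")  # 28
longest_chain = max(len(bucket) for bucket in table.buckets)  # at least 2
```

With more buckets than keys the chains stay short, so lookups usually inspect only one or two entries.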

Heaps

Heaps are one type of tree data structure and are commonly used to implement a priority queue. In a priority queue, data can be added in any order, but when extracting data, the smallest value is always chosen first. This property of being able to freely add data and then extract the smallest values first defines a priority queue.
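
The behaviour of a priority queue can be tried out with the heapq module from the Python standard library, which maintains a min-heap inside an ordinary list. This is only a usage sketch; the values are arbitrary.

```python
import heapq

priority_queue = []

# Data can be added in any order.
for value in [5, 1, 8, 3]:
    heapq.heappush(priority_queue, value)

# Extraction always returns the smallest remaining value.
extracted = [heapq.heappop(priority_queue) for _ in range(4)]  # [1, 3, 5, 8]
```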

As a rule of heaps, a child number is never smaller than its parent number, so the smallest value is always held at the top of the tree. A new number is added at the end of the bottom row. If it is smaller than its parent, the child and parent swap, and this operation repeats until no additional swaps occur. When extracting a number from a heap, the number at the top of the tree is removed, and the heap’s structure then needs to be reorganised. The last number in the bottom row moves to the top, and if either of its child numbers is smaller than it, it swaps with the smaller of the two children. This repeats until no additional swaps occur.
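
The two swap procedures above can be sketched as follows. The sketch assumes the tree is stored in a list, with the children of position i at positions 2i + 1 and 2i + 2; this layout and the function names heap_push and heap_pop are choices made for illustration, not something the text prescribes.

```python
def heap_push(heap, value):
    """Add value at the end, then swap upwards while it is smaller than its parent."""
    heap.append(value)
    child = len(heap) - 1
    while child > 0:
        parent = (child - 1) // 2
        if heap[child] >= heap[parent]:
            break  # heap rule holds, no more swaps
        heap[child], heap[parent] = heap[parent], heap[child]
        child = parent


def heap_pop(heap):
    """Remove the top value, move the last value to the top, then swap downwards."""
    smallest = heap[0]
    last = heap.pop()
    if heap:
        heap[0] = last
        parent = 0
        while True:
            left, right = 2 * parent + 1, 2 * parent + 2
            candidate = parent
            if left < len(heap) and heap[left] < heap[candidate]:
                candidate = left
            if right < len(heap) and heap[right] < heap[candidate]:
                candidate = right
            if candidate == parent:
                break  # neither child is smaller, no more swaps
            heap[parent], heap[candidate] = heap[candidate], heap[parent]
            parent = candidate
    return smallest


heap = []
for number in [6, 3, 9, 1, 4]:
    heap_push(heap, number)
top = heap[0]                                    # 1, the smallest value
in_order = [heap_pop(heap) for _ in range(5)]    # [1, 3, 4, 6, 9]
```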

Heaps can be used to quickly extract the smallest value, but they do not support efficient extraction of data from the middle of the tree.

Binary search trees

Binary search trees have two properties:

All nodes are greater than the nodes in their left sub-tree
All nodes are smaller than the nodes in their right sub-tree

Due to these two properties the following are true:

A binary search tree’s smallest node is found by starting at the top-most node and following the left branches until there are no more
A binary search tree’s largest node is found by starting at the top-most node and following the right branches until there are no more

To add a number to a binary search tree, we start at the top-most node and compare the number with the current node. If the number to be added is smaller than the current node, it proceeds to the left; if it is larger, it proceeds to the right. This comparison is repeated at each node on the way down the tree until a position with no node is reached, and the number is added there as a new node.
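
The insertion procedure, together with the smallest and largest node properties, can be sketched as below. The TreeNode class and the function names are hypothetical, and ignoring duplicate values is an assumption, since the text does not say how they should be handled.

```python
class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def insert(root, value):
    """Walk down from the root and attach value at the first empty position."""
    if root is None:
        return TreeNode(value)
    node = root
    while True:
        if value < node.value:
            if node.left is None:
                node.left = TreeNode(value)
                break
            node = node.left
        elif value > node.value:
            if node.right is None:
                node.right = TreeNode(value)
                break
            node = node.right
        else:
            break  # assumption: duplicates are ignored
    return root


def smallest(root):
    """Follow left branches from the root until there are no more."""
    node = root
    while node.left is not None:
        node = node.left
    return node.value


def largest(root):
    """Follow right branches from the root until there are no more."""
    node = root
    while node.right is not None:
        node = node.right
    return node.value


root = None
for number in [15, 9, 23, 3, 12, 17, 28]:
    root = insert(root, number)

minimum = smallest(root)  # 3
maximum = largest(root)   # 28
```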

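Binary search trees are used to efficiently search for numbers, because each comparison discards one whole sub-tree. The number of comparisons depends on the height of the tree, so a tree that grows lopsided loses this advantage. Self-balancing binary search trees reorganise themselves as data is added or removed to stay well-balanced and maintain search efficiency.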
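
The effect of balance on search can be seen in a small experiment. The sketch below uses a plain (not self-balancing) binary search tree with hypothetical helper names, builds it from the same seven numbers in two different insertion orders, and counts how many nodes a search visits.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def insert(node, value):
    """Recursive insertion into a plain binary search tree."""
    if node is None:
        return Node(value)
    if value < node.value:
        node.left = insert(node.left, value)
    elif value > node.value:
        node.right = insert(node.right, value)
    return node


def build(values):
    root = None
    for value in values:
        root = insert(root, value)
    return root


def search(node, target):
    """Return (found, number of nodes visited)."""
    visited = 0
    while node is not None:
        visited += 1
        if target == node.value:
            return True, visited
        node = node.left if target < node.value else node.right
    return False, visited


balanced = build([4, 2, 6, 1, 3, 5, 7])  # three levels
lopsided = build([1, 2, 3, 4, 5, 6, 7])  # every node hangs to the right

found_balanced, visits_balanced = search(balanced, 7)  # (True, 3)
found_lopsided, visits_lopsided = search(lopsided, 7)  # (True, 7)
```

Inserting already sorted data produces the lopsided tree, which behaves like a linked list; a self-balancing tree would rearrange its nodes to avoid this.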

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.6.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5      rstudioapi_0.11 whisker_0.4     knitr_1.29     
 [5] magrittr_1.5    R6_2.4.1        rlang_0.4.7     stringr_1.4.0  
 [9] tools_4.0.2     xfun_0.16       git2r_0.27.1    htmltools_0.5.0
[13] ellipsis_0.3.1  rprojroot_1.3-2 yaml_2.2.1      digest_0.6.25  
[17] tibble_3.0.3    lifecycle_0.2.0 crayon_1.3.4    later_1.1.0.1  
[21] vctrs_0.3.4     promises_1.1.1  fs_1.5.0        glue_1.4.2     
[25] evaluate_0.14   rmarkdown_2.3   stringi_1.4.6   compiler_4.0.2 
[29] pillar_1.4.6    backports_1.1.9 httpuv_1.5.4    pkgconfig_2.0.3