Last updated: 2024-06-04

Checks: 7 0

Knit directory: bioinformatics_tips/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200503) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 959c255. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/security.Rmd) and HTML (docs/security.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 959c255 Dave Tang 2024-06-04 Hybrid cryptosystem
html 8cff360 Dave Tang 2024-06-04 Build site.
Rmd a5b3e4c Dave Tang 2024-06-04 Public-key cryptosystem
html 7ebd7ec Dave Tang 2024-05-26 Build site.
Rmd 6d95a71 Dave Tang 2024-05-26 Example using gpg
Rmd 11f44f8 Dave Tang 2023-12-05 GnuPG concepts
html a400d25 Dave Tang 2023-06-27 Build site.
Rmd e4ad29d Dave Tang 2023-06-27 Update jquery
html c6a497c davetang 2020-06-21 Build site.
Rmd 3b81b96 davetang 2020-06-21 Queuing systems
html 02c559c davetang 2020-06-08 Build site.
Rmd 9019e42 davetang 2020-06-08 Shared-key cryptosystem
html f851e18 davetang 2020-06-07 Build site.
Rmd a36c1e8 davetang 2020-06-07 Security basics

The Internet has become an integral part of our lives. When exchanging data over the internet, the data passes through various networks and devices. There are four problems that can occur when data is transferred from one party to another:

  1. Interception
  2. Spoofing
  3. Falsification
  4. Repudication

These four problems are countered by:

  1. Encryption
  2. Message authenitcation codes (MACs) or digital signatures
  3. MACs or digital signatures
  4. Digital signatures and certificates

Encryption

Encryption means performing an operation on data such that a computer cannot decipher into something meaningful, i.e. turn data into ciphertext. A key is typically used to perform the encryption’s numeric calculation and the same key is used to decrypt the encrypted data. One way of achieving this is by using a XOR cipher; XOR (exclusive or) is an operation that works like OR but returns zero when both conditions are true.

If our data (in binary) is 00110011 and our key is 11110000 then:

If we use the same key on the ciphertext, we obtain the original data:

Hash functions

A hash function converts data into a random string of fixed length. The MD5 message-digest algorithm is a widely used (but outdated) hash function that produces a 128-bit hash value.

echo hello world | md5sum
6f5902ac237024bdd0c176cb93063dc4  -

The output is in hexadecimal (0-9 then A-F), which requires 4 bits to represent because F in hexadecimal is 1111 in binary. Therefore the 32 long hexadecimal number is 32*4 bits. Any data used as input into the MD5 hash function will return a 128-bit hash value or a length 32 hexadecimal number.

echo abc | md5sum
0bee89b07a248e27c83fc3d5951213c1  -

When given the same input, a hash function will invariably produce the same output.

echo hello world | md5sum
6f5902ac237024bdd0c176cb93063dc4  -

However, if the input data only differs by a single bit, the output is very different.

echo hell world | md5sum
a3723e12600ef5c0456c201f5e8c7a37  -

Sometimes, completely different data can produce identical hash values but this has a very low probability and is known as a hash collision. Finally, it is impossible to convert hash values back into their original data.

Shared-key cryptosystem

Shared-key or symmetric-key cryptosystems use the same key for encryption and decryption. The Advanced Encryption Standard is the first (and only) publicly accessible cipher approved by the National Security Agency (NSA).

The problem with shared-key systems is that in order for the receiving party to decrypt the encrypted file, the key needs to be transferred as well. A secure method is necessary for transmitting keys, i.e. performing a key-exchange. There are two types of methods:

  1. Methods using key-exchange protocols
  2. Methods using the public-key cryptosystem

Public-key cryptosystem

Unlike the shared-key cryptosystem, the public-key cryptosystem uses different keys for encryption and decryption. The key used for encryption is called a “public key” and the key used for decryption is called a “secret key”.

Compared to the shared-key cryptosystem, the public-key cryptosystem tends to take more time for both encryption and decryption. Some examples of methods for public-key encryption are RSA encryption and elliptic curve cryptography.

If party A wants to send data to party B over the Internet, the receiver (party B) creates a public and secret key. The public key is sent to party A and used to encrypt the data. The ciphertext is sent to party B and party B uses the secret key to decrypt the ciphertext from party A.

Even if the public key and ciphertext are intercepted, it’s fine because the ciphertext can only be decrypted with the secret key (not that not transmitted). Therefore unlike the shared-key cryptosystem, there is no key delivery problem with the public-key cryptosystem.

The public-key cryptosystem has another advantage of making it easy to exchange information among an unspecified number of parties. For example, if party B has prepared public and secret keys in advance and shared the public key online. Multiple parties can use the public key to encrypt data intended for party B. There is no need for multiple keys for multiple parties.

However there are two problems with the public-key cryptosystem:

  1. It takes time for both encryption and decryption and because of this, it’s not suitable for exchanging small amounts of data back and forth; the solution is to use a hybrid cryptosystem
  2. Reliability of public keys; there is no way to show who created a public key and a “person-in-the-middle attack” can occur

The second problem is due to the fact that another party cannot verify whether or not a public key was created by the a particular person. To address this issue, a system of digital certificates is used.

Hybrid cryptosystem

The shared-key cryptosystem suffers from the key delivery problem, where there’s a problem with securely exchanging keys. On the other hand, the public-key cryptosystem suffers from slow processing during encryption and decryption. The hybrid cryptosystem combines the two systems to make up for each other’s faults; the shared-key cryptosystem is used to quickly process data encryption and the keys used by the shared-key cryptosystem are exchanged using the public-key cryptosystem.

Let’s say party A wants to send data to party B over the Internet. The data is encrypted using the faster shared-key cryptosystem. Since party B needs the same key for decryption, party B creates public and secret keys in advance, and the public key is used by party A to encrypt the shared key. The shared key is sent to party B, along with the encrypted data, and party B can decrypt the shared key using their secret key and finally use it to decrypt the encrypted data.

The hybrid cryptosystem is used in SSL.

GnuPG concepts

GnuPG makes use of several cryptographic concepts including:

Symmetric ciphers

A symmetric cipher is a cipher that uses the same key for both encryption and decryption. Two parties communicating using a symmetric cipher must agree on the key beforehand. Once they agree, the sender encrypts a message using the key, sends it to the receiver, and the receiver decrypts the message using the key.

As an example, the German Enigma is a symmetric cipher, and daily keys were distributed as code books. Each day, a sending or receiving radio operator would consult his copy of the code book to find the day’s key. Radio traffic for that day was then encrypted and decrypted using the day’s key. Modern examples of symmetric ciphers include 3DES, Blowfish, and IDEA.

A good cipher puts all the security in the key and none in the algorithm. In other words, it should be no help to an attacker if he knows which cipher is being used. Only if he obtains the key would knowledge of the algorithm be needed. The ciphers used in GnuPG have this property.

Since all the security is in the key, then it is important that it be very difficult to guess the key. In other words, the set of possible keys, i.e., the key space, needs to be large. While at Los Alamos, Richard Feynman was famous for his ability to crack safes. To encourage the mystique he even carried around a set of tools including an old stethoscope. In reality, he used a variety of tricks to reduce the number of combinations he had to try to a small number and then simply guessed until he found the right combination. In other words, he reduced the size of the key space.

Britain used machines to guess keys during World War 2. The German Enigma had a very large key space, but the British built specialized computing engines, the Bombes, to mechanically try keys until the day’s key was found. This meant that sometimes they found the day’s key within hours of the new key’s use, but it also meant that on some days they never did find the right key. The Bombes were not general-purpose computers but were precursors to modern-day computers.

Today, computers can guess keys very quickly, and this is why key size is important in modern cryptosystems. The cipher DES uses a 56-bit key, which means that there are \(2^{56}\) possible keys (72,057,594,037,927,936 keys). This is a lot of keys, but a general-purpose computer can check the entire key space in a matter of days. A specialized computer can check it in hours. On the other hand, more recently designed ciphers such as 3DES, Blowfish, and IDEA all use 128-bit keys, which means there are \(2^{128}\) possible keys. This is many, many more keys, and even if all the computers on the planet cooperated, it could still take more time than the age of the universe to find the key.

Public-key ciphers

The primary problem with symmetric ciphers is not their security but with key exchange. Once the sender and receiver have exchanged keys, that key can be used to securely communicate, but what secure communication channel was used to communicate the key itself? In particular, it would probably be much easier for an attacker to work to intercept the key than it is to try all the keys in the key space. Another problem is the number of keys needed. If there are n people who need to communicate, then n(n-1)/2 keys are needed for each pair of people to communicate privately. This may be OK for a small number of people but quickly becomes unwieldy for large groups of people.

Public-key ciphers were invented to avoid the key-exchange problem entirely. A public-key cipher uses a pair of keys for sending messages. The two keys belong to the person receiving the message. One key is a public key and may be given to anybody. The other key is a private key and is kept secret by the owner. A sender encrypts a message using the public key and once encrypted, only the private key may be used to decrypt it.

This protocol solves the key-exchange problem inherent with symmetric ciphers. There is no need for the sender and receiver to agree upon a key. All that is required is that some time before secret communication the sender gets a copy of the receiver’s public key. Furthermore, the one public key can be used by anybody wishing to communicate with the receiver. So only n keypairs are needed for n people to communicate secretly with one another.

Public-key ciphers are based on one-way trapdoor functions. A one-way function is a function that is easy to compute, but the inverse is hard to compute. For example, it is easy to multiply two prime numbers together to get a composite, but it is difficult to factor a composite into its prime components. A one-way trapdoor function is similar, but it has a trapdoor. That is, if some piece of information is known, it becomes easy to compute the inverse. For example, if you have a number made of two prime factors, then knowing one of the factors makes it easy to compute the second. Given a public-key cipher based on prime factorization, the public key contains a composite number made from two large prime factors, and the encryption algorithm uses that composite to encrypt the message. The algorithm to decrypt the message requires knowing the prime factors, so decryption is easy if you have the private key containing one of the factors but extremely difficult if you do not have it.

As with good symmetric ciphers, with a good public-key cipher all of the security rests with the key. Therefore, key size is a measure of the system’s security, but one cannot compare the size of a symmetric cipher key and a public-key cipher key as a measure of their relative security. In a brute-force attack on a symmetric cipher with a key size of 80 bits, the attacker must enumerate up to \(2^{80}\) keys to find the right key. In a brute-force attack on a public-key cipher with a key size of 512 bits, the attacker must factor a composite number encoded in 512 bits (up to 155 decimal digits). The workload for the attacker is fundamentally different depending on the cipher he is attacking. While 128 bits is sufficient for symmetric ciphers, given today’s factoring technology public keys with 1024 bits are recommended for most purposes.

Hybrid ciphers

Public-key ciphers are no panacea. Many symmetric ciphers are stronger from a security standpoint, and public-key encryption and decryption are more expensive than the corresponding operations in symmetric systems. Public-key ciphers are nevertheless an effective tool for distributing symmetric cipher keys, and that is how they are used in hybrid cipher systems.

A hybrid cipher uses both a symmetric cipher and a public-key cipher. It works by using a public-key cipher to share a key for the symmetric cipher. The actual message being sent is then encrypted using the key and sent to the recipient. Since symmetric key sharing is secure, the symmetric key used is different for each message sent. Hence it is sometimes called a session key.

Both PGP and GnuPG use hybrid ciphers. The session key, encrypted using the public-key cipher, and the message being sent, encrypted with the symmetric cipher, are automatically combined in one package. The recipient uses his private-key to decrypt the session key and the session key is then used to decrypt the message.

A hybrid cipher is no stronger than the public-key cipher or symmetric cipher it uses, whichever is weaker. In PGP and GnuPG, the public-key cipher is probably the weaker of the pair. Fortunately, however, if an attacker could decrypt a session key it would only be useful for reading the one message encrypted with that session key. The attacker would have to start over and decrypt another session key in order to read any other message.

Digital signatures

A hash function is a many-to-one function that maps its input to a value in a finite set. Typically this set is a range of natural numbers. A simple hash function is \(f(x) = 0\) for all integers \(x\). A more interesting hash function is \(f(x) = x mod 37\), which maps \(x\) to the remainder of dividing \(x\) by 37.

A document’s digital signature is the result of applying a hash function to the document. To be useful, however, the hash function needs to satisfy two important properties. First, it should be hard to find two documents that hash to the same value. Second, given a hash value it should be hard to recover the document that produced that value.

Some public-key ciphers could be used to sign documents. The signer encrypts the document with his private key. Anybody wishing to check the signature and see the document simply uses the signer’s public key to decrypt the document. This algorithm does satisfy the two properties needed from a good hash function, but in practice, this algorithm is too slow to be useful.

An alternative is to use hash functions designed to satisfy these two important properties. SHA and MD5 are examples of such algorithms. Using such an algorithm, a document is signed by hashing it, and the hash value is the signature. Another person can check the signature by also hashing their copy of the document and comparing the hash value they get with the hash value of the original document. If they match, it is almost certain that the documents are identical.

Of course, the problem now is using a hash function for digital signatures without permitting an attacker to interfere with signature checking. If the document and signature are sent unencrypted, an attacker could modify the document and generate a corresponding signature without the recipient’s knowledge. If only the document is encrypted, an attacker could tamper with the signature and cause a signature check to fail. A third option is to use a hybrid public-key encryption to encrypt both the signature and document. The signer uses his private key, and anybody can use his public key to check the signature and document. This sounds good but is actually nonsense. If this algorithm truly secured the document it would also secure it from tampering and there would be no need for the signature. The more serious problem, however, is that this does not protect either the signature or document from tampering. With this algorithm, only the session key for the symmetric cipher is encrypted using the signer’s private key. Anybody can use the public key to recover the session key. Therefore, it is straightforward for an attacker to recover the session key and use it to encrypt substitute documents and signatures to send to others in the sender’s name.

An algorithm that does work is to use a public key algorithm to encrypt only the signature. In particular, the hash value is encrypted using the signer’s private key, and anybody can check the signature using the public key. The signed document can be sent using any other encryption algorithm including none if it is a public document. If the document is modified the signature check will fail, but this is precisely what the signature check is supposed to catch. The Digital Signature Standard (DSA) is a public key signature algorithm that works as just described. DSA is the primary signing algorithm used in GnuPG.

Using GPG to Encrypt and Decrypt Files

Generate a key and check.

gpg --full-generate-key
gpg --list-secret-keys

Encrypt.

echo Hi > test.txt
gpg --encrypt --output test.txt.gpg --recipient me@davetang.org test.txt

Decrypt.

gpg --decrypt --output test2.txt test.txt.gpg

sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       httr_1.4.7        cli_3.6.2         knitr_1.46       
 [5] rlang_1.1.3       xfun_0.44         stringi_1.8.4     processx_3.8.4   
 [9] promises_1.3.0    jsonlite_1.8.8    glue_1.7.0        rprojroot_2.0.4  
[13] git2r_0.33.0      htmltools_0.5.8.1 httpuv_1.6.15     ps_1.7.6         
[17] sass_0.4.9        fansi_1.0.6       rmarkdown_2.27    jquerylib_0.1.4  
[21] tibble_3.2.1      evaluate_0.23     fastmap_1.2.0     yaml_2.3.8       
[25] lifecycle_1.0.4   whisker_0.4.1     stringr_1.5.1     compiler_4.4.0   
[29] fs_1.6.4          pkgconfig_2.0.3   Rcpp_1.0.12       rstudioapi_0.16.0
[33] later_1.3.2       digest_0.6.35     R6_2.5.1          utf8_1.2.4       
[37] pillar_1.9.0      callr_3.7.6       magrittr_2.0.3    bslib_0.7.0      
[41] tools_4.4.0       cachem_1.1.0      getPass_0.2-4