SQL

Codecademy
DataQuest
- Introduction to SQL
- Joining Data in SQL
Further resources

Last updated: 2019-08-20


This reproducible R Markdown analysis was created with workflowr (version 1.4.0.9000).


These are the previous versions of the R Markdown and HTML files.

File	Version	Author	Date	Message
Rmd	f2b843a	John Blischak	2019-08-21	Add notes on SQL.

Codecademy

SELECT * FROM movies;

SELECT name, genre
FROM movies;

SELECT name AS 'Titles'
FROM movies;

SELECT DISTINCT genre
FROM movies;

SELECT *
FROM movies
WHERE imdb_rating < 5;

-- _ = wildcard that matches exactly one character
SELECT *
FROM movies
WHERE name LIKE 'Se_en';

-- % = wildcard that matches zero or more characters
SELECT *
FROM movies
WHERE name LIKE '%man%';
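
To see the two wildcards side by side, here is a minimal sketch that runs both patterns against a throwaway in-memory SQLite database through Python's sqlite3 module. The rows are made up for illustration; only the movies table and its name column come from the notes above. SQLite is an assumption here, and its LIKE ignores ASCII case by default, which is not true of every database.

```python
import sqlite3

# Hypothetical sample rows; only the table and column names come from the notes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?)",
    [("Se7en",), ("Seven",), ("Seventeen",),
     ("Batman",), ("The Human Stain",), ("Manhattan",)],
)

def names(pattern):
    """Return the sorted movie names that match a LIKE pattern."""
    rows = conn.execute(
        "SELECT name FROM movies WHERE name LIKE ? ORDER BY name", (pattern,)
    )
    return [row[0] for row in rows]

underscore_matches = names("Se_en")  # _ stands for exactly one character
percent_matches = names("%man%")     # % stands for zero or more characters

print(underscore_matches)
print(percent_matches)
```

'Seventeen' is not matched by 'Se_en' because the underscore can only stand in for a single character, while '%man%' matches 'man' anywhere in the name, including at the very start or end.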

SELECT name
FROM movies
WHERE imdb_rating IS NOT NULL;
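
The reason for IS NOT NULL rather than != NULL is that any ordinary comparison with NULL evaluates to NULL, which WHERE treats as not true. The sketch below illustrates this with two made-up rows in an in-memory SQLite database (both the data and the choice of SQLite are assumptions for illustration).

```python
import sqlite3

# Hypothetical rows: one movie with a rating, one without.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, imdb_rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Rated", 7.5), ("Unrated", None)],  # None is stored as NULL
)

def query(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

not_null = query("SELECT name FROM movies WHERE imdb_rating IS NOT NULL")
is_null = query("SELECT name FROM movies WHERE imdb_rating IS NULL")
# Comparing with = or != against NULL never evaluates to true,
# so both of these return no rows at all.
equals_null = query("SELECT name FROM movies WHERE imdb_rating = NULL")
not_equals_null = query("SELECT name FROM movies WHERE imdb_rating != NULL")

print(not_null, is_null, equals_null, not_equals_null)
```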

SELECT *
FROM movies
WHERE year BETWEEN 1990 AND 1999;

-- Would include a movie named 'J' but not 'JAWS'
SELECT *
FROM movies
WHERE name BETWEEN 'A' AND 'J';
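
The behavior with text follows from string comparison: 'JAWS' sorts after 'J', so it falls outside the upper bound. The sketch below checks this on made-up rows in an in-memory SQLite database, and also shows one possible way to include every name starting with J by using a half-open range instead. The data, the SQLite default (byte-wise, case-sensitive) ordering, and the alternative query are assumptions for illustration.

```python
import sqlite3

# Hypothetical movie names chosen to probe the range boundaries.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?)",
    [("A",), ("Alien",), ("Inception",), ("J",), ("JAWS",), ("Jaws",), ("Zodiac",)],
)

def names(where_clause):
    """Return sorted movie names that satisfy the given WHERE clause."""
    sql = f"SELECT name FROM movies WHERE {where_clause} ORDER BY name"
    return [row[0] for row in conn.execute(sql)]

# BETWEEN is inclusive, but 'JAWS' > 'J' as strings, so it is excluded.
between_a_j = names("name BETWEEN 'A' AND 'J'")

# A half-open range up to (but not including) 'K' keeps all J names.
a_through_j = names("name >= 'A' AND name < 'K'")

print(between_a_j)
print(a_through_j)
```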

SELECT *
FROM movies
WHERE year BETWEEN 1970 AND 1979;

SELECT *
FROM movies
WHERE year BETWEEN 1970 AND 1979
   AND imdb_rating > 8;

SELECT *
FROM movies
WHERE year > 2014
   OR genre = 'action';

Note: ORDER BY always goes after WHERE (if WHERE is present).

SELECT name, year
FROM movies
ORDER BY name;

SELECT name, year, imdb_rating
FROM movies
ORDER BY imdb_rating DESC;

LIMIT always goes at the very end of the query. Not all SQL databases support it: SQL Server, for example, uses TOP instead.

SELECT *
FROM movies
ORDER BY imdb_rating DESC
LIMIT 3;
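
Putting the ordering rules together, the clauses run in the order WHERE, then ORDER BY, then LIMIT. The sketch below is a minimal illustration on made-up rows in an in-memory SQLite database (the data and the use of SQLite are assumptions): rows are filtered first, the survivors are sorted, and only then are the top three kept.

```python
import sqlite3

# Hypothetical movies; names, years, and ratings are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, year INTEGER, imdb_rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Movie A", 1999, 9.0),
        ("Movie B", 2005, 8.0),
        ("Movie C", 2010, 8.5),
        ("Movie D", 2015, 6.0),
        ("Movie E", 2020, 7.0),
    ],
)

# WHERE filters, ORDER BY sorts what is left, LIMIT truncates the sorted result.
top_three = conn.execute(
    """
    SELECT name, imdb_rating
    FROM movies
    WHERE year > 2000
    ORDER BY imdb_rating DESC
    LIMIT 3
    """
).fetchall()

print(top_three)
```

'Movie A' has the highest rating but never reaches the sort, because WHERE removes it before ORDER BY and LIMIT are applied.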

SELECT name,
 CASE
  WHEN genre = 'romance' THEN 'Chill'
  WHEN genre = 'comedy' THEN 'Chill'
  ELSE 'Intense'
 END AS 'Mood'
FROM movies;
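
As a rough check of how the CASE expression above is evaluated, the sketch below runs it on made-up rows in an in-memory SQLite database (data and database choice are assumptions). The WHEN branches are tested top to bottom, and a row whose genre is NULL matches neither branch, so it falls through to ELSE. The alias is written unquoted here, which is a stylistic choice rather than a requirement.

```python
import sqlite3

# Hypothetical rows, including one with a missing (NULL) genre.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, genre TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [
        ("Movie A", "romance"),
        ("Movie B", "comedy"),
        ("Movie C", "action"),
        ("Movie D", None),
    ],
)

# The first WHEN whose condition is true decides the output value;
# ELSE covers every row that matched none of them.
moods = conn.execute(
    """
    SELECT name,
     CASE
      WHEN genre = 'romance' THEN 'Chill'
      WHEN genre = 'comedy' THEN 'Chill'
      ELSE 'Intense'
     END AS mood
    FROM movies
    ORDER BY name
    """
).fetchall()

print(moods)
```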

SELECT is the clause we use every time we want to query information from a database.
AS renames a column or table.
DISTINCT returns unique values.
WHERE filters the results of the query based on conditions that you specify.
LIKE and BETWEEN are special operators for use in WHERE: LIKE matches patterns and BETWEEN matches ranges.
AND and OR combine multiple conditions.
ORDER BY sorts the result.
LIMIT specifies the maximum number of rows that the query will return.
CASE creates different outputs depending on conditions.

DataQuest

SQL Fundamentals for R Users course

Introduction to SQL

SELECT *
FROM recent_grads
LIMIT 10

SELECT Major, ShareWomen
FROM recent_grads
WHERE ShareWomen < 0.5

SELECT Major, Major_category, Median, ShareWomen
FROM recent_grads
WHERE ShareWomen > 0.5 AND Median > 50000

SELECT Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE Major_category = 'Engineering'
AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
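
The parentheses in the query above matter because AND binds more tightly than OR. The sketch below compares the query with and without them on a few made-up rows in an in-memory SQLite database. The rows are hypothetical; only the table and column names come from the notes.

```python
import sqlite3

# Hypothetical majors chosen so that the two versions of the filter disagree.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads "
    "(Major TEXT, Major_category TEXT, ShareWomen REAL, Unemployment_rate REAL)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?, ?)",
    [
        ("Eng A", "Engineering", 0.6, 0.08),
        ("Eng B", "Engineering", 0.2, 0.03),
        ("Eng C", "Engineering", 0.2, 0.09),
        ("Bio A", "Biology", 0.7, 0.02),
    ],
)

def majors(where_clause):
    """Return sorted majors that satisfy the given WHERE clause."""
    sql = f"SELECT Major FROM recent_grads WHERE {where_clause} ORDER BY Major"
    return [row[0] for row in conn.execute(sql)]

with_parens = majors(
    "Major_category = 'Engineering' "
    "AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)"
)
# Without parentheses this is read as
# (Major_category = 'Engineering' AND ShareWomen > 0.5) OR Unemployment_rate < 0.051
without_parens = majors(
    "Major_category = 'Engineering' "
    "AND ShareWomen > 0.5 OR Unemployment_rate < 0.051"
)

print(with_parens)
print(without_parens)
```

Without the parentheses, the non-engineering major 'Bio A' slips in purely because of its low unemployment rate.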

SELECT Major, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE ShareWomen > 0.3 AND Unemployment_rate < 0.1
ORDER BY ShareWomen DESC

SELECT Major_category, Major, Unemployment_rate
FROM recent_grads
WHERE Major_category = 'Engineering' OR Major_category = 'Physical Sciences'
ORDER BY Unemployment_rate

SELECT c.*, f.name AS 'country_name'
FROM facts AS f
INNER JOIN cities AS c
ON c.facts_id = f.id
LIMIT 5;

Most DBMSs require that the SELECT and FROM clauses come first, followed by any other clauses such as WHERE.
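
A quick way to see the ordering requirement is to swap two clauses and watch the parser reject the query. The sketch below does this in an in-memory SQLite database, which is an assumption for illustration: other systems will word the error differently, and the single row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.execute("INSERT INTO recent_grads VALUES ('Hypothetical Major', 0.9)")

# Clauses in the expected order: SELECT, FROM, then WHERE.
ok = conn.execute(
    "SELECT Major FROM recent_grads WHERE ShareWomen > 0.5"
).fetchall()

# The same clauses with WHERE placed before FROM fail to parse.
try:
    conn.execute("SELECT Major WHERE ShareWomen > 0.5 FROM recent_grads")
    misordered_error = None
except sqlite3.OperationalError as err:
    misordered_error = str(err)

print(ok)
print(misordered_error)
```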

Joining Data in SQL

Syntax for an inner join:

SELECT [column_names] FROM [table_name_one]
INNER JOIN [table_name_two] ON [join_constraint];
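
As a concrete instance of this template, the sketch below builds tiny facts and cities tables in an in-memory SQLite database and joins them. The rows and the reduced set of columns are made up for illustration; only the table names and the facts_id = id join constraint come from the notes. An inner join keeps only rows that have a match on both sides.

```python
import sqlite3

# Hypothetical, cut-down versions of the facts and cities tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE cities (name TEXT, facts_id INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [(1, "Country A"), (2, "Country B"), (3, "Country C")],
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("City A1", 1), ("City A2", 1), ("City B1", 2), ("City X", 99)],
)

joined = conn.execute(
    """
    SELECT f.name, c.name
    FROM facts f
    INNER JOIN cities c ON c.facts_id = f.id
    ORDER BY f.name, c.name
    """
).fetchall()

print(joined)
```

'Country C' has no cities and 'City X' points at a country id that does not exist, so neither appears in the result.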

SELECT *
FROM facts
INNER JOIN cities
ON cities.facts_id = facts.id
LIMIT 10;

SELECT f.name AS country, c.name AS capital_city
FROM cities c
INNER JOIN facts f ON f.id = c.facts_id
WHERE c.capital = 1;

SELECT f.name AS country, f.population
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
WHERE c.name IS NULL;
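
The LEFT JOIN plus IS NULL combination above finds rows in the left table that have no match in the right table. The sketch below shows both steps on made-up rows in an in-memory SQLite database (data and column subset are assumptions): first the full left join, where unmatched countries get NULL for the city columns, then the filtered version.

```python
import sqlite3

# Hypothetical, cut-down versions of the facts and cities tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE cities (name TEXT, facts_id INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [(1, "Country A"), (2, "Country B"), (3, "Country C")],
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("City A1", 1), ("City A2", 1), ("City B1", 2)],
)

# Every row of facts is kept; missing city values come back as None (NULL).
all_rows = conn.execute(
    """
    SELECT f.name, c.name
    FROM facts f
    LEFT JOIN cities c ON c.facts_id = f.id
    ORDER BY f.name, c.name
    """
).fetchall()

# Keeping only the NULL rows leaves the countries without any city.
without_cities = conn.execute(
    """
    SELECT f.name
    FROM facts f
    LEFT JOIN cities c ON c.facts_id = f.id
    WHERE c.name IS NULL
    """
).fetchall()

print(all_rows)
print(without_cities)
```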

SELECT c.name AS capital_city, f.name AS country, c.population
FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
WHERE c.capital = 1
ORDER BY c.population DESC
LIMIT 10;

SELECT c.name capital_city, f.name country, c.population
FROM facts f
INNER JOIN (
            SELECT * FROM cities
            WHERE capital = 1
               AND population > 10000000
            ) c ON c.facts_id = f.id
ORDER BY c.population DESC
LIMIT 10;
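
The subquery version above filters cities before joining. For an inner join, filtering in a WHERE clause after the join should give the same rows, and the sketch below checks that on made-up data in an in-memory SQLite database. The rows, populations, and the flat rewrite are assumptions for illustration, not part of the course material.

```python
import sqlite3

# Hypothetical countries and cities with invented populations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE cities (name TEXT, population INTEGER, capital INTEGER, facts_id INTEGER)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [(1, "Country A"), (2, "Country B"), (3, "Country C")],
)
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?, ?)",
    [
        ("Capital A", 12000000, 1, 1),
        ("Big Town A", 15000000, 0, 1),  # large, but not a capital
        ("Capital B", 5000000, 1, 2),    # capital, but too small
        ("Capital C", 20000000, 1, 3),
    ],
)

# Filter inside a subquery, then join (the form used in the notes).
with_subquery = conn.execute(
    """
    SELECT c.name capital_city, f.name country, c.population
    FROM facts f
    INNER JOIN (
                SELECT * FROM cities
                WHERE capital = 1
                   AND population > 10000000
                ) c ON c.facts_id = f.id
    ORDER BY c.population DESC
    LIMIT 10
    """
).fetchall()

# Join first, then filter in WHERE.
with_where = conn.execute(
    """
    SELECT c.name capital_city, f.name country, c.population
    FROM facts f
    INNER JOIN cities c ON c.facts_id = f.id
    WHERE c.capital = 1
       AND c.population > 10000000
    ORDER BY c.population DESC
    LIMIT 10
    """
).fetchall()

print(with_subquery)
```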

Further resources

SQLZOO


John Blischak

2019-08-20
