2019-06-12-vinos

Last updated: 2019-06-13

Checks: 7 0

Knit directory: wflow-datos-de-miercoles/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20190612)

The command set.seed(20190612) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 8115b8a

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    docs/figure/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	8115b8a	John Blischak	2019-06-13	La correlación entre calidad y precio por viña.
html	d55cbc1	John Blischak	2019-06-12	Build site.
Rmd	fc40bd1	John Blischak	2019-06-12	empezar

Introducción

Preparación

library(stringr)
library(tidyverse)
library(cowplot)
theme_set(theme_cowplot())

vinos <- read_csv("data/vinos.csv")

Exploración

head(vinos)

# A tibble: 6 x 10
  pais   nombre  puntos precio provincia  region_1 region_2 variedad vina 
  <chr>  <chr>    <int>  <int> <chr>      <chr>    <chr>    <chr>    <chr>
1 Italia Vulkà …     87     NA Sicily & … Etna     <NA>     Ensambl… Nico…
2 Portu… Avidag…     87     15 Douro      <NA>     <NA>     Portugu… Quin…
3 Estad… <NA>        87     14 Oregon     Willame… Willame… Pinot G… Rain…
4 Estad… Reserv…     87     13 Michigan   Lake Mi… <NA>     Riesling St. …
5 Estad… Vintne…     87     65 Oregon     Willame… Willame… Pinot N… Swee…
6 España Ars In…     87     15 Northern … Navarra  <NA>     Tempran… Tand…
# ... with 1 more variable: titulo_resena <chr>

plot(vinos$puntos, vinos$precio)

dim(vinos)

[1] 129971     10

unique(vinos$pais)

 [1] "Italia"               "Portugal"             "Estados Unidos"      
 [4] "España"               "Francia"              "Alemania"            
 [7] "Argentina"            "Chile"                "Australia"           
[10] "Austria"              "Sudáfrica"            "Nueva Zelanda"       
[13] "Israel"               "Hungría"              "Grecia"              
[16] "Rumania"              "México"               "Canadá"              
[19] NA                     "Turquía"              "República Checa"     
[22] "Eslovenia"            "Luxemburgo"           "Croacia"             
[25] "Georgia"              "Uruguay"              "Inglaterra"          
[28] "Líbano"               "Serbia"               "Brazil"              
[31] "Moldavia"             "Marruecos"            "Perú"                
[34] "India"                "Bulgaria"             "Chipre"              
[37] "Armenia"              "Suiza"                "Bosnia y Herzegovina"
[40] "Ucrania"              "Eslovaquia"           "Macedonia"           
[43] "China"                "Egipto"

head(vinos$titulo_resena)

[1] "Nicosia 2013 Vulkà Bianco  (Etna)"                                                  
[2] "Quinta dos Avidagos 2011 Avidagos Red (Douro)"                                      
[3] "Rainstorm 2013 Pinot Gris (Willamette Valley)"                                      
[4] "St. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan Shore)"                
[5] "Sweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot Noir (Willamette Valley)"
[6] "Tandem 2011 Ars In Vitro Tempranillo-Merlot (Navarra)"

País de origen

head(vinos$pais)

[1] "Italia"         "Portugal"       "Estados Unidos" "Estados Unidos"
[5] "Estados Unidos" "España"

length(unique(vinos$pais))

[1] 44

table(vinos$pais, useNA = "ifany")


            Alemania            Argentina              Armenia 
                2165                 3800                    2 
           Australia              Austria Bosnia y Herzegovina 
                2329                 3345                    2 
              Brazil             Bulgaria               Canadá 
                  52                  141                  257 
               Chile                China               Chipre 
                4472                    1                   11 
             Croacia               Egipto           Eslovaquia 
                  73                    1                    1 
           Eslovenia               España       Estados Unidos 
                  87                 6645                54504 
             Francia              Georgia               Grecia 
               22093                   86                  466 
             Hungría                India           Inglaterra 
                 146                    9                   74 
              Israel               Italia               Líbano 
                 505                19540                   35 
          Luxemburgo            Macedonia            Marruecos 
                   6                   12                   28 
              México             Moldavia        Nueva Zelanda 
                  70                   59                 1419 
                Perú             Portugal      República Checa 
                  16                 5691                   12 
             Rumania               Serbia            Sudáfrica 
                 120                   12                 1401 
               Suiza              Turquía              Ucrania 
                   7                   90                   14 
             Uruguay                 <NA> 
                 109                   63

sum(table(vinos$pais) > 1000)

[1] 12

sum(is.na(vinos$pais))

[1] 63

Nombre del vino

head(vinos$nombre)

[1] "Vulkà Bianco"                      
[2] "Avidagos"                          
[3] NA                                  
[4] "Reserve Late Harvest"              
[5] "Vintner's Reserve Wild Child Block"
[6] "Ars In Vitro"

length(unique(vinos$nombre))

[1] 37980

# Cuantos vinos con más que 5 reseñas?
sum(table(vinos$nombre) > 5)

[1] 1995

sum(is.na(vinos$nombre))

[1] 37465

Puntos con que fue calificado (escala de 1 a 100)

head(vinos$puntos)

[1] 87 87 87 87 87 87

summary(vinos$puntos)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  80.00   86.00   88.00   88.45   91.00  100.00

hist(vinos$puntos)

stopifnot(is.integer(vinos$puntos))
sum(is.na(vinos$puntos))

[1] 0

Precio de la botella (en dólares estadounidenses)

head(vinos$precio)

[1] NA 15 14 13 65 15

summary(vinos$precio)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   4.00   17.00   25.00   35.36   42.00 3300.00    8996

hist(vinos$precio)

stopifnot(is.integer(vinos$precio))
sum(is.na(vinos$precio))

[1] 8996

sum(vinos$precio > 1000, na.rm = TRUE)

[1] 14

Variedad (por ejemplo, Pinot Noir, Cabernet Sauvignon, etc.)

head(vinos$variedad)

[1] "Ensamblaje Blanco"  "Portuguese Red"     "Pinot Gris"        
[4] "Riesling"           "Pinot Noir"         "Tempranillo-Merlot"

length(unique(vinos$variedad))

[1] 708

sum(table(vinos$variedad) > 1000)

[1] 28

sum(is.na(vinos$variedad))

[1] 1

Nombre de la viña que produce el vino

head(vinos$vina)

[1] "Nicosia"             "Quinta dos Avidagos" "Rainstorm"          
[4] "St. Julian"          "Sweet Cheeks"        "Tandem"

length(unique(vinos$vina))

[1] 16757

sum(table(vinos$vina) > 100)

[1] 33

sum(is.na(vinos$vina))

[1] 0

Título de la reseña. No están traducidos al español. En la mayoría de los casos solo indica el año de cosecha y el nombre del vino

head(vinos$titulo_resena)

[1] "Nicosia 2013 Vulkà Bianco  (Etna)"                                                  
[2] "Quinta dos Avidagos 2011 Avidagos Red (Douro)"                                      
[3] "Rainstorm 2013 Pinot Gris (Willamette Valley)"                                      
[4] "St. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan Shore)"                
[5] "Sweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot Noir (Willamette Valley)"
[6] "Tandem 2011 Ars In Vitro Tempranillo-Merlot (Navarra)"

length(unique(vinos$titulo_resena))

[1] 118840

sum(table(vinos$titulo_resena) > 5)

[1] 13

sum(is.na(vinos$titulo_resena))

[1] 0

Tenemos el año de cosecha, pero no el año de la reseña.

ano <- str_extract(vinos$titulo_resena, "[12][09][0-9]{2}")
ano <- as.numeric(ano)
summary(ano)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1000    2009    2011    2011    2013    2017    4626

hist(ano)

Limpieza

Remover las comillas en los nombres de los vinos.

vinos %>% filter(str_detect(nombre, "^‘"))

# A tibble: 2 x 10
  pais   nombre  puntos precio provincia  region_1 region_2 variedad vina 
  <chr>  <chr>    <int>  <int> <chr>      <chr>    <chr>    <chr>    <chr>
1 Estad… ‘Rough…     89     20 Washington Columbi… Columbi… Ensambl… Barr…
2 Estad… ‘S'         92     45 Oregon     Chehale… Willame… Pinot N… Lach…
# ... with 1 more variable: titulo_resena <chr>

vinos %>% filter(nombre == "'S'" | nombre == "‘S'")

# A tibble: 2 x 10
  pais   nombre puntos precio provincia region_1  region_2  variedad vina 
  <chr>  <chr>   <int>  <int> <chr>     <chr>     <chr>     <chr>    <chr>
1 Estad… 'S'        89     50 Oregon    Chehalem… Willamet… Pinot N… Lach…
2 Estad… ‘S'        92     45 Oregon    Chehalem… Willamet… Pinot N… Lach…
# ... with 1 more variable: titulo_resena <chr>

vinos$nombre <- str_replace_all(vinos$nombre, "‘", "")
vinos$nombre <- str_replace_all(vinos$nombre, "'", "")
sum(str_detect(vinos$nombre, "‘"), na.rm = TRUE)

[1] 0

Guardar las viñas con más de 15 vinos con reseñas.

guarda_vinas <- vinas <- vinos %>%
  select(nombre, vina) %>%
  na.omit() %>%
  group_by(vina) %>%
  summarize(n_vinos = length(unique(nombre))) %>%
  filter(n_vinos > 15)

v <- vinos %>%
  select(nombre, puntos, precio, variedad, vina) %>%
  na.omit() %>%
  semi_join(guarda_vinas, by = "vina")
dim(v)

[1] 21903     5

stopifnot(length(intersect(guarda_vinas$vina, v$vina)) == nrow(guarda_vinas))

Calcula el promedio de las reseñas para el mismo vino:

v <- v %>%
  group_by(nombre, variedad, vina) %>%
  summarize(puntos = mean(puntos),
            precio = mean(precio)) %>%
  ungroup()

La correlación entre calidad y precio por viña

v_corr <- v %>%
  group_by(vina) %>%
  summarize(n = n(),
            corr = cor(puntos, precio))

Warning in cor(puntos, precio): the standard deviation is zero

p_hist <- ggplot(v_corr, aes(x = corr)) +
  geom_histogram(binwidth = 0.025) +
  labs(x = "Correlación entre calidad y precio",
       y = "Número de viñas",
       title = "Distribución de correlación por viña")
p_hist

Warning: Removed 1 rows containing non-finite values (stat_bin).

Lo más positivo.

(vina_pos <- v_corr %>% filter(corr == max(corr, na.rm = TRUE)))

# A tibble: 1 x 3
  vina                   n  corr
  <chr>              <int> <dbl>
1 Marchesi di Barolo    17 0.891

p_pos <- v %>%
  filter(vina == vina_pos$vina) %>%
  ggplot(aes(x = puntos, y = precio)) +
  geom_point() +
  labs(x = "Calidad (puntos 1 a 100)",
       y = "Precio (USD)",
       title = vina_pos$vina,
       subtitle = sprintf("Correlación: %.2f", vina_pos$corr))
p_pos

Lo más negativo.

(vina_neg <- v_corr %>% filter(corr == min(corr, na.rm = TRUE)))

# A tibble: 1 x 3
  vina        n   corr
  <chr>   <int>  <dbl>
1 Sanguis    21 -0.442

p_neg <- v %>%
  filter(vina == vina_neg$vina) %>%
  ggplot(aes(x = puntos, y = precio)) +
  geom_point() +
  labs(x = "Calidad (puntos 1 a 100)",
       y = "Precio (USD)",
       title = vina_neg$vina,
       subtitle = sprintf("Correlación: %.2f", vina_neg$corr))
p_neg

Ningún relación.

(vina_nin <- v_corr %>% filter(is.na(corr)))

# A tibble: 1 x 3
  vina                       n  corr
  <chr>                  <int> <dbl>
1 Spring Valley Vineyard    20    NA

p_nin <- v %>%
  filter(vina == vina_nin$vina) %>%
  ggplot(aes(x = puntos, y = precio)) +
  geom_point() +
  labs(x = "Calidad (puntos 1 a 100)",
       y = "Precio (USD)",
       title = vina_nin$vina,
       subtitle = "No correlación")
p_nin

plot_grid(p_hist, p_pos, p_neg, p_nin)

Warning: Removed 1 rows containing non-finite values (stat_bin).

sessionInfo()

R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2  cowplot_0.9.2   forcats_0.3.0   dplyr_0.7.7    
 [5] purrr_0.2.5     readr_1.1.1     tidyr_0.8.1     tibble_1.4.2   
 [9] ggplot2_3.1.0   tidyverse_1.2.1 stringr_1.4.0  

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.3  xfun_0.3          reshape2_1.4.3   
 [4] haven_1.1.0       lattice_0.20-35   colorspace_1.3-2 
 [7] htmltools_0.3.6   yaml_2.2.0        utf8_1.1.3       
[10] rlang_0.3.0.1     pillar_1.2.2      withr_2.1.2.9000 
[13] foreign_0.8-69    glue_1.2.0.9000   modelr_0.1.2     
[16] readxl_1.0.0      bindr_0.1.1       plyr_1.8.4       
[19] munsell_0.5.0     gtable_0.2.0      workflowr_1.4.0  
[22] cellranger_1.1.0  rvest_0.3.2       psych_1.8.4      
[25] evaluate_0.10.1   labeling_0.3      knitr_1.21       
[28] parallel_3.3.3    broom_0.4.4       Rcpp_1.0.0       
[31] backports_1.1.2   scales_0.5.0      jsonlite_1.5     
[34] fs_1.2.6          mnormt_1.5-5      hms_0.4.2        
[37] digest_0.6.13     stringi_1.3.1     grid_3.3.3       
[40] rprojroot_1.3-2   cli_1.0.0         tools_3.3.3      
[43] magrittr_1.5      lazyeval_0.2.1    crayon_1.3.4     
[46] whisker_0.3-2     pkgconfig_2.0.1   xml2_1.1.1       
[49] lubridate_1.7.1   assertthat_0.2.0  rmarkdown_1.10.14
[52] httr_1.3.1        rstudioapi_0.8    htmldeps_0.1.1   
[55] R6_2.2.2          nlme_3.1-131      git2r_0.25.2.9008