Descriptive Statistics Final Project

By Michael Eryan

Analysis of Cards: Project Description

"In this project, you will demonstrate what you have learned in this course by conducting an experiment dealing with drawing from a deck of playing cards and creating a writeup containing your findings."

In [13]:
#Load libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from scipy import stats
from ggplot import *
from scipy.stats import norm

%matplotlib inline

import os
os.chdir(r"C:\Users\rf\Google Drive\Education\R\Udacity\ud827")

d=pd.read_csv('deck.csv')
#print d

#Basic descriptive stats. describe() reports mean 6.54 and std 3.18, but 3.18 is the sample std (ddof=1).
#The deck is the whole population, so the population std (ddof=0) is used below.
#print d.describe()
print "Course quiz answers."
print "Measures of central tendency: Mean is",round(d['value'].mean(),2), "Median is",round(d['value'].median(),2)
print "Measure of spread: Population standard deviation is",round(d['value'].std(ddof=0),2)
print "Answers accepted by the autograder."
Course quiz answers.
Measures of central tendency: Mean is 6.54 Median is 7.0
Measure of spread: Population standard deviation is 3.15
Answers accepted by the autograder.
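
As a cross-check that does not need deck.csv, the same figures can be reproduced from the composition of a standard deck. The sketch below assumes aces count as 1 and jacks, queens and kings count as 10 (an assumption about the file, chosen because it reproduces the mean of 6.54). The names `card_values` and `deck_values` are illustrative, and the code uses only the Python 3 standard library rather than the pandas calls in the cell above.

```python
import statistics

# Assumed card values: ace = 1, 2-10 at face value, J/Q/K = 10 each.
card_values = list(range(1, 11)) + [10, 10, 10]   # one suit, 13 cards
deck_values = card_values * 4                      # four suits, 52 cards

mean = statistics.mean(deck_values)
median = statistics.median(deck_values)
pop_sd = statistics.pstdev(deck_values)     # divides by N (population)
sample_sd = statistics.stdev(deck_values)   # divides by N - 1 (sample)

print("mean {:.2f}, median {:.1f}".format(mean, median))
print("population sd {:.2f}, sample sd {:.2f}".format(pop_sd, sample_sd))
```

Under that assumption the population standard deviation rounds to 3.15 and the sample version to 3.18, which is the difference the code comment above refers to.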

Question 1: Plotting a histogram of card values (whole deck)

I chose to write my own code to answer the questions below.

In [14]:
#Question 1: Plotting a histogram  of card values (whole deck)
print ggplot(aes(x='value'), data=d) +\
  geom_histogram(binwidth = 1, fill='steelblue') +\
  ggtitle('Histogram of card values') + xlab('Card Value') + ylab('Count')   
#The population distribution is not normal. It is roughly uniform, apart from the face cards.
<ggplot: (11791315)>

Question 1: Plotting a histogram of card values (whole deck)

The population distribution is not normal. It is closer to a uniform distribution, except that the face cards (the "named" cards) share a single value and so form a spike rather than a set of outliers.
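
The shape described above can also be seen without a plotting library. This is a minimal sketch that again assumes ace = 1 and face cards = 10 instead of reading deck.csv; it prints one bar of `#` characters per card value.

```python
from collections import Counter

# Assumed card values: ace = 1, 2-10 at face value, J/Q/K = 10 each.
deck_values = (list(range(1, 11)) + [10, 10, 10]) * 4

counts = Counter(deck_values)
for value in sorted(counts):
    # One '#' per card with this value
    print("{:>2} | {}".format(value, "#" * counts[value]))
```

With these assumed values, 1 to 9 each appear 4 times and 10 appears 16 times.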

In [6]:
#Question 2: Obtain samples from a deck of cards
#pull random samples from it now
#drawing a random sample - have mean, median, sum, std?

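def draw_mean(d, n=3):
    #take the first n rows of a random permutation: n distinct cards, no replacement within a sample
    one = d.take(np.random.permutation(len(d))[:n])
    values = one['value']
    mean = values.mean()
    median = values.median()
    #named total so the built-in sum() is not shadowed
    total = values.sum()
    std = values.std()
    return mean, median, total, std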
    
#draw_mean(d)

#Now create a loop to stack 1,000 of these into a table
def stack(d1,n1,m1):
    list1 = []
    for i in range(m1):
        list1.append(draw_mean(d1,n=n1))
    return list1

#four columns: mean, median, sum and std
d2 = DataFrame(stack(d,3,1000), columns=['mean', 'median','sum','std'])
#also export this dataframe of per-sample statistics
d2.to_csv('sample_means.csv')

Question 2: Obtain samples from a deck of cards

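I wrote my own functions to draw random samples of 3 cards from the deck. Within each sample the cards are drawn without replacement (np.random.permutation picks 3 distinct rows), and every sample starts again from the full deck. I calculated the sum and other statistics for each 3-card sample and wrote the results to "sample_means.csv" for the reviewer.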
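
The sampling scheme can be sketched with the standard library alone. This is not the notebook's pandas code: `random.sample` stands in for `np.random.permutation`, the deck is rebuilt under the assumption ace = 1 and face cards = 10, and the seed and the names `draw_sum` and `sums` are illustrative.

```python
import random

# Assumed card values: ace = 1, 2-10 at face value, J/Q/K = 10 each.
deck_values = (list(range(1, 11)) + [10, 10, 10]) * 4

def draw_sum(deck, n=3, rng=random):
    """Sum of n cards drawn without replacement from deck."""
    hand = rng.sample(deck, n)   # n distinct positions in the deck
    return sum(hand)

rng = random.Random(827)         # fixed seed so the run is repeatable
sums = [draw_sum(deck_values, 3, rng) for _ in range(1000)]

print("number of samples: {}".format(len(sums)))
print("smallest sum {}, largest sum {}".format(min(sums), max(sums)))
```

Because every call starts from the full list, the deck is effectively restored between hands, while the three cards inside one hand are always distinct cards.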

In [7]:
#Question 3: Report descriptive statistics regarding samples taken
#Measures of central tendency: mean and median
#print d2.describe()
print "Measures of central tendency of the sample sums: Mean is",round(d2['sum'].mean(),2), "Median is",round(d2['sum'].median(),2)
print "Measure of spread: Sample standard deviation is",round(np.std(d2['sum'], ddof=1),2), "Interquartile range is",round((d2['sum'].quantile(0.75) - d2['sum'].quantile(0.25)),2)
Measures of central tendency of the sample sums: Mean is 19.6 Median is 20.0
Measure of spread: Sample standard deviation is 5.29 Interquartile range is 7.0

Question 3: Report descriptive statistics regarding samples taken

The sums of the 1,000 3-card samples (each hand drawn without replacement from the full deck) have the following descriptive statistics:

  • Measures of central tendency: mean is 19.6 and median is 20.0
  • Measures of spread: sample standard deviation is 5.29 and interquartile range is 7.0
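
These simulated figures can be compared with what theory predicts. For n cards drawn without replacement from a population of N values with mean mu and population standard deviation sigma, the sum has mean n*mu and variance n*sigma^2*(N - n)/(N - 1), where the last factor is the finite population correction. The sketch below assumes ace = 1 and face cards = 10, and checks the formula against a full enumeration of all 3-card hands; all variable names are illustrative.

```python
import itertools
import math
import statistics

# Assumed card values: ace = 1, 2-10 at face value, J/Q/K = 10 each.
deck_values = (list(range(1, 11)) + [10, 10, 10]) * 4
N, n = len(deck_values), 3
mu = statistics.mean(deck_values)
sigma = statistics.pstdev(deck_values)

# Formula with the finite population correction
formula_mean = n * mu
formula_sd = math.sqrt(n * sigma ** 2 * (N - n) / (N - 1))

# Brute force: every one of the C(52, 3) = 22,100 possible hands
all_sums = [sum(hand) for hand in itertools.combinations(deck_values, n)]
exact_mean = statistics.mean(all_sums)
exact_sd = statistics.pstdev(all_sums)

print("formula:     mean {:.2f}, sd {:.2f}".format(formula_mean, formula_sd))
print("enumeration: mean {:.2f}, sd {:.2f}".format(exact_mean, exact_sd))
```

Under these assumptions the sum has a mean of about 19.62 and a standard deviation of about 5.35, close to the simulated 19.6 and 5.29.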
In [8]:
#Question 4: Plotting a histogram of sampled values (1,000 samples)
#print d2['sum'].hist()
print ggplot(aes(x='sum'), data=d2) +\
  geom_histogram(binwidth = 1, fill='steelblue') +\
  xlim(3,30) +\
  ggtitle('Histogram of sample sums') + xlab('Sum of 3 cards') + ylab('Count')  
<ggplot: (11754097)>

Question 4: Plotting a histogram of sampled values (1,000 samples)

The distribution of the sums of the random 3-card samples drawn from the deck (that is, from the population) appears roughly normal, whereas the distribution of the population itself is not normal.

This is an illustration of the central limit theorem: "when independent random variables are added, their sum tends toward a normal distribution (commonly known as a bell curve) even if the original variables themselves are not normally distributed"

Source: https://en.wikipedia.org/wiki/Central_limit_theorem
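
One way to see this tendency without plotting is to track skewness, which is 0 for a normal distribution. The sketch below is a minimal illustration, assuming ace = 1 and face cards = 10: it enumerates every possible hand of 1 to 4 cards and computes the skewness of the sum. `skewness` is a helper defined here, not a library call.

```python
import itertools

# Assumed card values: ace = 1, 2-10 at face value, J/Q/K = 10 each.
deck_values = (list(range(1, 11)) + [10, 10, 10]) * 4

def skewness(xs):
    """Population skewness: third central moment divided by sd cubed."""
    m = sum(xs) / len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5

skew_by_size = {}
for n in (1, 2, 3, 4):
    # All possible hands of n cards, each equally likely
    sums = [sum(hand) for hand in itertools.combinations(deck_values, n)]
    skew_by_size[n] = skewness(sums)
    print("hand size {}: skewness {:.3f}".format(n, skew_by_size[n]))
```

Over these hand sizes the skewness stays negative, because the spike at 10 leaves a longer tail on the low side, but its magnitude shrinks as cards are added.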

In [12]:
#Question 5: Making estimates based on the sampled distribution.
#This asks about the distribution of sums, not the average value of sums. 
#So, just use the standard deviation from this distribution. 

#need the mean and the standard deviation of the sample sums
mean1 = d2['sum'].mean()
#standard deviation of the sums themselves (not a standard error of the mean)
sd1 = np.std(d2['sum'], ddof=1)

#Answer 5A: the 90% confidence interval
#PPF is the percent point function (the inverse of the CDF); with loc and scale set it returns percentiles in the units of the sums
cu = norm.ppf(0.95, loc=mean1, scale=sd1)
cl = norm.ppf(0.05, loc=mean1, scale=sd1)
print "This distribution has a mean of",round(mean1,2),"and standard deviation of",round(sd1,2)
print "So, the 90% confidence interval for future random draws of 3 cards is (",round(cl,2),",",round(cu,2),")"

#Answer 5B: the probability we will get a draw with a sum of at least 20
#CDF is the cumulative distribution function - the area to the left of the given value
prob_ge20 = 1.0 - norm.cdf(20, loc=mean1, scale=sd1)
print "The probability of drawing 3 cards with a sum of at least 20 is about",round(prob_ge20,2)
#I chose these functions from the scipy.stats library for convenience.
#I could also have calculated z-stat=(hypothesized - mean)/std and then looked up the values in a z-table.
#But the PPF and CDF functions above do that for me when I provide the needed parameters.
This distribution has a mean of 19.6 and standard deviation of 5.29
So, the 90% confidence interval for future random draws of 3 cards is ( 10.89 , 28.3 )
The probability of drawing 3 cards with a sum of at least 20 is about 0.47

Question 5: Making estimates based on the sampled distribution.

Note: these are analytical estimates that assume the sample sums are normally distributed, which is a fairly strong assumption given the histogram above.

Answer 5A: the 90% confidence interval

  • This distribution has a mean of 19.6 and standard deviation of 5.29
  • So, the interval expected to contain about 90% of future random draws of 3 cards is ( 10.89 , 28.3 ). Strictly speaking, this is a range for individual sums rather than a confidence interval for the mean. In other words, if we draw another 1,000 samples of 3 cards, about 900 of them are expected to have a sum between 10.89 and 28.3.
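
The same bounds follow from the z-value by hand: the 5th and 95th percentiles of a normal distribution sit about 1.645 standard deviations either side of the mean. The sketch below uses `statistics.NormalDist` from the Python 3.8+ standard library in place of `norm.ppf`, and takes the rounded mean and standard deviation reported above as inputs, so the last digit can differ slightly from the notebook output.

```python
from statistics import NormalDist

mean1, sd1 = 19.6, 5.29          # rounded values from the output above

z = NormalDist().inv_cdf(0.95)   # z-value that leaves 5% in the upper tail
lower = mean1 - z * sd1
upper = mean1 + z * sd1

print("z = {:.3f}".format(z))
print("middle 90% of sums: ({:.2f}, {:.2f})".format(lower, upper))
```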

Answer 5B: the probability that a sample has a sum of at least 20

  • The estimated probability of drawing 3 cards with a sum of at least 20 is 0.47. In other words, about 47% of samples are expected to have a sum of at least 20.

Note that I chose the functions from the scipy.stats library for convenience. I could also have calculated z-stat=(hypothesized - mean)/std manually and then looked up the values in a z-table. But the PPF and CDF functions above do that for me when I provide the needed parameters.
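
The manual route looks like this. The sketch uses `statistics.NormalDist` (Python 3.8+) as a stand-in for the z-table, with the rounded mean and standard deviation from the output above. It also shows a continuity correction, which is an addition not used in the notebook: because sums are whole numbers, "at least 20" is often approximated as "above 19.5" on the continuous normal curve.

```python
from statistics import NormalDist

mean1, sd1 = 19.6, 5.29                 # rounded values from the output above
std_normal = NormalDist()               # plays the role of the z-table

z = (20 - mean1) / sd1                  # z-stat = (hypothesized - mean) / std
p_at_least_20 = 1 - std_normal.cdf(z)   # area to the right of z

z_cc = (19.5 - mean1) / sd1             # same, with the continuity correction
p_cc = 1 - std_normal.cdf(z_cc)

print("z = {:.3f}, P(sum >= 20) = {:.2f}".format(z, p_at_least_20))
print("with continuity correction: z = {:.3f}, P = {:.2f}".format(z_cc, p_cc))
```

The uncorrected value matches the 0.47 reported above. The corrected value comes out slightly above one half, which is consistent with the sample median of 20.0.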

The End