Classification

Co-authors

Prerequisites

Outcomes

  • Understand what problems classification solves
  • Evaluate classification models using a variety of metrics
In [1]:
# Uncomment following line to install on colab
#! pip install qeds fiona geopandas xgboost gensim folium pyLDAvis descartes

Introduction to Classification

We now move from regression to the second main branch of machine learning: classification.

Recall that the regression problem mapped a set of feature variables to a continuous target.

Classification is similar to regression, but instead of predicting a continuous target, classification algorithms attempt to apply one (or more) of a discrete number of labels or classes to each observation.

Another perspective is that for regression, the targets are usually continuous-valued, while in classification, the targets are categorical.

Common examples of classification problems are

  • Labeling emails as spam or not spam
  • Person identification in a photo
  • Speech recognition
  • Whether or not a country is or will be in a recession

Classification can also be applied in settings where the target isn’t naturally categorical.

For example, suppose we want to predict whether the unemployment rate for a state will be low ($ <3\% $), medium ($ \in [3\%, 5\%] $), or high ($ >5\% $) but don’t care about the actual number.

Most economic problems are posed in continuous terms, so it may take some creativity to determine the optimal way to categorize a target variable so classification algorithms can be applied.
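To make the unemployment example concrete, here is a minimal sketch (the unemployment values below are made up) of discretizing a continuous series into the three categories with pandas; exactly how to treat the interval endpoints is a modeling choice.

import pandas as pd

# hypothetical quarterly unemployment rates, in percent
unemp = pd.Series([2.8, 3.4, 4.9, 5.6, 6.1])

# bucket into low (<3%), medium (3%-5%), high (>5%); endpoint handling here is approximate
unemp_class = pd.cut(unemp, bins=[-float("inf"), 3, 5, float("inf")], labels=["low", "medium", "high"])
print(unemp_class)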

As many problems can be posed either as classification or regression, many machine learning algorithms have variants that perform regression or classification tasks.

Throughout this lecture, we will revisit some of the algorithms from the regression lecture and discuss how they can be applied in classification settings.

Because we have already seen close relatives of these algorithms, this lecture will be lighter on exposition and will build up to an application.

In [2]:
import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import pandas_datareader.data as web

from sklearn import (
    linear_model, metrics, neural_network, pipeline, preprocessing, model_selection
)

import matplotlib.pyplot as plt
%matplotlib inline
# activate plot theme
import qeds
qeds.themes.mpl_style();

Warmup Example: Logistic Regression

We have actually already encountered a classification algorithm.

In the recidivism example, we attempted to predict whether or not an individual would commit another crime by using a combination of the assigned COMPAS score and the individual’s gender or race.

In that example, we used a logistic regression model, which is a close relative of the linear regression model from the regression section.

The logistic regression model for predicting the likelihood of recidivism using the COMPAS score as the single feature is written

$$ p(\text{recid}) = L(\beta_0 + \beta_1 \text{COMPAS} + \epsilon) $$

where $ L $ is the logistic function: $ L(x) = \frac{1}{1 + e^{-x}} $.

To get some intuition for this function, let’s plot it below.

In [3]:
x = np.linspace(-5, 5, 100)
y = 1/(1+np.exp(-x))
plt.plot(x, y)

Notice that for all values of $ x $, the value of the logistic function is always between 0 and 1.

This is perfect for binary classification problems that need to output the probability of one of the two labels.

Let’s load up the recidivism data and fit the logistic regression model.

In [4]:
data_url = "https://raw.githubusercontent.com/propublica/compas-analysis"
data_url += "/master/compas-scores-two-years.csv"

df = pd.read_csv(data_url)
df.head()

X = df[["decile_score"]]
y = df["two_year_recid"]
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25)

logistic_model = linear_model.LogisticRegression(solver="lbfgs")
logistic_model.fit(X_train, y_train)

beta_0 = logistic_model.intercept_[0]
beta_1 = logistic_model.coef_[0][0]

print(f"Fit model: p(recid) = L({beta_0:.4f} + {beta_1:.4f} decile_score)")
Fit model: p(recid) = L(-1.4105 + 0.2597 decile_score)

From these coefficients, we see that an increase in the decile_score leads to an increase in the predicted probability of recidivism.

Suppose we choose to classify any model output greater than 0.5 as “at risk of recidivism”.

Then, the positive coefficient on decile_score means that there is some cutoff score above which all individuals will be labeled as high-risk.

See exercise 1 in the exercise list

Visualization: Decision Boundaries

With just one feature that has a positive coefficient, the model’s predictions will always have this cutoff structure.

Let’s add a second feature to the model: the age of the individual.

In [5]:
X = df[["decile_score", "age"]]
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=42
)

logistic_age_model = linear_model.LogisticRegression(solver="lbfgs")
logistic_age_model.fit(X_train, y_train)

beta_0 = logistic_age_model.intercept_[0]
beta_1, beta_2 = logistic_age_model.coef_[0]

print(f"Fit model: p(recid) = L({beta_0:.4f} + {beta_1:.4f} decile_score + {beta_2:.4f} age)")
Fit model: p(recid) = L(-0.8505 + 0.2470 decile_score + -0.0130 age)

Here, we see that an increase in the decile_score still leads to an increase in the predicted probability of recidivism, while older individuals are slightly less likely to commit another crime.
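To see how the two features combine, here is a minimal sketch (the decile_score and age values are hypothetical) that evaluates the fitted model for a couple of example individuals.

# hypothetical individuals: same age, different COMPAS decile scores
examples = pd.DataFrame({"decile_score": [3, 8], "age": [25, 25]})

# probability of recidivism (second column of predict_proba) for each example
print(logistic_age_model.predict_proba(examples)[:, 1])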

We’ll build on an example from the scikit-learn documentation to visualize the predictions of this model.

In [6]:
def plot_contours(ax, mod, xx, yy, **params):
    """
    Plot the decision boundaries for a classifier with 2 features x and y.

    Parameters
    ----------
    ax: matplotlib axes object
    mod: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    Z = mod.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out

def fit_and_plot_decision_boundary(mod, X, y, **params):
    # fit model
    mod.fit(X, y)

    # generate grids of first two columns of X
    def gen_grid(xseries):
        if xseries.nunique() < 50:
            return sorted(xseries.unique())
        else:
            return np.linspace(xseries.min(), xseries.max(), 50)

    x1, x2 = np.meshgrid(gen_grid(X.iloc[:, 0]), gen_grid(X.iloc[:, 1]))

    # plot contours and scatter
    fig, ax = plt.subplots()
    plot_contours(ax, mod, x1, x2, **params)
    x1_name, x2_name = list(X)[:2]
    X.plot.scatter(x=x1_name, y=x2_name, color=y, ax=ax)
    ax.set_xlabel(x1_name)
    ax.set_ylabel(x2_name)

    return ax

fit_and_plot_decision_boundary(
    linear_model.LogisticRegression(solver="lbfgs"),
    X_train, y_train, cmap=plt.cm.Greys
)

In this plot, we can clearly see the relationships we identified from the coefficients.

However, we do see that the model is not perfect, as some solid circles are in the light section and some light circles in the solid section.

This is likely caused by two things:

  1. The model inside the logistic function is a linear regression – thus only a linear combination of the input features can be used for prediction.
  2. Even the best straight line (a linear decision boundary) cannot perfectly separate the true observations from the false ones.

See exercise 2 in the exercise list

Model Evaluation

Before we get too far into additional classification algorithms, let’s take a step back and think about how to evaluate the performance of a classification model.

Accuracy

Perhaps the most intuitive classification metric is accuracy, which is the fraction of correct predictions.

For a scikit-learn classifier, this can be computed using the score method.

In [7]:
train_acc = logistic_age_model.score(X_train, y_train)
test_acc = logistic_age_model.score(X_test, y_test)

train_acc, test_acc
Out[7]:
(0.6534195933456562, 0.667960088691796)

When the testing accuracy is similar to or higher than the training accuracy (as it is here), the model might be underfitting. Thus, we should consider either using a more powerful model or adding additional features.

In many contexts, this would be an appropriate way to evaluate a model, but in others, this is insufficient.

For example, suppose we want to use a classification model to predict the likelihood of someone having a rare, but serious health condition.

If the condition is very rare (say it appears in 0.01% of the population), then a model that always predicts false would have 99.99% accuracy, but the false negatives could have large consequences.
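To make this concrete, here is a small sketch with synthetic labels (the prevalence is taken from the example above): a classifier that always predicts "no condition" scores very high accuracy while catching none of the true cases.

# synthetic labels: 1 observation in 10,000 has the condition
y_rare = np.zeros(10_000, dtype=int)
y_rare[0] = 1

# a "model" that always predicts the negative class
always_negative = np.zeros_like(y_rare)

print(metrics.accuracy_score(y_rare, always_negative))  # 0.9999
print(metrics.recall_score(y_rare, always_negative))    # 0.0 -- misses every true case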

Precision and Recall

In order to capture situations like that, data scientists often use two other very common metrics:

  • Precision: The number of true positives over the number of positive predictions. Precision tells us how often the model was correct when it predicted true.
  • Recall: The number of true positives over the number of actual positives. Recall answers the question, “What fraction of the positives did we get correct?”

In the rare health condition example, you may prefer a model with high recall (rarely misses an at-risk patient), even if the precision is a bit low (sometimes you have false positives).

On the other hand, if your algorithm filters spam emails out of an inbox, you may prefer a model with high precision so that when an email is classified as spam, it is very likely to actually be spam (i.e. non-spam messages don’t get sent to spam folder).

In many settings, both precision and recall are equally important and a compound metric known as the F1-score is used:

$$ F1 = 2 \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$

The F1 score is bounded between 0 and 1. It will only achieve a value of 1 if both precision and recall are exactly 1.
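Before looking at the full report, here is a minimal sketch (using logistic_age_model and the training split from above) that computes the three metrics directly and checks the F1 formula.

y_pred = logistic_age_model.predict(X_train)

prec = metrics.precision_score(y_train, y_pred)
rec = metrics.recall_score(y_train, y_pred)
f1 = metrics.f1_score(y_train, y_pred)

# f1 should match the harmonic-mean formula above
print(prec, rec, f1, 2 * prec * rec / (prec + rec))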

We can have scikit-learn produce a textual report with precision and recall.


In [8]:
report = metrics.classification_report(
    y_train, logistic_age_model.predict(X_train),
    target_names=["no recid", "recid"]
)
print(report)
              precision    recall  f1-score   support

    no recid       0.67      0.73      0.69      2940
       recid       0.64      0.57      0.60      2470

    accuracy                           0.65      5410
   macro avg       0.65      0.65      0.65      5410
weighted avg       0.65      0.65      0.65      5410

ROC and AUC

For classification algorithms, there is a tradeoff between precision and recall.

Let’s illustrate this point in the context of the logistic regression model.

The output of a logistic regression is a probability of an event or label.

To obtain a definite prediction from the algorithm, the modeler would first select a threshold parameter $ p $ such that all model outputs above the threshold are given the label of true.

As this $ p $ increases, the model must be relatively more confident before assigning a label of true.

In this case, the model’s precision will increase (very confident when applying true label), but the recall will suffer (will apply false to some true cases that had a model output just below the raised threshold).
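Here is a small sketch of this tradeoff (the two threshold values are purely illustrative), using the predicted probabilities from logistic_age_model on the test set.

probs = logistic_age_model.predict_proba(X_test)[:, 1]

for threshold in (0.4, 0.6):
    preds = (probs > threshold).astype(int)
    prec = metrics.precision_score(y_test, preds)
    rec = metrics.recall_score(y_test, preds)
    print(f"threshold={threshold}: precision={prec:.3f}, recall={rec:.3f}")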

Machine learning practitioners have adapted a way to help us visualize this tradeoff.

The visualization technique is known as the receiver operating characteristic curve – more commonly called the ROC curve [1].

To understand this curve, consider two extreme choices for $ p $:

  • When $ p=1 $, we will (almost surely) never predict any observation to have a label 1. In this case, the false positive rate will be equal to 0, as will the true positive rate.
  • When $ p=0 $, we will predict that all observations always have a label of 1. The false positive rate and true positive rates will be equal to 1.

The ROC curve traces the relationship between the false positive rate (on the x axis) and the true positive rate (on the y axis) as the probability threshold $ p $ is changed.

Below, we define a function that uses scikit-learn to compute the true positive rate and false positive rates. Then we plot these rates against each other.

[1] The name “receiver operating characteristic” comes from its origin; during World War II, engineers used ROC curves to measure how well a radar signal could be properly detected from noise (i.e. enemy aircraft vs. noise).

In [9]:
def plot_roc(mod, X, y):
    # predicted_probs is an N x 2 array, where N is the number of observations
    # and 2 is the number of classes
    predicted_probs = mod.predict_proba(X)

    # keep the second column, for label=1
    predicted_prob1 = predicted_probs[:, 1]

    fpr, tpr, _ = metrics.roc_curve(y, predicted_prob1)

    # Plot ROC curve
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1], "k--")
    ax.plot(fpr, tpr)
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title("ROC Curve")

plot_roc(logistic_age_model, X_test, y_test)

We can use the ROC curve to determine the optimal threshold value.

Since the output of our recidivism model could potentially inform judicial decisions that affect people’s lives, we should think carefully about the tradeoff between a low false positive rate and high recall (a low false negative rate).

We may choose to err on the side of a low false negative rate so that we rarely fail to flag individuals who do go on to reoffend – in other words, we would favor a high true positive rate even if the false positive rate is higher.

See exercise 3 in the exercise list

The ROC curve can also be used for hyperparameter selection.

To see how, consider a model whose ROC curve passes through the point (0, 1) – meaning the false positive rate is 0 and the true positive rate is 1, i.e. the model has 100% accuracy.

Notice that integrating to obtain the area under the ROC curve returns a value of 1 for the perfect model.

The area under any other ROC curve would be less than 1.

Thus, we could use the area under the curve (abbreviated AUC) as an objective metric in cross-validation.
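For instance, here is a minimal sketch (assuming the X and y defined above for the decile_score and age model) of passing AUC as the scoring metric to cross-validation; exercise 4 below builds on this idea to tune the regularization strength.

# cross-validated AUC for a logistic regression on the current features
cv_auc = model_selection.cross_val_score(
    linear_model.LogisticRegression(solver="lbfgs"),
    X, y, scoring="roc_auc", cv=5
)
print(cv_auc.mean())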

Let’s compute the AUC for our current model and then look at the documentation for LogisticRegression, which lists the hyperparameters we could tune.

In [10]:
predicted_prob1 = logistic_age_model.predict_proba(X)[:, 1]
auc = metrics.roc_auc_score(y, predicted_prob1)
print(f"Initial AUC value is {auc:.4f}")

help(linear_model.LogisticRegression)
Initial AUC value is 0.7057
Help on class LogisticRegression in module sklearn.linear_model.logistic:

class LogisticRegression(sklearn.base.BaseEstimator, sklearn.linear_model.base.LinearClassifierMixin, sklearn.linear_model.base.SparseCoefMixin)
 |  LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
 |  
 |  Logistic Regression (aka logit, MaxEnt) classifier.
 |  
 |  In the multiclass case, the training algorithm uses the one-vs-rest (OvR)
 |  scheme if the 'multi_class' option is set to 'ovr', and uses the
 |  cross-entropy loss if the 'multi_class' option is set to 'multinomial'.
 |  (Currently the 'multinomial' option is supported only by the 'lbfgs',
 |  'sag', 'saga' and 'newton-cg' solvers.)
 |  
 |  This class implements regularized logistic regression using the
 |  'liblinear' library, 'newton-cg', 'sag', 'saga' and 'lbfgs' solvers. **Note
 |  that regularization is applied by default**. It can handle both dense
 |  and sparse input. Use C-ordered arrays or CSR matrices containing 64-bit
 |  floats for optimal performance; any other input format will be converted
 |  (and copied).
 |  
 |  The 'newton-cg', 'sag', and 'lbfgs' solvers support only L2 regularization
 |  with primal formulation, or no regularization. The 'liblinear' solver
 |  supports both L1 and L2 regularization, with a dual formulation only for
 |  the L2 penalty. The Elastic-Net regularization is only supported by the
 |  'saga' solver.
 |  
 |  Read more in the :ref:`User Guide <logistic_regression>`.
 |  
 |  Parameters
 |  ----------
 |  penalty : str, 'l1', 'l2', 'elasticnet' or 'none', optional (default='l2')
 |      Used to specify the norm used in the penalization. The 'newton-cg',
 |      'sag' and 'lbfgs' solvers support only l2 penalties. 'elasticnet' is
 |      only supported by the 'saga' solver. If 'none' (not supported by the
 |      liblinear solver), no regularization is applied.
 |  
 |      .. versionadded:: 0.19
 |         l1 penalty with SAGA solver (allowing 'multinomial' + L1)
 |  
 |  dual : bool, optional (default=False)
 |      Dual or primal formulation. Dual formulation is only implemented for
 |      l2 penalty with liblinear solver. Prefer dual=False when
 |      n_samples > n_features.
 |  
 |  tol : float, optional (default=1e-4)
 |      Tolerance for stopping criteria.
 |  
 |  C : float, optional (default=1.0)
 |      Inverse of regularization strength; must be a positive float.
 |      Like in support vector machines, smaller values specify stronger
 |      regularization.
 |  
 |  fit_intercept : bool, optional (default=True)
 |      Specifies if a constant (a.k.a. bias or intercept) should be
 |      added to the decision function.
 |  
 |  intercept_scaling : float, optional (default=1)
 |      Useful only when the solver 'liblinear' is used
 |      and self.fit_intercept is set to True. In this case, x becomes
 |      [x, self.intercept_scaling],
 |      i.e. a "synthetic" feature with constant value equal to
 |      intercept_scaling is appended to the instance vector.
 |      The intercept becomes ``intercept_scaling * synthetic_feature_weight``.
 |  
 |      Note! the synthetic feature weight is subject to l1/l2 regularization
 |      as all other features.
 |      To lessen the effect of regularization on synthetic feature weight
 |      (and therefore on the intercept) intercept_scaling has to be increased.
 |  
 |  class_weight : dict or 'balanced', optional (default=None)
 |      Weights associated with classes in the form ``{class_label: weight}``.
 |      If not given, all classes are supposed to have weight one.
 |  
 |      The "balanced" mode uses the values of y to automatically adjust
 |      weights inversely proportional to class frequencies in the input data
 |      as ``n_samples / (n_classes * np.bincount(y))``.
 |  
 |      Note that these weights will be multiplied with sample_weight (passed
 |      through the fit method) if sample_weight is specified.
 |  
 |      .. versionadded:: 0.17
 |         *class_weight='balanced'*
 |  
 |  random_state : int, RandomState instance or None, optional (default=None)
 |      The seed of the pseudo random number generator to use when shuffling
 |      the data.  If int, random_state is the seed used by the random number
 |      generator; If RandomState instance, random_state is the random number
 |      generator; If None, the random number generator is the RandomState
 |      instance used by `np.random`. Used when ``solver`` == 'sag' or
 |      'liblinear'.
 |  
 |  solver : str, {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'},              optional (default='liblinear').
 |  
 |      Algorithm to use in the optimization problem.
 |  
 |      - For small datasets, 'liblinear' is a good choice, whereas 'sag' and
 |        'saga' are faster for large ones.
 |      - For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs'
 |        handle multinomial loss; 'liblinear' is limited to one-versus-rest
 |        schemes.
 |      - 'newton-cg', 'lbfgs', 'sag' and 'saga' handle L2 or no penalty
 |      - 'liblinear' and 'saga' also handle L1 penalty
 |      - 'saga' also supports 'elasticnet' penalty
 |      - 'liblinear' does not handle no penalty
 |  
 |      Note that 'sag' and 'saga' fast convergence is only guaranteed on
 |      features with approximately the same scale. You can
 |      preprocess the data with a scaler from sklearn.preprocessing.
 |  
 |      .. versionadded:: 0.17
 |         Stochastic Average Gradient descent solver.
 |      .. versionadded:: 0.19
 |         SAGA solver.
 |      .. versionchanged:: 0.20
 |          Default will change from 'liblinear' to 'lbfgs' in 0.22.
 |  
 |  max_iter : int, optional (default=100)
 |      Maximum number of iterations taken for the solvers to converge.
 |  
 |  multi_class : str, {'ovr', 'multinomial', 'auto'}, optional (default='ovr')
 |      If the option chosen is 'ovr', then a binary problem is fit for each
 |      label. For 'multinomial' the loss minimised is the multinomial loss fit
 |      across the entire probability distribution, *even when the data is
 |      binary*. 'multinomial' is unavailable when solver='liblinear'.
 |      'auto' selects 'ovr' if the data is binary, or if solver='liblinear',
 |      and otherwise selects 'multinomial'.
 |  
 |      .. versionadded:: 0.18
 |         Stochastic Average Gradient descent solver for 'multinomial' case.
 |      .. versionchanged:: 0.20
 |          Default will change from 'ovr' to 'auto' in 0.22.
 |  
 |  verbose : int, optional (default=0)
 |      For the liblinear and lbfgs solvers set verbose to any positive
 |      number for verbosity.
 |  
 |  warm_start : bool, optional (default=False)
 |      When set to True, reuse the solution of the previous call to fit as
 |      initialization, otherwise, just erase the previous solution.
 |      Useless for liblinear solver. See :term:`the Glossary <warm_start>`.
 |  
 |      .. versionadded:: 0.17
 |         *warm_start* to support *lbfgs*, *newton-cg*, *sag*, *saga* solvers.
 |  
 |  n_jobs : int or None, optional (default=None)
 |      Number of CPU cores used when parallelizing over classes if
 |      multi_class='ovr'". This parameter is ignored when the ``solver`` is
 |      set to 'liblinear' regardless of whether 'multi_class' is specified or
 |      not. ``None`` means 1 unless in a :obj:`joblib.parallel_backend`
 |      context. ``-1`` means using all processors.
 |      See :term:`Glossary <n_jobs>` for more details.
 |  
 |  l1_ratio : float or None, optional (default=None)
 |      The Elastic-Net mixing parameter, with ``0 <= l1_ratio <= 1``. Only
 |      used if ``penalty='elasticnet'`. Setting ``l1_ratio=0`` is equivalent
 |      to using ``penalty='l2'``, while setting ``l1_ratio=1`` is equivalent
 |      to using ``penalty='l1'``. For ``0 < l1_ratio <1``, the penalty is a
 |      combination of L1 and L2.
 |  
 |  Attributes
 |  ----------
 |  
 |  classes_ : array, shape (n_classes, )
 |      A list of class labels known to the classifier.
 |  
 |  coef_ : array, shape (1, n_features) or (n_classes, n_features)
 |      Coefficient of the features in the decision function.
 |  
 |      `coef_` is of shape (1, n_features) when the given problem is binary.
 |      In particular, when `multi_class='multinomial'`, `coef_` corresponds
 |      to outcome 1 (True) and `-coef_` corresponds to outcome 0 (False).
 |  
 |  intercept_ : array, shape (1,) or (n_classes,)
 |      Intercept (a.k.a. bias) added to the decision function.
 |  
 |      If `fit_intercept` is set to False, the intercept is set to zero.
 |      `intercept_` is of shape (1,) when the given problem is binary.
 |      In particular, when `multi_class='multinomial'`, `intercept_`
 |      corresponds to outcome 1 (True) and `-intercept_` corresponds to
 |      outcome 0 (False).
 |  
 |  n_iter_ : array, shape (n_classes,) or (1, )
 |      Actual number of iterations for all classes. If binary or multinomial,
 |      it returns only 1 element. For liblinear solver, only the maximum
 |      number of iteration across all classes is given.
 |  
 |      .. versionchanged:: 0.20
 |  
 |          In SciPy <= 1.0.0 the number of lbfgs iterations may exceed
 |          ``max_iter``. ``n_iter_`` will now report at most ``max_iter``.
 |  
 |  Examples
 |  --------
 |  >>> from sklearn.datasets import load_iris
 |  >>> from sklearn.linear_model import LogisticRegression
 |  >>> X, y = load_iris(return_X_y=True)
 |  >>> clf = LogisticRegression(random_state=0, solver='lbfgs',
 |  ...                          multi_class='multinomial').fit(X, y)
 |  >>> clf.predict(X[:2, :])
 |  array([0, 0])
 |  >>> clf.predict_proba(X[:2, :]) # doctest: +ELLIPSIS
 |  array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
 |         [9.7...e-01, 2.8...e-02, ...e-08]])
 |  >>> clf.score(X, y)
 |  0.97...
 |  
 |  See also
 |  --------
 |  SGDClassifier : incrementally trained logistic regression (when given
 |      the parameter ``loss="log"``).
 |  LogisticRegressionCV : Logistic regression with built-in cross validation
 |  
 |  Notes
 |  -----
 |  The underlying C implementation uses a random number generator to
 |  select features when fitting the model. It is thus not uncommon,
 |  to have slightly different results for the same input data. If
 |  that happens, try with a smaller tol parameter.
 |  
 |  Predict output may not match that of standalone liblinear in certain
 |  cases. See :ref:`differences from liblinear <liblinear_differences>`
 |  in the narrative documentation.
 |  
 |  References
 |  ----------
 |  
 |  LIBLINEAR -- A Library for Large Linear Classification
 |      https://www.csie.ntu.edu.tw/~cjlin/liblinear/
 |  
 |  SAG -- Mark Schmidt, Nicolas Le Roux, and Francis Bach
 |      Minimizing Finite Sums with the Stochastic Average Gradient
 |      https://hal.inria.fr/hal-00860051/document
 |  
 |  SAGA -- Defazio, A., Bach F. & Lacoste-Julien S. (2014).
 |      SAGA: A Fast Incremental Gradient Method With Support
 |      for Non-Strongly Convex Composite Objectives
 |      https://arxiv.org/abs/1407.0202
 |  
 |  Hsiang-Fu Yu, Fang-Lan Huang, Chih-Jen Lin (2011). Dual coordinate descent
 |      methods for logistic regression and maximum entropy models.
 |      Machine Learning 85(1-2):41-75.
 |      https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf
 |  
 |  Method resolution order:
 |      LogisticRegression
 |      sklearn.base.BaseEstimator
 |      sklearn.linear_model.base.LinearClassifierMixin
 |      sklearn.base.ClassifierMixin
 |      sklearn.linear_model.base.SparseCoefMixin
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='warn', max_iter=100, multi_class='warn', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  fit(self, X, y, sample_weight=None)
 |      Fit the model according to the given training data.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix}, shape (n_samples, n_features)
 |          Training vector, where n_samples is the number of samples and
 |          n_features is the number of features.
 |      
 |      y : array-like, shape (n_samples,)
 |          Target vector relative to X.
 |      
 |      sample_weight : array-like, shape (n_samples,) optional
 |          Array of weights that are assigned to individual samples.
 |          If not provided, then each sample is given unit weight.
 |      
 |          .. versionadded:: 0.17
 |             *sample_weight* support to LogisticRegression.
 |      
 |      Returns
 |      -------
 |      self : object
 |      
 |      Notes
 |      -----
 |      The SAGA solver supports both float64 and float32 bit arrays.
 |  
 |  predict_log_proba(self, X)
 |      Log of probability estimates.
 |      
 |      The returned estimates for all classes are ordered by the
 |      label of classes.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = [n_samples, n_features]
 |      
 |      Returns
 |      -------
 |      T : array-like, shape = [n_samples, n_classes]
 |          Returns the log-probability of the sample for each class in the
 |          model, where classes are ordered as they are in ``self.classes_``.
 |  
 |  predict_proba(self, X)
 |      Probability estimates.
 |      
 |      The returned estimates for all classes are ordered by the
 |      label of classes.
 |      
 |      For a multi_class problem, if multi_class is set to be "multinomial"
 |      the softmax function is used to find the predicted probability of
 |      each class.
 |      Else use a one-vs-rest approach, i.e calculate the probability
 |      of each class assuming it to be positive using the logistic function.
 |      and normalize these values across all the classes.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = [n_samples, n_features]
 |      
 |      Returns
 |      -------
 |      T : array-like, shape = [n_samples, n_classes]
 |          Returns the probability of the sample for each class in the model,
 |          where classes are ordered as they are in ``self.classes_``.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |  
 |  __repr__(self, N_CHAR_MAX=700)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : boolean, optional
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : mapping of string to any
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as pipelines). The latter have parameters of the form
 |      ``<component>__<parameter>`` so that it's possible to update each
 |      component of a nested object.
 |      
 |      Returns
 |      -------
 |      self
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.BaseEstimator:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.linear_model.base.LinearClassifierMixin:
 |  
 |  decision_function(self, X)
 |      Predict confidence scores for samples.
 |      
 |      The confidence score for a sample is the signed distance of that
 |      sample to the hyperplane.
 |      
 |      Parameters
 |      ----------
 |      X : array_like or sparse matrix, shape (n_samples, n_features)
 |          Samples.
 |      
 |      Returns
 |      -------
 |      array, shape=(n_samples,) if n_classes == 2 else (n_samples, n_classes)
 |          Confidence scores per (sample, class) combination. In the binary
 |          case, confidence score for self.classes_[1] where >0 means this
 |          class would be predicted.
 |  
 |  predict(self, X)
 |      Predict class labels for samples in X.
 |      
 |      Parameters
 |      ----------
 |      X : array_like or sparse matrix, shape (n_samples, n_features)
 |          Samples.
 |      
 |      Returns
 |      -------
 |      C : array, shape [n_samples]
 |          Predicted class label per sample.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.ClassifierMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Returns the mean accuracy on the given test data and labels.
 |      
 |      In multi-label classification, this is the subset accuracy
 |      which is a harsh metric since you require for each sample that
 |      each label set be correctly predicted.
 |      
 |      Parameters
 |      ----------
 |      X : array-like, shape = (n_samples, n_features)
 |          Test samples.
 |      
 |      y : array-like, shape = (n_samples) or (n_samples, n_outputs)
 |          True labels for X.
 |      
 |      sample_weight : array-like, shape = [n_samples], optional
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          Mean accuracy of self.predict(X) wrt. y.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.linear_model.base.SparseCoefMixin:
 |  
 |  densify(self)
 |      Convert coefficient matrix to dense array format.
 |      
 |      Converts the ``coef_`` member (back) to a numpy.ndarray. This is the
 |      default format of ``coef_`` and is required for fitting, so calling
 |      this method is only required on models that have previously been
 |      sparsified; otherwise, it is a no-op.
 |      
 |      Returns
 |      -------
 |      self : estimator
 |  
 |  sparsify(self)
 |      Convert coefficient matrix to sparse format.
 |      
 |      Converts the ``coef_`` member to a scipy.sparse matrix, which for
 |      L1-regularized models can be much more memory- and storage-efficient
 |      than the usual numpy.ndarray representation.
 |      
 |      The ``intercept_`` member is not converted.
 |      
 |      Notes
 |      -----
 |      For non-sparse models, i.e. when there are not many zeros in ``coef_``,
 |      this may actually *increase* memory usage, so use this method with
 |      care. A rule of thumb is that the number of zero elements, which can
 |      be computed with ``(coef_ == 0).sum()``, must be more than 50% for this
 |      to provide significant benefits.
 |      
 |      After calling this method, further fitting with the partial_fit
 |      method (if any) will not work until you call densify.
 |      
 |      Returns
 |      -------
 |      self : estimator

See exercise 4 in the exercise list

Neural Network Classifiers

The final classifier we will visit today is a neural-network classifier, using the multi-layer perceptron network architecture.

Recall from the regression chapter that a multi-layer perceptron is comprised of a series of nested linear regressions separated by non-linear activation functions.

The number of neurons (the sizes of the weight matrices and bias vectors) in each layer is a hyperparameter chosen by the modeler, but for regression, the last layer had to have exactly one neuron, which represented the single regression target.

To use the MLP for classification tasks, we need to make three adjustments:

  1. Construct a final layer with $ N $ neurons instead of 1, where $ N $ is the number of classes in the classification task.
  2. Apply a softmax function on the network output.
  3. Use the cross-entropy loss function instead of the MSE to optimize network weights and biases.

The softmax function applied to a vector $ x \in \mathbb{R}^N $ is computed as

$$ \sigma(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}} $$

In words, the softmax function exponentiates each value and then divides each exponentiated value by the sum of all of them.

The output of the softmax function is a probability distribution (all entries are non-negative and sum to 1) that reflects the relative magnitudes of the inputs.
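Here is a minimal numpy sketch of the softmax computation (the input values are arbitrary).

def softmax(x):
    # subtract the max for numerical stability; this does not change the result
    z = np.exp(x - np.max(x))
    return z / z.sum()

scores = np.array([2.0, 1.0, -1.0])
probs = softmax(scores)
print(probs, probs.sum())  # non-negative entries that sum to 1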

Finally, the cross entropy loss function for $ M $ observations $ y $, with associated softmax vectors $ z $ is

$$ -\frac{1}{M} \sum_{j=1}^M \sum_{i=1}^N 1_{y_j = i} \log\left(z_{i,j}\right) $$

where $ 1_{y_j = i} $ is an indicator variable with the value of 1 if the observed class was equal to $ i $ for the $ j $ th observation, 0 otherwise.
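And here is a small sketch of the cross-entropy loss for a handful of made-up observations, following the formula above.

# hypothetical example: 3 observations, 2 classes; each row of z is a softmax output
z = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.6, 0.4]])
y_obs = np.array([0, 1, 1])  # observed class for each observation

M = len(y_obs)
# average of -log(probability assigned to the observed class)
loss = -np.mean(np.log(z[np.arange(M), y_obs]))
print(loss)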

All the same tradeoffs we saw when we used the multi-layer perceptron for regression will apply for classification tasks.

This includes positives like automated feature engineering and theoretically unlimited flexibility.

It also includes potential negatives, such as a risk of overfitting, high computational expenses compared to many classification algorithms, and lack of interpretability.

For a more detailed discussion, review the regression lecture.
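As a starting point for the exercise below, here is a minimal sketch of fitting scikit-learn's MLP classifier on the recidivism features used earlier; the layer sizes and other hyperparameters are purely illustrative, not tuned.

mlp_model = neural_network.MLPClassifier(
    hidden_layer_sizes=(30, 20),  # illustrative, not tuned
    max_iter=500,
    random_state=42
)
mlp_model.fit(X_train, y_train)
print(mlp_model.score(X_train, y_train), mlp_model.score(X_test, y_test))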

See exercise 5 in the exercise list

Aside: Neural Network Toolboxes

Thus far, we have been using the routines in scikit-learn’s neural_network package.

These are great for learning and exploratory analysis, as we have been doing, but are rarely used in production or real-world settings.

Why? 1) The scikit-learn routines do not leverage modern hardware like GPUs, so performance is likely much slower than it could be. 2) The routines only provide implementations of the most basic deep neural networks.

If you were to use neural networks in mission-critical situations, you would want to use modern neural network libraries such as Google’s tensorflow, Facebook’s pytorch, the Amazon-supported MXNet, or fastai.

Each of these toolkits has its own relative strengths and weaknesses, but we’ve seen tensorflow and pytorch used the most.

Thankfully, they all support Python as either the only or the primary point of access, so you will be well-prepared to start using them.

Application: Predicting US Recessions

Let’s apply our new knowledge of classification algorithms and use leading indicators to predict recessions in the US economy.

A leading indicator is a variable that moves or changes before the rest of the economy.

Many different leading indicators have been proposed – we’ll use a few of them.

We won’t explicitly prove that these variables are actually leading indicators, but we will show a plot of each variable that lets us visually inspect the hypothesis.

Data Prep

Let’s first gather the data from FRED.

In [11]:
start = "1974-01-01"
end = datetime.date.today()

def pct_change_on_last_year(df):
    "compute pct_change on previous year, assuming quarterly"
    return (df - df.shift(4))/df.shift(4)

def get_indicators_from_fred(start=start, end=end):
    """
    Fetch quarterly data on 6 leading indicators from time period start:end
    """
    # yield curve, unemployment, change in inventory, new private housing permits
    yc_unemp_inv_permit = (
        web.DataReader(["T10Y2Y", "UNRATE", "CBIC1", "PERMIT"], "fred", start, end)
        .resample("QS")
        .mean()
    )

    # percent change in housing prices and retail sales
    hpi_retail = (
        web.DataReader(["USSTHPI", "SLRTTO01USQ661S"], "fred", start, end)
        .resample("QS")  # already quarterly, adjusting so index is same
        .mean()
        .pipe(pct_change_on_last_year)
        .dropna()
    )

    indicators = (
        yc_unemp_inv_permit
        .join(hpi_retail)
        .dropna()
        .rename(columns=dict(
            USSTHPI="pct_change_hpi",
            T10Y2Y="yield_curve",
            UNRATE="unemp",
            CBIC1="inventory",
            SLRTTO01USQ661S="retail_sales",
            PERMIT="house_permits"
        ))
    )

    return indicators

indicators = get_indicators_from_fred()

indicators.head()
Out[11]:
yield_curve unemp inventory house_permits pct_change_hpi retail_sales
DATE
1976-04-01 0.801364 7.566667 61.087 1171.333333 0.071849 0.056936
1976-07-01 1.099688 7.733333 55.190 1345.000000 0.086474 0.037207
1976-10-01 1.467377 7.766667 20.163 1489.000000 0.079429 0.047523
1977-01-01 1.332222 7.500000 34.343 1562.000000 0.104979 0.037939
1977-04-01 1.248254 7.133333 51.245 1693.333333 0.109482 0.032931

Now, we also need data on recessions.

In [12]:
def get_recession_data():
    recession = (
        web.DataReader(["USRECQ"], "fred", start, end)
        .rename(columns=dict(USRECQ="recession"))
        ["recession"]
    )

    # extract start and end date for each recession
    start_dates = recession.loc[recession.diff() > 0].index.tolist()
    if recession.iloc[0] > 0:
        start_dates = [recession.index[0]] + start_dates

    end_dates = recession.loc[recession.diff() < 0].index.tolist()

    if len(start_dates) != len(end_dates):
        raise ValueError("Need to have same number of start/end dates!")

    return recession, start_dates, end_dates

recession, start_dates, end_dates = get_recession_data()

Now, let’s take a look at the data we have.

In [13]:
def add_recession_bands(ax):
    for s, e in zip(start_dates, end_dates):
        ax.axvspan(s, e, color="grey", alpha=0.2)

axs = indicators.plot(subplots=True, figsize=(8, 6), layout=(3, 2), legend=False)

for i, ax in enumerate(axs.flatten()):
    add_recession_bands(ax)
    ax.set_title(list(indicators)[i])

fig = axs[0, 0].get_figure()
fig.tight_layout();

For each of the chosen variables, you can see that the leading indicator makes a distinct move in the periods leading up to a recession (the grey bands in the background).

See exercise 6 in the exercise list

How Many Leads?

If the variables we have chosen truly are leading indicators, we should be able to use leading values of the variables to predict current or future recessions.

A natural question is: how many leads should we include?

Let’s explore that question by looking at many different sets of leads.

In [14]:
def make_train_data(indicators, rec, nlead=4):
    return indicators.join(rec.shift(nlead)).dropna()

def fit_for_nlead(ind, rec, nlead, mod):
    df = make_train_data(ind, rec, nlead)
    X = df.drop(["recession"], axis=1).copy()
    y = df["recession"].copy()

    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)

    mod.fit(X_train, y_train)
    cmat = metrics.confusion_matrix(y_test, mod.predict(X_test))
    return cmat

mod = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression(solver="lbfgs")
)

cmats = dict()
for nlead in range(1, 11):
    cmats[nlead] = np.zeros((2, 2))
    print(f"starting for {nlead} leads")
    for rep in range(200):
        cmats[nlead] += fit_for_nlead(indicators, recession, nlead, mod)

    cmats[nlead] = cmats[nlead] / 200

for k, v in cmats.items():
    print(f"\n\nThe average confusion matrix for {k} lag(s) was:\n {v}")
starting for 1 leads
starting for 2 leads
starting for 3 leads
starting for 4 leads
starting for 5 leads
starting for 6 leads
starting for 7 leads
starting for 8 leads
starting for 9 leads
starting for 10 leads


The average confusion matrix for 1 lag(s) was:
 [[37.705  0.65 ]
 [ 1.545  3.1  ]]


The average confusion matrix for 2 lag(s) was:
 [[37.36   1.055]
 [ 2.725  2.505]]


The average confusion matrix for 3 lag(s) was:
 [[37.99   0.935]
 [ 2.745  2.62 ]]


The average confusion matrix for 4 lag(s) was:
 [[37.925  0.675]
 [ 2.7    2.345]]


The average confusion matrix for 5 lag(s) was:
 [[37.635  0.805]
 [ 3.16   1.4  ]]


The average confusion matrix for 6 lag(s) was:
 [[36.98   1.195]
 [ 3.435  1.39 ]]


The average confusion matrix for 7 lag(s) was:
 [[36.525  1.22 ]
 [ 3.635  1.62 ]]


The average confusion matrix for 8 lag(s) was:
 [[36.275  1.245]
 [ 3.68   1.8  ]]


The average confusion matrix for 9 lag(s) was:
 [[35.405  1.635]
 [ 4.295  1.665]]


The average confusion matrix for 10 lag(s) was:
 [[34.6    1.665]
 [ 4.695  1.04 ]]

From the averaged confusion matrices reported above (in scikit-learn’s convention, rows correspond to the actual classes and columns to the predicted classes, so correct predictions lie on the diagonal), we see that the model with only one leading period was the most accurate.

After that was the model with 4 leading quarters.
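To back out an accuracy figure from each averaged matrix, here is a small sketch (assuming the cmats dictionary computed above).

# accuracy implied by each averaged confusion matrix: correct predictions / total
for k, v in cmats.items():
    print(f"{k} lead(s): accuracy = {np.trace(v) / v.sum():.3f}")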

Depending on the application, we might favor a model with higher accuracy or one that gives us more time to prepare (the 4 quarter model).

Why did the 1-lead and 4-lead models perform better than models with another number of leads? Perhaps because different variables start moving a different number of periods before the recession hits.

The exercise below asks you to explore this idea.

See exercise 7 in the exercise list

See exercise 8 in the exercise list

Exercises

Exercise 1

Determine the level of this cutoff value. Recall that the COMPAS score takes on integer values between 1 and 10, inclusive.

What happens to the cutoff level of the decile_score when you change the classification threshold from 0.5 to 0.7? What about 0.3? Remember this idea -- we'll come back to it soon.

(back to text)

Exercise 2

Experiment with different pairs of features to see which ones show the clearest decision boundaries.

Feed different X DataFrames into the fit_and_plot_decision_boundary function above.

(back to text)

Exercise 3

Use the metrics.roc_curve function to determine an appropriate value for the probability threshold, keeping in mind our preference for high precision over high recall.

The third return value of metrics.roc_curve is an array of the probability thresholds (p) used to compute each false positive rate and true positive rate.

To do this problem, you may wish to do the following steps:

  • Concoct an objective function in terms of the fpr and tpr.
  • Evaluate the objective function using the fpr and tpr variables returned by the metrics.roc_curve function.
  • Use np.argmin to find the index of the smallest value of the objective function.
  • Extract the value at that index from the array of probability thresholds.

Hint: If we cared about both precision and recall equally (we don't here), we might choose (fpr - tpr)**2 as one objective function. With this objective function, we would find the probability threshold value that makes the false positive and true positive rates as equal as possible.

In [15]:
# your code here

(back to text)

Exercise 4

With its default arguments, the LogisticRegression class implements logistic regression with l2 regularization (it penalizes the l2-norm of the coefficient vector).

The regularization strength is controlled by a parameter C that is passed to the LogisticRegression constructor.

Smaller values of C lead to stronger regularization.

For example, LogisticRegression(C=10) would have weaker regularization than LogisticRegression(C=0.5).

Your task here is to use the model_selection.cross_val_score method to select an optimal level for the regularization parameter C. The scoring argument should be set to roc_auc.

Refer to the example in the recidivism lecture for how to use model_selection.cross_val_score.

In [16]:
# your code here

(back to text)

Exercise 5

Use a multi-layer perceptron in our recidivism example via the neural_network.MLPClassifier class.

Experiment with different inputs such as:

  • The features to include
  • The number of layers and number of neurons in each layer
  • The l2 regularization parameter alpha
  • The solver

See if you can create a model that outperforms logistic regression.

Keep in mind other things, like the degree of overfitting and time required to estimate the model parameters. How do these compare to logistic regression?

In [17]:
# your code here

(back to text)

Exercise 6

Let's pause here to take a few minutes and digest.

If the task is to use these leading indicators to predict a recession, would high recall or high precision be more important for our model?

Would your answer change if you worked at the Federal Reserve?

What if you worked at a news company such as the Economist or the New York Times?

(back to text)

Exercise 7

Extend the logic from the previous example and allow a different number of leading periods for each variable.

How would you find the "optimal" number of leads for each variable? How could you try to avoid overfitting?

Use the make_train_data_varying_leads function below to construct your training data.

In [18]:
def make_train_data_varying_leads(indicators, rec, nlead):
    """
    Apply per-indicator leads to each indicator and join with recession data

    Parameters
    ----------
    indicators: pd.DataFrame
        A DataFrame with timestamps on index and leading indicators as columns

    rec: pd.Series
        A Series indicating if the US economy was in a recession each period

    nlead: dict
        A dictionary which maps a column name to a positive integer
        specifying how many periods to shift each indicator. Any
        indicator not given a key in this dictionary will not be
        included in the output DataFrame.

    Returns
    -------
    df: pd.DataFrame
        A DataFrame with the leads applied and merged with the recession
        indicator

    Example
    -------

    ```
    df = make_train_data_varying_leads(
        indicators,
        recession,
        nlead=dict(yield_curve=3, unemp=4)
    )

    df.shape[1]  # == 3 (yield_curve, unemp, recession)
    ```

    """
    cols = []
    for col in list(indicators):
        if col in nlead:
            cols.append(indicators[col].shift(-nlead[col]))

    X = pd.concat(cols, axis=1)
    return X.join(rec).dropna()

# your code here!

(back to text)

Exercise 8

Experiment with different classifiers. Which ones perform better or worse?

How well can you do on each metric (accuracy, precision, and recall)?

(back to text)