6 Statistical Inference

This section is still in development.

Definition 6.1 (Population) A population is a set of similar items or events which is of interest for some question or experiment.

Definition 6.2 (Sample) A sample is a set of individuals or objects collected or selected from a statistical population by a defined procedure.

6.1 Finite Sample Distributions

Definition 6.3 (Convergence in Distribution) A sequence \(\displaystyle X_n\) of random variables is said to converge in distribution to a random variable \(X\) if \[ \lim_{n\to\infty} F_{X_n}(x) = F_X(x) \] for every \(x \in \mathscr{R}\) at which \(F_X\) is continuous.

For a sequence of random \(k\)-vectors \(X_1, X_2, \dots\) taking values in \(\mathscr{R}^k\), we say that the sequence converges in distribution to a random \(k\)-vector \(X\) if \[ \lim_{n\to\infty} P(X_n\in A) = P(X\in A) \] for every \(A \subset \mathscr{R}^k\) which is a continuity set of \(X\), i.e. \(P(X \in \partial A) = 0\).

Definition 6.4 (Convergence in Probability) A sequence \(\displaystyle X_n\) of random variables converges in probability towards the random variable \(X\) if \[ \lim_{n\to\infty} P\left( \left\lvert X_n - X \right\rvert > \varepsilon \right) = 0 \] for all \(\varepsilon > 0\).

Definition 6.5 (Convergence in Mean) Given a real number \(r \geq 1\), we say that the sequence \(X_n\) converges in the \(r\)th mean or in the \(L^r\) norm towards the random variable \(X\) if the \(r\)th absolute moments of \(X_n\) and \(X\) exist, and \[ \lim_{n\to\infty} E\left( \left\lvert X_n - X \right\rvert^r \right) = 0. \]

Theorem 6.1 (Law of Large Numbers) Let \(X_1, X_2, \dots\) be independent and identically distributed random variables with finite mean \(\mu\), and let \(\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i\) be the average of the first \(n\) variables. The law of large numbers then states:

  1. Weak (convergence in probability) \[ \bar{X}_n \overset{P}{\underset{n\to\infty}{\longrightarrow}} \mu \]
  2. Weak (equivalent form) \[ P\left[ \left\lvert \bar{X}_n - \mu \right\rvert > \varepsilon \right] \underset{n\to\infty}{\longrightarrow} 0, \quad \forall \varepsilon > 0 \]
  3. Weak (equivalent form) \[ P\left[ \left\lvert \bar{X}_n - \mu \right\rvert < \varepsilon \right] \underset{n\to\infty}{\longrightarrow} 1, \quad \forall \varepsilon > 0 \]
  4. Strong (almost sure convergence) \[ P\left[ \left\{ \omega \in \Omega : \bar{X}_n(\omega) \underset{n\to\infty}{\longrightarrow} \mu \right\} \right] = 1 \]
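
A quick simulation illustrates the weak law: as the sample size grows, the sample mean settles around the population mean. The Uniform(0, 1) choice (with \(\mu = 0.5\)) and the sample sizes below are arbitrary illustrative assumptions, not part of the theorem.

```python
import random
import statistics

random.seed(42)

# Empirical sketch of the weak law of large numbers: the sample mean
# of n i.i.d. Uniform(0, 1) draws (mu = 0.5) approaches mu as n grows.
def sample_mean(n: int) -> float:
    return statistics.fmean(random.random() for _ in range(n))

for n in (10, 1_000, 100_000):
    print(n, sample_mean(n))
```

The printed means drift toward 0.5, with the deviation shrinking roughly like \(1/\sqrt{n}\).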

Theorem 6.2 (Central Limit Theorem) Suppose \(\left\{X_1, \dots, X_n\right\}\) is a sequence of independent and identically distributed random variables with \(E[X_i] = \mu\) and \(\text{Var}[X_i] = \sigma^2 < \infty\). Then, as \(n\) approaches infinity, the random variables \(\sqrt{n}\left( \bar{X}_n - \mu \right)\) converge in distribution to a normal \(\mathscr{N}(0, \sigma^2)\). \[ \text{i.e.}\quad \sqrt{n}\left( \bar{X}_n - \mu \right) \overset{d}{\longrightarrow}\mathscr{N}(0, \sigma^2) \]
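
The central limit theorem can also be checked by simulation. The sketch below standardizes means of Uniform(0, 1) samples (so \(\mu = 1/2\), \(\sigma^2 = 1/12\)); the values of \(n\) and the number of replications are arbitrary assumptions for illustration.

```python
import math
import random
import statistics

random.seed(0)

# CLT sketch: for X_i ~ Uniform(0, 1), mu = 1/2 and sigma^2 = 1/12.
# The standardized quantities sqrt(n) * (mean - mu) should be
# approximately N(0, sigma^2) for large n.
n, reps = 500, 2_000
mu, sigma = 0.5, math.sqrt(1 / 12)

z = [math.sqrt(n) * (statistics.fmean(random.random() for _ in range(n)) - mu)
     for _ in range(reps)]

print(statistics.fmean(z))   # close to 0
print(statistics.stdev(z))   # close to sigma (about 0.2887)
```

A histogram of `z` would show the familiar bell shape even though the underlying draws are uniform.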

Let \(X_1, \dots, X_n\) be independent and identically distributed random variables drawn according to some distribution \(P_\theta\) parametrized by \(\theta = (\theta_1, \dots, \theta_m) \in \Theta\), where \(\Theta\) is the set of all possible parameters for the selected distribution. The goal is to find the best estimator \(\hat{\theta} \in \Theta\) such that \(\hat{\theta} \approx \theta\) since the real \(\theta\) cannot be known exactly from a finite sample.

Definition 6.6 (Estimator) An estimator \(\hat{\theta}_j\) of a parameter \(\theta_j\) is a random variable \(\hat{\theta}_j(X_1, \dots, X_n)\) expressed as a function of the observed data.

Definition 6.7 (Estimate) An estimate \(\hat{\theta}_j(x_1, \dots, x_n)\) is a realization of the estimator. It is a real value for the estimated parameter.

Definition 6.8 (Bias) The bias of an estimator \(\hat{\theta}\) is defined as \[ \text{Bias}(\hat{\theta}, \theta) := E_\theta [\hat{\theta} - \theta]. \] We say that an estimator is unbiased if \(\text{Bias}(\hat{\theta}, \theta) = 0\), i.e. \(E_\theta [\hat{\theta}] = \theta\).

Definition 6.9 (Mean Squared Error) The mean squared error of an estimator \(\hat{\theta}\) is defined as \[ \text{MSE}_\theta[\hat{\theta}] := E_\theta\left[ \left( \hat{\theta} - \theta \right)^2 \right] = \text{Var}_\theta[\hat{\theta}] + \text{Bias}^2 (\hat{\theta}, \theta) \]
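
The bias–variance decomposition of the MSE can be verified numerically. As an illustrative sketch, the code below uses the "divide by \(n\)" variance estimator on \(\mathscr{N}(0, 1)\) samples, whose bias is known to be \(-\sigma^2/n\); the sample size and replication count are arbitrary assumptions.

```python
import random
import statistics

random.seed(1)

# Monte Carlo check of MSE = Var + Bias^2 for the "divide by n"
# variance estimator of sigma^2 = 1, using N(0, 1) samples of size n.
n, reps = 10, 20_000

def biased_var(xs):
    m = statistics.fmean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)  # divides by n, not n - 1

est = [biased_var([random.gauss(0, 1) for _ in range(n)]) for _ in range(reps)]

bias = statistics.fmean(est) - 1.0          # theory: -sigma^2 / n = -0.1
var = statistics.pvariance(est)
mse = statistics.fmean((e - 1.0) ** 2 for e in est)

print(bias, var, mse, var + bias ** 2)     # mse matches var + bias^2
```

Note that `mse` equals `var + bias ** 2` up to floating-point error, since the decomposition is an algebraic identity for the empirical moments as well.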

Definition 6.10 (Consistency) A sequence of estimators \(\hat{\theta}^{(n)}\) of the parameter \(\theta\) is called consistent if, for any \(\varepsilon > 0\), \[ P_\theta \left[ \left\lvert \hat{\theta}^{(n)} - \theta \right\rvert > \varepsilon \right] \underset{n\to\infty}{\longrightarrow} 0. \]

In other words, an estimator is consistent when it converges in probability to the true parameter value as the sample size grows.

Definition 6.11 (Relative Efficiency) The relative efficiency of two estimators is defined as \[ e\left(\hat{\theta}_1, \hat{\theta}_2\right) = \frac{\text{Var}\left[\hat{\theta}_2\right]}{\text{Var}\left[\hat{\theta}_1\right]}. \] We say that \(\hat{\theta}_1\) is preferable if \(\text{Var}\left[\hat{\theta}_1\right] < \text{Var}\left[\hat{\theta}_2\right]\).
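
A classical illustration compares the sample mean and the sample median as estimators of \(\mu\) for normal data, where asymptotic theory gives a relative efficiency of \(\pi/2 \approx 1.57\) in favor of the mean. The sketch below assumes \(\mathscr{N}(0, 1)\) data; the sample size and replication count are arbitrary.

```python
import random
import statistics

random.seed(11)

# Relative-efficiency sketch: for N(0, 1) data, both the sample mean
# and the sample median estimate mu = 0; asymptotically
# e(mean, median) = Var[median] / Var[mean] ~ pi/2 ~ 1.57.
n, reps = 101, 10_000
means, medians = [], []
for _ in range(reps):
    xs = [random.gauss(0, 1) for _ in range(n)]
    means.append(statistics.fmean(xs))
    medians.append(statistics.median(xs))

e = statistics.pvariance(medians) / statistics.pvariance(means)
print(e)  # around 1.57: the mean is the more efficient estimator here
```

Since \(e > 1\), the mean has the smaller variance and is preferable for normal data; for heavy-tailed distributions the comparison can reverse.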

6.2 Point Estimation

Definition 6.12 (Likelihood Function) The likelihood function \(\mathscr{L}\) is defined as \[ \mathscr{L}(\theta;\ x_1, \dots, x_n) = f(x_1, \dots, x_n;\ \theta). \] Assuming \(x_i \perp x_j\), \(\forall i \neq j\), \[ \mathscr{L}(\theta;\ x_1, \dots, x_n) = \prod_{i=1}^n f(x_i;\ \theta). \]

For practical purposes, we often work with the log-likelihood function \(\ell(\theta;\ x_1, \dots, x_n) = \log\mathscr{L}(\theta;\ x_1, \dots, x_n)\): sums are much easier to differentiate than products, and since the logarithm is strictly increasing, \(\ell\) and \(\mathscr{L}\) attain their maxima at the same values of \(\theta\).

Definition 6.13 (Maximum Likelihood Estimator) The maximum likelihood estimator \(\hat{\theta}\) for \(\theta\) is defined as \[ \hat{\theta} \in \left\{ \mathop{\mathrm{\arg\!\max}}_{\theta \in \Theta} \mathscr{L}(\theta;\ X_1, \dots, X_n) \right\} \]
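
As a concrete sketch of this definition, consider an Exponential(\(\lambda\)) sample with density \(f(x;\ \lambda) = \lambda e^{-\lambda x}\). The log-likelihood is \(\ell(\lambda) = n\log\lambda - \lambda\sum_i x_i\), and solving \(\ell'(\lambda) = n/\lambda - \sum_i x_i = 0\) gives the closed form \(\hat{\lambda} = 1/\bar{x}\). The code below checks this against a naive grid search; the true rate, sample size, and grid are arbitrary assumptions.

```python
import math
import random
import statistics

random.seed(7)

# MLE sketch for Exponential(rate): l(lam) = n*log(lam) - lam*sum(x),
# and dl/dlam = n/lam - sum(x) = 0 yields lam_hat = 1 / mean(x).
true_rate = 2.0
xs = [random.expovariate(true_rate) for _ in range(5_000)]
n, s = len(xs), sum(xs)

def loglik(lam: float) -> float:
    return n * math.log(lam) - lam * s

closed_form = 1 / statistics.fmean(xs)
# Grid search over lam in (0, 5) confirms the closed-form maximizer.
grid_max = max(range(1, 5_000), key=lambda k: loglik(k / 1_000)) / 1_000

print(closed_form, grid_max)  # both near the true rate 2.0
```

The agreement between the analytic maximizer and the grid search is the numerical counterpart of the \(\arg\max\) in Definition 6.13.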

Definition 6.14 (Score) The score is the gradient of the natural logarithm of the likelihood function with respect to an \(m\)-dimensional parameter vector \(\theta\). \[ s(\theta) := \frac{\partial \log \mathscr{L}(\theta)}{\partial\theta}. \]

The score indicates the steepness of the log-likelihood function and thereby the sensitivity to infinitesimal changes to the parameter values.

Definition 6.15 (Fisher Information) Let \(f(X;\ \theta)\) be the probability density function or probability mass function for \(X\) conditioned on the value of \(\theta\). We define the Fisher information as \[ \mathscr{I}(\theta) := E\left[\left(\frac{\partial}{\partial\theta}\log f(X;\ \theta)\right)^2\right] = - E\left[\frac{\partial^2}{\partial\theta^2}\log f(X;\ \theta)\right], \] where the second equality holds under suitable regularity conditions on \(f\).

The Fisher information is a way of measuring the amount of information that an observable random variable \(X\) carries about an unknown parameter \(\theta\) upon which the probability of \(X\) depends.
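A small simulation makes the first form of the definition concrete. For a Bernoulli(\(p\)) observation, the score is \(\frac{\partial}{\partial p}\log f(x;\ p) = x/p - (1-x)/(1-p)\) and theory gives \(\mathscr{I}(p) = \frac{1}{p(1-p)}\); the sketch below checks \(E[s(p)^2]\) against this value at an arbitrarily chosen \(p = 0.3\).

```python
import random
import statistics

random.seed(3)

# Fisher information sketch for Bernoulli(p): the score of one
# observation x is x/p - (1-x)/(1-p), and theory gives
# I(p) = 1 / (p * (1 - p)). We estimate E[score^2] by simulation.
p = 0.3
theory = 1 / (p * (1 - p))          # about 4.7619

draws = [1 if random.random() < p else 0 for _ in range(200_000)]
score_sq = statistics.fmean((x / p - (1 - x) / (1 - p)) ** 2 for x in draws)

print(score_sq, theory)             # the two values agree closely
```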

Definition 6.16 (Cramér–Rao Bound) Suppose \(\theta\) is an unknown deterministic parameter to be estimated from \(n\) independent observations \(x_1, \dots, x_n\), each drawn from a distribution with probability density or mass function \(f(x;\ \theta)\). The variance of any unbiased estimator \(\hat{\theta}\) of \(\theta\) is then bounded below by the reciprocal of the Fisher information of the sample. Namely, \[ \text{Var}\left[\hat{\theta}\right] \geq \frac{1}{n\mathscr{I}(\theta)}. \]
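
For \(\mathscr{N}(\mu, 1)\) data we have \(\mathscr{I}(\mu) = 1\), so the bound reads \(\text{Var}[\hat{\mu}] \geq 1/n\), and the sample mean attains it exactly. The simulation below is a sketch of this fact; the sample size and replication count are arbitrary assumptions.

```python
import random
import statistics

random.seed(5)

# Cramer-Rao sketch: for N(mu, 1), I(mu) = 1, so any unbiased
# estimator of mu has variance >= 1/n. The sample mean attains
# this bound, which we verify by simulation.
n, reps = 25, 10_000
bound = 1 / n  # 1 / (n * I(mu)) with I(mu) = 1

means = [statistics.fmean(random.gauss(0, 1) for _ in range(n))
         for _ in range(reps)]
print(statistics.pvariance(means), bound)  # both close to 0.04
```

An estimator attaining the bound, like the sample mean here, is called efficient (see Definition 6.17).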

Definition 6.17 (Efficiency) The efficiency of an unbiased estimator \(\hat{\theta}\) of a parameter \(\theta\) is defined as \[ e\left( \hat{\theta} \right) = \frac{1 / \mathscr{I}(\theta)}{\text{Var}\left[ \hat{\theta} \right]} \] where \(\mathscr{I}(\theta)\) is the Fisher information of the sample.

Proposition 6.1 \[ e\left( \hat{\theta} \right) \leq 1 \]

Remark (Maximum Likelihood Estimator Properties).

  • Asymptotically unbiased. Namely, \(\displaystyle \lim_{n\to\infty} \text{Bias}\left( \hat{\theta}_n, \theta \right) = 0\).
  • Asymptotically efficient. Namely, the Cramér–Rao bound is attained in the limit: \(\displaystyle \lim_{n\to\infty} n\,\text{Var}\left[ \hat{\theta}_n \right] = \frac{1}{\mathscr{I}(\theta)}\).
  • Consistency.
  • Asymptotically normal. Namely, \(\displaystyle \sqrt{n}\left( \hat{\theta}_n - \theta \right) \overset{d}{\longrightarrow}\mathscr{N}\left( 0, \frac{1}{\mathscr{I}(\theta)} \right)\).

6.3 Interval Estimation

6.4 Parametric Hypothesis Testing

6.5 Normality Tests