Class 12: Statistics. Controlling overfitting

Systems Biology

Andrés Aravena, PhD

December 14, 2023

So far we have done descriptive statistics

We described one dataset and nothing more

Now we will make inferences about a larger population

\(y\) is random

We have \[y = X\beta + e\]

We assume \(X\) is not random, but \(e\) is random \[\mathbb{E}(e)=0\qquad \mathbb{V}(e)=\sigma^2 I\]

(How do we calculate \(\mathbb{V}(e)\)?)

Thus \(y\) is random
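
A minimal simulation sketch in R (all numbers invented for illustration): every run draws a new \(e\), and therefore a new random \(y\)

```r
# Toy model y = X beta + e with E(e) = 0 and V(e) = sigma^2 I
set.seed(42)
n <- 30
x <- seq(1, 10, length.out = n)
X <- cbind(1, x)                     # design matrix: intercept and slope
beta <- c(2, 0.5)                    # "true" coefficients (made up)
sigma <- 1
e <- rnorm(n, mean = 0, sd = sigma)  # errors: mean 0, variance sigma^2
y <- drop(X %*% beta) + e            # y inherits the randomness of e
```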

The coefficients are random

From the sample we calculate \(\hat\beta\), the random estimate of \(\beta\)

  • What is their expected value?

  • Variance?

  • Distribution?
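
A sketch of one way to see this, reusing the toy data above: repeat the experiment many times and inspect the empirical statistics of \(\hat\beta\)

```r
# Simulate B independent samples and estimate beta in each one
B <- 1000
beta_hat <- replicate(B, {
  e <- rnorm(n, mean = 0, sd = sigma)
  y <- drop(X %*% beta) + e
  coef(lm(y ~ x))                    # hat(beta) for this sample
})
rowMeans(beta_hat)                   # expected value: close to the true beta
apply(beta_hat, 1, var)              # variance of each coefficient
hist(beta_hat["x", ])                # distribution of the slope estimates
```

The histogram looks approximately normal, which is what the theory predicts when \(e\) is normal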

This should help us design better experiments

Contrasts

Let \(C\) be an invertible matrix whose dimension equals the number of coefficients

Let \(\gamma = C\beta\) be a combination of the coefficients

  • What can we find with them?
  • What are their statistics?
  • How do we use them in R?
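
A sketch of how this can look in R, with the fitted toy model and a made-up invertible \(C\):

```r
# Contrasts gamma = C beta for the toy model above
model <- lm(y ~ x)                        # fitted linear model
C <- matrix(c(1,  1,                      # gamma_1 = beta_0 + beta_1
              1, -1),                     # gamma_2 = beta_0 - beta_1
            nrow = 2, byrow = TRUE)
gamma_hat <- C %*% coef(model)            # estimates of gamma = C beta
V_gamma   <- C %*% vcov(model) %*% t(C)   # their variance matrix
```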

Linear models are more than straight lines

  • Exponentials
  • Power laws
  • Polynomials
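
All of these are linear in the coefficients, so lm() can fit them after a change of variables; a sketch (the log transforms assume \(y>0\) and \(x>0\)):

```r
fit_exp  <- lm(log(y) ~ x)         # exponential: y = a * exp(b * x)
fit_pow  <- lm(log(y) ~ log(x))    # power law:   y = a * x^b
fit_poly <- lm(y ~ poly(x, 3))     # cubic polynomial in x
```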

Too many variables can fool us

Overfitting

Controlling overfitting by penalization

Instead of minimizing \[\sum (y_i - X_i\beta)^2\] we minimize \[\sum (y_i - X_i\beta)^2 + P(n,m,\dots)\] where \(P\) is a penalty term that forces “simpler” models

Examples:

  • Akaike information criterion (AIC)
  • Bayesian information criterion (BIC)
  • Ridge regression
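
For instance, base R can compare a simple and a flexible model with these criteria (a sketch on the toy data above):

```r
fit1 <- lm(y ~ x)                  # simple model
fit2 <- lm(y ~ poly(x, 5))         # more flexible, risks overfitting
AIC(fit1, fit2)                    # lower AIC is preferred
BIC(fit1, fit2)                    # BIC penalizes extra coefficients more
```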

Ridge regression

We minimize \[\sum (y_i - X_i\beta)^2 + \lambda \sum \beta_j^2\] so we ask the coefficients to be small

\(\lambda\) controls how much we restrict the coefficients
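
One way to fit this in R is the glmnet package (an assumption: it is not part of base R and must be installed), where alpha = 0 selects the ridge penalty:

```r
library(glmnet)
Xp <- model.matrix(y ~ poly(x, 5))[, -1]  # predictor matrix, intercept removed
fit_ridge <- glmnet(Xp, y, alpha = 0)     # alpha = 0: ridge penalty
cv <- cv.glmnet(Xp, y, alpha = 0)         # choose lambda by cross-validation
coef(fit_ridge, s = cv$lambda.min)        # small, but nonzero, coefficients
```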

LASSO

We minimize \[\sum (y_i - X_i\beta)^2 + \lambda \sum \vert\beta_j\vert\]

That is, we penalize with the Manhattan (\(\ell_1\)) norm instead of the Euclidean (\(\ell_2\)) norm

This forces more coefficients to be exactly zero, so LASSO also selects variables
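
With the same assumed glmnet setup, alpha = 1 selects the LASSO penalty:

```r
fit_lasso <- glmnet(Xp, y, alpha = 1)     # alpha = 1: LASSO penalty
cv <- cv.glmnet(Xp, y, alpha = 1)         # choose lambda by cross-validation
coef(fit_lasso, s = cv$lambda.min)        # note the coefficients that are exactly zero
```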