Systems Biology

Find the “best” representative

Assume we have a vector of $$n$$ values $𝐲=\{y_1, y_2, …, y_n \}$ If we want to describe the set $$𝐲$$ with a single number $$x$$, which would it be?

If we have to replace every $$y_i$$ with a single number, which number is “the best”?

It is better to choose the one that is the “least wrong”

How can $$x$$ be wrong?

Many alternatives to measure the error

• Number of times that $$x≠y_i$$
• Sum of absolute value of error
• Sum of the square of error

and maybe other ways
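The three error measures listed above can be sketched in a few lines of Python. The data values and the function names are hypothetical, chosen only for illustration.

```python
# Hypothetical data, used only to illustrate the three error measures
y = [2.0, 3.0, 3.0, 7.0]

def count_mismatches(x, y):
    """Number of times that x differs from y_i."""
    return sum(1 for yi in y if yi != x)

def sum_abs_error(x, y):
    """Sum of the absolute values of the errors."""
    return sum(abs(yi - x) for yi in y)

def sum_sq_error(x, y):
    """Sum of the squares of the errors."""
    return sum((yi - x) ** 2 for yi in y)

x = 3.0  # one candidate representative
print(count_mismatches(x, y), sum_abs_error(x, y), sum_sq_error(x, y))
```

Trying other values of `x` shows that each measure can prefer a different representative.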

Today we will use the square of the error

Squared error

The squared error when $$x$$ represents $$𝐲$$ is $\mathrm{SE}(x)=\sum_i (y_i-x)^2$ Which $$x$$ minimizes the squared error?
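One way to explore this question before doing any algebra is to try many candidate values and keep the one with the smallest squared error. This is only a brute-force sketch: the data and the grid of candidates are assumptions made for illustration.

```python
# Hypothetical data, used only for illustration
y = [2.0, 3.0, 3.0, 7.0]

def squared_error(x, y):
    """SE(x): sum over i of (y_i - x)^2."""
    return sum((yi - x) ** 2 for yi in y)

# Candidate values 0.00, 0.01, ..., 10.00 (an arbitrary grid)
candidates = [k / 100 for k in range(1001)]

# Keep the candidate with the smallest squared error
best_x = min(candidates, key=lambda x: squared_error(x, y))
print(best_x, squared_error(best_x, y))
```

A grid search only finds the best value among the candidates it tries, which is why an exact formula is preferable.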

Minimizing SE using geometry

We can write \begin{aligned} \mathrm{SE}(x)&=\sum_i (y_i-x)^2 =\sum_i (y_i^2 - 2y_ix + x^2)\\ &=\sum_i y_i^2 - \sum_i 2 y_ix + \sum_i x^2\\ &=\sum_i y_i^2 - x\sum_i 2 y_i + n x^2\\ \end{aligned}

This is a second-degree polynomial in $$x$$, so its graph is a parabola

Parabola

We have $\mathrm{SE}(x) =\underbrace{n}_a x^2 + \underbrace{\left(-2\sum_i y_i\right)}_b \, x+ \underbrace{\sum_i y_i^2}_c$ which has the form of $$ax^2+ bx + c$$

Let’s explore it in GeoGebra

Roots of a second degree equation

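When we have $$ax^2+ bx + c =0$$ then the two roots are \begin{aligned} x_1 &= \frac{-b-\sqrt{b^2-4ac} }{2a}\\ x_2 &= \frac{-b+\sqrt{b^2-4ac} }{2a} \end{aligned} and the middle point is $\frac{x_1 + x_2}{2} = \frac{-b}{2a}$ A parabola is symmetric around this middle point, so when $$a>0$$ its lowest point is at $$x=-b/(2a)$$, even when the roots are not real numbers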

Replacing the values

We have $\mathrm{SE}(x) =\underbrace{n}_a x^2 + \underbrace{\left(-2\sum_i y_i\right)}_b \, x+ \underbrace{\sum_i y_i^2}_c$ so the center point is $\frac{-b}{2a}=\frac{2\sum_i y_i}{2n}=\frac{\sum_i y_i}{n}$
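As a quick numerical check, we can compute the coefficients $$a$$, $$b$$ and $$c$$ of the parabola and compare the center point $$-b/(2a)$$ with the average of the values. The data below are hypothetical, and $$b$$ is taken with its sign, that is, $$b=-2\sum_i y_i$$.

```python
# Hypothetical data, used only for illustration
y = [2.0, 3.0, 3.0, 7.0]
n = len(y)

# Coefficients of SE(x) = a*x^2 + b*x + c
a = n
b = -2 * sum(y)
c = sum(yi ** 2 for yi in y)

def squared_error(x):
    """SE(x) computed directly from its definition."""
    return sum((yi - x) ** 2 for yi in y)

# The polynomial form agrees with the definition at an arbitrary point
x = 1.5
print(a * x ** 2 + b * x + c, squared_error(x))

# Center point of the parabola versus the average of the values
vertex = -b / (2 * a)
print(vertex, sum(y) / n)
```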

Arithmetic Mean: minimum squared error

We get the minimum squared error when $$x$$ is the mean

The arithmetic mean of $$𝐲$$ is $\text{mean}(𝐲) = \frac{1}{n}\sum_{i=1}^n y_i$ where $$n$$ is the size of the set $$𝐲$$.

Sometimes it is written as $$\bar{𝐲}$$

This value is usually called the mean, sometimes the average

Alternative: using calculus

Squared Error is a function

A function is a rule that takes a number and gives another number

In this case $$\mathrm{SE}(β)$$ takes a candidate value $$β$$ (called $$x$$ before) and returns the squared error

Straight tangent lines

The red and blue lines correspond to equations like $y=ax+b$ where

• $$a$$ is the slope
• $$b$$ is the place where the line intercepts the y-axis

This is called the equation of a straight line, or a linear equation

Each position has a slope

For any value $$β$$ we can find the slope of $$\mathrm{SE}$$ at position $$β$$

This is called the derivative of $$\mathrm{SE}$$

Some simple cases

• derivative of $$a⋅β$$ is $$a$$
• derivative of a constant is 0
• derivative of $$β^2$$ is $$2β$$
• derivative of $$β^n$$ is $$n⋅β^{n-1}$$
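The rules above can be checked numerically by estimating the slope from two nearby points. The helper `slope` below is a hypothetical name, and the central difference it uses is only an approximation of the derivative, not an exact calculation.

```python
def slope(f, beta, h=1e-6):
    """Central-difference estimate of the slope of f at position beta."""
    return (f(beta + h) - f(beta - h)) / (2 * h)

beta = 1.5  # an arbitrary position

print(slope(lambda b: 4 * b, beta), 4)               # rule: a
print(slope(lambda b: 7.0, beta), 0)                 # rule: 0
print(slope(lambda b: b ** 2, beta), 2 * beta)       # rule: 2*beta
print(slope(lambda b: b ** 5, beta), 5 * beta ** 4)  # rule: n*beta^(n-1)
```

Each line prints the numerical estimate next to the value given by the rule, and the two should agree up to a small rounding error.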

In general, we can use Wolfram Alpha (https://www.wolframalpha.com/)

We focus on the idea, not on the technique

To find the smallest value we use derivatives

To find the value of $$β$$ that minimizes $$\mathrm{SE}(β)$$ we

• Calculate the derivative of $$\mathrm{SE}(β)$$, written as $\frac{d\mathrm{SE}}{dβ}(β)$

• Find $$β$$ such that the derivative is zero. That is, solve $\frac{d\mathrm{SE}}{dβ}(β)=0$

That is how we find the mean

We have $$\mathrm{SE}(β)=\sum_i (y_i-β)^2$$. The derivative is $\frac{d}{dβ} \mathrm{SE}(β)= -2\sum_i (y_i - β)= -2\sum_i y_i + 2nβ$ Then we need to find $$β$$ such that $-2\sum_i y_i + 2nβ = 0$ or, equivalently, $2\sum_i y_i - 2nβ = 0$

Solving for the best $$β$$

The equation we want to solve is $2\sum_i y_i - 2nβ = 0$

The smallest squared error is obtained when $β = \frac{1}{n} \sum_i y_i$
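A small sketch can confirm that the derivative of the squared error is zero at this value of $$β$$, and that moving away from it in either direction increases the error. The data and function names are hypothetical.

```python
# Hypothetical data, used only for illustration
y = [2.0, 3.0, 3.0, 7.0]
n = len(y)

def se(beta):
    """Squared error when beta represents y."""
    return sum((yi - beta) ** 2 for yi in y)

def d_se(beta):
    """Derivative of SE: -2 * sum(y_i - beta) = -2*sum(y) + 2*n*beta."""
    return -2 * sum(y) + 2 * n * beta

beta_best = sum(y) / n  # the value proposed by the formula

print(d_se(beta_best))                    # slope at the proposed value
print(se(beta_best - 0.1) > se(beta_best))  # moving left increases SE
print(se(beta_best + 0.1) > se(beta_best))  # moving right increases SE
```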

Variance and covariance

Sample variance formula

\begin{aligned} \mathrm{var}(𝐲)&=\frac 1 n \sum_i (y_i-\bar{𝐲})^2=\frac 1 n \sum_i (y_i^2-2\bar{𝐲}y_i+ \bar{𝐲}^2)\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}\frac 1 n \sum_i y_i+ \bar{𝐲}^2\frac 1 n \sum_i 1\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}\bar{𝐲}+ \bar{𝐲}^2\frac 1 n n\\ &=\frac 1 n \sum_i y_i^2-2\bar{𝐲}^2+ \bar{𝐲}^2\\ &=\frac 1 n \sum_i y_i^2-\bar{𝐲}^2\\ \end{aligned}

To remember

$\mathrm{var}(𝐲)=\frac 1 n \sum_i (y_i-\bar{𝐲})^2=\frac 1 n \sum_i y_i^2-\bar{𝐲}^2$

“The average of the squares minus the square of the average”
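The two expressions for the variance can be compared numerically. The sketch below uses hypothetical data and function names. It also calls `statistics.pvariance` from the Python standard library, which divides by $$n$$ like the formula used here.

```python
import statistics

def mean(v):
    """Arithmetic mean of a list of numbers."""
    return sum(v) / len(v)

def var_definition(y):
    """Average of the squared differences to the mean."""
    m = mean(y)
    return mean([(yi - m) ** 2 for yi in y])

def var_shortcut(y):
    """Average of the squares minus the square of the average."""
    return mean([yi ** 2 for yi in y]) - mean(y) ** 2

# Hypothetical data, used only for illustration
y = [2.0, 3.0, 3.0, 7.0]
print(var_definition(y), var_shortcut(y), statistics.pvariance(y))
```

With floating-point numbers the shortcut can lose precision when the mean is large compared with the spread of the values, so the agreement is only up to rounding error.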

Sum of two vectors

\begin{aligned} \mathrm{var}(𝐱+𝐲)&=\frac 1 n \sum_i (x_i+ y_i-\bar{𝐱}-\bar{𝐲})^2\\ &=\frac 1 n \sum_i ((x_i-\bar{𝐱})+ (y_i-\bar{𝐲}))^2\\ &=\frac 1 n \sum_i \left((x_i-\bar{𝐱})^2 +(y_i-\bar{𝐲})^2+ 2(x_i-\bar{𝐱})(y_i-\bar{𝐲})\right)\\ &=\frac 1 n \sum_i (x_i-\bar{𝐱})^2 +\frac 1 n \sum_i (y_i-\bar{𝐲})^2+ 2\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})\\ &=\mathrm{var}(𝐱) +\mathrm{var}(𝐲)+ 2\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲}) \end{aligned}

What is this extra term?

The expression $\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})$ is called the covariance of $$𝐱$$ and $$𝐲$$

We write it as $\mathrm{cov}(𝐱,𝐲)$

Then the variance of the sum is

$\mathrm{var}(𝐱+𝐲)=\mathrm{var}(𝐱) +\mathrm{var}(𝐲)+ 2\mathrm{cov}(𝐱,𝐲)$

The variance of the sum is the sum of the variances plus twice the covariance
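The identity can be checked on a small example. The two vectors below are hypothetical, and the helper functions follow the $$\frac 1 n$$ definitions used in these notes.

```python
def mean(v):
    """Arithmetic mean of a list of numbers."""
    return sum(v) / len(v)

def var(v):
    """Variance with the 1/n definition."""
    m = mean(v)
    return mean([(vi - m) ** 2 for vi in v])

def cov(x, y):
    """Covariance with the 1/n definition."""
    mx, my = mean(x), mean(y)
    return mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])

# Hypothetical vectors of the same length
x = [1.0, 2.0, 4.0, 5.0]
y = [2.0, 3.0, 3.0, 7.0]

total = [xi + yi for xi, yi in zip(x, y)]  # the vector x + y
lhs = var(total)
rhs = var(x) + var(y) + 2 * cov(x, y)
print(lhs, rhs)
```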

Alternative expression

\begin{aligned} \frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})&=\frac 1 n \sum_i (x_i y_i-\bar{𝐱}y_i+x_i\bar{𝐲}-\bar{𝐱}\bar{𝐲})\\ &=\frac 1 n \sum_i x_i y_i-\frac 1 n \sum_i\bar{𝐱}y_i-\frac 1 n \sum_i x_i\bar{𝐲}+\frac 1 n \sum_i\bar{𝐱}\bar{𝐲}\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\frac 1 n \sum_i y_i - \bar{𝐲}\frac 1 n \sum_i x_i + \bar{𝐱}\bar{𝐲}\frac 1 n \sum_i 1\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}- \bar{𝐱}\bar{𝐲}+\bar{𝐱}\bar{𝐲}\\ &=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}\\ \end{aligned}

Covariance

$\mathrm{cov}(𝐱,𝐲)=\frac 1 n \sum_i (x_i-\bar{𝐱})(y_i-\bar{𝐲})=\frac 1 n \sum_i x_i y_i-\bar{𝐱}\bar{𝐲}$

The second formula is easier to calculate

“The average of the products minus the product of the averages”
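Both covariance formulas can be written directly in code and compared. The data and function names are hypothetical.

```python
def mean(v):
    """Arithmetic mean of a list of numbers."""
    return sum(v) / len(v)

def cov_definition(x, y):
    """Average of the products of the differences to the means."""
    mx, my = mean(x), mean(y)
    return mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])

def cov_shortcut(x, y):
    """Average of the products minus the product of the averages."""
    return mean([xi * yi for xi, yi in zip(x, y)]) - mean(x) * mean(y)

# Hypothetical vectors of the same length
x = [1.0, 2.0, 4.0, 5.0]
y = [2.0, 3.0, 3.0, 7.0]
print(cov_definition(x, y), cov_shortcut(x, y))
```

Note that the covariance of a vector with itself is its variance, so `cov_shortcut(y, y)` reproduces the variance formula seen before.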

Interpretation of Covariance

If $$𝐱$$ and $$𝐲$$ go in the same direction,
then the covariance is positive

If $$𝐱$$ and $$𝐲$$ go in opposite directions,
then the covariance is negative

Covariance under change of scale

It is easy to see that, for any constants $$a$$ and $$b$$, we have \begin{aligned} \mathrm{cov}(a\, 𝐱,𝐲)&=a\, \mathrm{cov}(𝐱,𝐲)\\ \mathrm{cov}(𝐱, b\,𝐲)&=b\, \mathrm{cov}(𝐱,𝐲)\\ \mathrm{cov}(a\, 𝐱, b\,𝐲)&=ab\, \mathrm{cov}(𝐱,𝐲)\\ \end{aligned} It would be nice to have a “covariance” value that is independent of the scale
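The effect of a change of scale can be seen in a short example. The vectors and the constants `a` and `b` are hypothetical. Think of them as a change of units, such as going from meters to centimeters.

```python
def mean(v):
    """Arithmetic mean of a list of numbers."""
    return sum(v) / len(v)

def cov(x, y):
    """Covariance with the 1/n definition."""
    mx, my = mean(x), mean(y)
    return mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])

# Hypothetical vectors and scale factors
x = [1.0, 2.0, 4.0, 5.0]
y = [2.0, 3.0, 3.0, 7.0]
a, b = 10.0, -0.5

ax = [a * xi for xi in x]  # the vector a*x
by = [b * yi for yi in y]  # the vector b*y

print(cov(ax, y), a * cov(x, y))
print(cov(x, by), b * cov(x, y))
print(cov(ax, by), a * b * cov(x, y))
```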

Correlation

One way to be independent of the scale is to use $\mathrm{corr}(𝐱,𝐲)=\frac{\mathrm{cov}(𝐱,𝐲)}{\mathrm{sdev}(𝐱)\mathrm{sdev}(𝐲)}$ where $$\mathrm{sdev}(𝐱)=\sqrt{\mathrm{var}(𝐱)}$$ is the standard deviation. This is the correlation between $$𝐱$$ and $$𝐲$$

It is always a value between -1 and 1
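The sketch below computes the correlation for hypothetical data and shows that it does not change when the vectors are multiplied by positive constants. A negative constant would flip the sign of the correlation, so the scale factors here are assumed to be positive.

```python
import math

def mean(v):
    """Arithmetic mean of a list of numbers."""
    return sum(v) / len(v)

def cov(x, y):
    """Covariance with the 1/n definition."""
    mx, my = mean(x), mean(y)
    return mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])

def sdev(v):
    """Standard deviation: square root of the variance, cov(v, v)."""
    return math.sqrt(cov(v, v))

def corr(x, y):
    """Correlation: covariance divided by both standard deviations."""
    return cov(x, y) / (sdev(x) * sdev(y))

# Hypothetical vectors and positive scale factors
x = [1.0, 2.0, 4.0, 5.0]
y = [2.0, 3.0, 3.0, 7.0]
r = corr(x, y)
r_scaled = corr([10.0 * xi for xi in x], [0.5 * yi for yi in y])
print(r, r_scaled)
```

The function fails with a division by zero when one of the vectors is constant, because its standard deviation is then zero.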

Squared error of a straight line

$SE(β_0, β_1) = \sum_i (y_i - β_0 - β_1 x_i)^2$ This time we need two derivatives \begin{aligned} \frac{d}{dβ_0} SE(β_0, β_1) &= -2\sum_i (y_i - β_0 - β_1 x_i)\\ \frac{d}{dβ_1} SE(β_0, β_1) &= -2\sum_i (y_i - β_0 - β_1 x_i)⋅x_i \end{aligned} Each one must be equal to 0, so the sign of the constant factor does not affect the solution

First equation

The first equation to solve is $$\frac{d}{dβ_0} SE(β_0, β_1) = 0$$

That is, we look for $$β_0$$ such that $2\sum_i (y_i - β_0 - β_1 x_i) = 0$ We can divide by 2 and expand the parentheses $\sum_i y_i - \sum_i β_0 - \sum_i β_1 x_i = 0$

First solution

If $$\sum_i y_i - \sum_i β_0 - \sum_i β_1 x_i = 0$$ then $\sum_i y_i = n\cdot β_0 + β_1\sum_i x_i$ Therefore, dividing by $$n$$, we have $\overline{𝐲} =β_0 + β_1 \overline{𝐱}$ In other words, we have $β_0 = \overline{𝐲} - β_1 \overline{𝐱}$

Second equation

We want to solve $$\frac{d}{dβ_1} SE(β_0, β_1) = 0$$

That is, we want to find $$β_1$$ such that $2\sum_i (y_i - β_0 - β_1 x_i)⋅x_i = 0$ Dropping the 2 and expanding the parentheses we have $\sum_i x_i y_i - \sum_i β_0 x_i - \sum_i β_1 x_i^2 = 0$

Tidying up

We have $\sum_i x_iy_i - β_0\sum_i x_i - β_1\sum_i x_i^2 = 0$ It is convenient to divide everything by $$n$$ \begin{aligned} \frac 1 n \sum_i x_iy_i - β_0\frac 1 n \sum_i x_i - β_1\frac 1 n \sum_i x_i^2 &= 0\\ \frac 1 n \sum_i x_iy_i - β_0\overline{𝐱}- β_1 \frac 1 n \sum_i x_i^2 &=0\\ \end{aligned}

Replacing $$β_0$$

Since $$β_0 = \overline{𝐲} - β_1 \overline{𝐱}$$ we have \begin{aligned} \frac 1 n \sum_i x_iy_i - (\overline{𝐲} - β_1 \overline{𝐱}) \overline{𝐱} - β_1\frac 1 n \sum_i x_i^2 &=0\\ \frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲} + β_1 \overline{𝐱}^2 - β_1\frac 1 n \sum_i x_i^2 &=0\\ \left(\frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲}\right) - β_1 \left( \frac 1 n \sum_i x_i^2 - \overline{𝐱}^2\right) &=0\\ \end{aligned}

We have seen this before

The best $$β_1$$ is the solution of $\left(\frac 1 n \sum_i x_iy_i - \overline{𝐱}\overline{𝐲}\right) - β_1 \left( \frac 1 n \sum_i x_i^2 - \overline{𝐱}^2\right) =0$ We have seen these formulas before: they are the covariance and the variance $\text{cov}(𝐱, 𝐲) - β_1 \text{var}(𝐱) =0$

Solution

If $\text{cov}(𝐱, 𝐲) - β_1 \text{var}(𝐱) =0$ then, as long as $$\text{var}(𝐱) ≠ 0$$, the best $$β_1$$ is $β_1 = \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)}$

Summary

The best straight line is

$y = β_0 + β_1 x$ where \begin{aligned} β_0 &= \overline{𝐲} - β_1 \overline{𝐱}\\ β_1 &= \frac{\text{cov}(𝐱, 𝐲)}{\text{var}(𝐱)} \end{aligned}
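The two formulas translate directly into code. The following is a minimal sketch with hypothetical data and function names. It computes $$β_1$$ first, because $$β_0$$ depends on it, and then checks that small changes to either coefficient do not reduce the squared error.

```python
def mean(v):
    """Arithmetic mean of a list of numbers."""
    return sum(v) / len(v)

def cov(x, y):
    """Covariance with the 1/n definition."""
    mx, my = mean(x), mean(y)
    return mean([(xi - mx) * (yi - my) for xi, yi in zip(x, y)])

def fit_line(x, y):
    """Return (beta0, beta1) for the least-squares straight line."""
    beta1 = cov(x, y) / cov(x, x)      # cov(x, x) is var(x)
    beta0 = mean(y) - beta1 * mean(x)
    return beta0, beta1

def se(beta0, beta1, x, y):
    """Squared error of the line beta0 + beta1*x."""
    return sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))

# Hypothetical measurements that roughly follow a straight line
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

beta0, beta1 = fit_line(x, y)
best = se(beta0, beta1, x, y)
print(beta0, beta1, best)

# Moving either coefficient away from the solution increases the error
print(se(beta0 + 0.1, beta1, x, y) > best)
print(se(beta0, beta1 - 0.1, x, y) > best)
```

The function divides by the variance of `x`, so it fails when all the values of `x` are equal, which is the case where no single best line exists.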