Class 2: Understanding the problem

Systems Biology

Andrés Aravena, PhD

October 05, 2021

What do we want to know?

Is this diet making me lose weight?
Is this diet making rats live longer?
- and why?
Are men heavier than women?
- and why?
Is this gene involved in early heat-shock response?
What is the elasticity constant of this coil?

Most of the good questions are qualitative

It may take some time to find the ultimate question

We are used to looking at the procedure, not the final outcome

Hint: ask “why do we want to know this?”

Ask “why” five times

What do we need to evaluate?

Change of weight under different diets
Change of gene expression for different diet/age
Relationship between weight and sex and height
Gene expression before and after heat-shock
Coil length under different forces

What are their dependencies?

It is important to be explicit here

This is Hooke’s law

What are their dependencies?

It is important to be explicit here

This is “are boys heavier than girls?”

What are their dependencies?

It is important to be explicit here

This is “how does diet affect gene expression”

Try to see the big picture

We may be tempted to draw this

But this limits us to a fixed gene. Better to think big

What we really want is…

and we want to see inside the “?” boxes

Write as a formula

Thousands of years of experience show that it is good to write a formula

For example, coil length \(l\) depends on the applied force \(f\)

\[l(f)\]

Formulas are easier to send by email/SMS/WhatsApp

We should see the formula as another view of the drawing

Same with weight

The weight \(w\) depends on the sex \(s\) and the height \(h.\) \[w(h, s)\]

Then we can answer questions like \[Δw(h) = w(h, “Male”) - w(h, “Female”)\] and we realize that our answer depends on the height
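
As a minimal sketch, the formula can be read as a function with two arguments. The body of `w` below is a made-up placeholder (the coefficients are invented, not data); only the way `delta_w` is built from `w` mirrors the formula.

```python
def w(h, s):
    """Hypothetical weight model (kg) for height h (cm) and sex s.

    The coefficients are invented placeholders, not real data.
    """
    slope = {"Male": 0.9, "Female": 0.8}[s]
    return slope * (h - 100)


def delta_w(h):
    """Difference between sexes at a fixed height: w(h, Male) - w(h, Female)."""
    return w(h, "Male") - w(h, "Female")


# The answer is itself a function of height, not a single number.
for h in (150, 170, 190):
    print(h, round(delta_w(h), 1))
```

With this placeholder the difference grows with height, which is exactly the kind of dependency the formula makes visible.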

Gene expression

Gene expression \(e\) depends on age \(a,\) diet \(d,\) and gene \(g\) \[e(a, d, g)\]

Then the change in gene expression due to diet is \[Δe(a, g) = e(a, “AL”, g) - e(a, “CR”, g)\]

We can see that \(Δe\) depends only on \(a\) and \(g\)

The relationship is true for all \(a\) and \(g\)
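
A small sketch of the same idea, assuming the expression values are stored in a table indexed by (age, diet, gene). The ages, gene names, and numbers are invented for illustration.

```python
# Hypothetical expression values e(a, d, g); all numbers are invented.
e = {
    (6, "AL", "geneA"): 10.0, (6, "CR", "geneA"): 8.0,
    (6, "AL", "geneB"): 3.0, (6, "CR", "geneB"): 3.5,
    (24, "AL", "geneA"): 14.0, (24, "CR", "geneA"): 9.0,
    (24, "AL", "geneB"): 2.0, (24, "CR", "geneB"): 3.0,
}


def delta_e(a, g):
    """Change in expression due to diet: e(a, AL, g) - e(a, CR, g)."""
    return e[(a, "AL", g)] - e[(a, "CR", g)]


# The diet argument has been used up: the result is indexed by (a, g) only.
ages = sorted({a for a, _, _ in e})
genes = sorted({g for _, _, g in e})
table = {(a, g): delta_e(a, g) for a in ages for g in genes}
print(table)
```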

We want to know these functions

If we know the function inside \(e(a, d, g)\) we could answer many questions

We will try to do that later

For this class we do not care what is inside each function, just how they are related to the questions we ask

Write the formulas

They are important tools of communication

With your collaborators
With your readers
With yourself

Exercise: write the formula for this

(thanks to Elif Öztemiz)

What are we really measuring?

number on the scale at some time during the day
light intensity in a microarray
CT values for samples taken at different times
number of centimeters in a measuring tape

Give them a name

As before, let’s be explicit about the dependencies

For example, we measure weight \(w_M(h,s,r,t)\) of a person with a given height \(h\) and sex \(s\) in several replicas \(r\) using technique \(t\)

Technique here means the experimental procedure, such as the scale (weighing apparatus) used

How to go from measured to evaluated

We want to know the true relationship \(w(h,s)\)

But we cannot see it directly

We can only see the experimental results

Therefore we need to understand how they are connected

We decompose the measured value

The real value \(w\) “plus” the variability \(v\) \[w_M(h,s,r,t) = w(h,s)⊕v(h,s,r,t)\]

The ⊕ symbol may be a + or a × or something else

We will find the correct one later

For now we take it as the normal sum +
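
A side note on why the choice between + and × may be less dramatic than it looks. This is a general property of logarithms, not necessarily the decomposition we will settle on. If the combination is a product, and both \(w\) and \(v\) are positive, then taking logarithms turns it into a sum

\[w_M = w × v \quad\Longrightarrow\quad \log w_M = \log w + \log v\]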

Results change every time

The value will be different for each replica and for each technique

To get rid of the technique variability, we normalize our results

Calibrating the instruments
Using positive and negative controls as references

After normalization we have…

Normalized data depends on the real value \(w\) “plus” the variability \(v\) \[w_N(h,s,r) = w(h,s)⊕v(h,s,r)\]

After normalization, all variability is random

We will see that this variability has two sources: noise and diversity

What is noise?

For a coil the variability is easy \(l_N(f,r) = l(f) + v(r)\)

The true function \(l\) is simply \(k⋅f\)

The only variability comes from the measurement error, a.k.a. noise

We write \(n\) instead of \(v\) to represent noise \[l_N(f,r) = l(f) + n(r)\]

Measurement error is random

Typically noise follows a normal distribution with mean 0 \[n(r) \sim \mathcal{N}(0,σ^2)\]

The variance is a measure of the instrument quality

Better instruments have smaller \(σ^2\)

Often (but not always) the noise is independent of the value measured
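
A minimal simulation of this model, assuming the true length is proportional to the force. The constant `K` and the noise level `SIGMA` are hypothetical values chosen only for illustration.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

K = 0.5      # hypothetical constant relating force to length
SIGMA = 0.2  # hypothetical standard deviation of the instrument noise


def l(f):
    """True coil length for force f, assumed proportional to the force."""
    return K * f


def l_N(f):
    """One normalized measurement: true value plus Normal(0, SIGMA^2) noise."""
    return l(f) + random.gauss(0, SIGMA)


# Five replicas at the same force: values scattered around l(10) = 5.0
measurements = [l_N(10) for _ in range(5)]
print(measurements)
```

Here the noise does not depend on `f`, which matches the "often, but not always" case.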

Handling noise

In this case we use the classical statistical tools

For example, we take the average of \(m\) replicas (we write \(m\) to avoid confusion with the noise \(n\)) \[\frac{1}{m}\sum_r l_N(f,r) = l(f) + \frac{1}{m}\sum_r n(r)\]

We will find that \[\frac{1}{m}\sum_r l_N(f,r) \sim \mathcal{N}(l(f), σ^2/m)\]
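
This claim can be checked numerically. The sketch below repeats the "measure several replicas and average" experiment many times and compares the variance of the averages with \(σ^2\) divided by the number of replicas. The true length, noise level, and replica count are hypothetical.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

TRUE_LENGTH = 5.0  # hypothetical l(f) for one fixed force
SIGMA = 0.2        # hypothetical noise standard deviation
N_REPLICAS = 4


def mean_of_replicas():
    """Average of N_REPLICAS noisy measurements of the same true length."""
    values = [TRUE_LENGTH + random.gauss(0, SIGMA) for _ in range(N_REPLICAS)]
    return sum(values) / len(values)


# Repeat the whole experiment many times to see how the average itself varies.
means = [mean_of_replicas() for _ in range(20000)]

observed_var = statistics.pvariance(means)
expected_var = SIGMA**2 / N_REPLICAS
print(observed_var, expected_var)
```

The two printed numbers should be close, and the averages should be centered on the true length.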

We get a confidence interval

Everything that we measure has a margin of error

We should consider the margin of error on every step of the analysis

Better instruments, and technical replicas, give narrower intervals

and a narrow interval is good
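
A sketch of one common way to get such an interval from technical replicas, assuming Normal noise. The replica values are invented, and the factor 1.96 is the Normal approximation for a 95% interval.

```python
import math
import statistics

# Hypothetical technical replicas of one coil length (cm); invented numbers.
replicas = [5.1, 4.9, 5.3, 4.8, 5.0, 5.2]

n = len(replicas)
mean = statistics.mean(replicas)
sem = statistics.stdev(replicas) / math.sqrt(n)  # standard error of the mean

# Approximate 95% interval using the Normal quantile 1.96.
# With this few replicas a Student t quantile would give a wider interval.
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"{mean:.2f} cm, 95% CI approx. [{low:.2f}, {high:.2f}]")
```

More replicas, or a smaller spread between them, make `sem` smaller and the interval narrower.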

This is why we have technical replicas

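Normalization does not remove the instrumental noise

Good protocol is to measure several times and take the average

That reduces the noise level of the average: its standard deviation falls from \(\sigma\) to \(\sigma\) divided by the square root of the number of replicas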

But that may not be the most important part

Biology is harder than physics

Every individual is different, probably due to many reasons

When we measure the weight of a person, the weight depends on the biological diversity \(b\) and on the noise \(n\) \[w_N(h,s,r) = w(h,s)⊕n(r)⊕b(h,s,r)\]

The biological diversity is often much larger than the noise

And it may not follow a Normal distribution

We can still do science

The real challenge comes from the biological diversity

Even with perfect instruments (without noise), we have \[w_N(h,s,r) = w(h,s)⊕b(h,s,r)\] so \(w(h,s)\) represents the average case for our population

The average may not be very common
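
A toy simulation of this last point. The population is invented: two equally sized subgroups with different typical values, so the diversity is not Normal. The average falls between the groups, where few individuals are found.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

# Invented population: two subgroups with different typical values.
group_a = [random.gauss(60, 3) for _ in range(5000)]
group_b = [random.gauss(80, 3) for _ in range(5000)]
population = group_a + group_b

average = statistics.mean(population)


def fraction_near(center, width=2.0):
    """Fraction of individuals within +/- width of center."""
    return sum(abs(x - center) <= width for x in population) / len(population)


print(f"average: {average:.1f}")
print(f"fraction within 2 units of the average: {fraction_near(average):.3f}")
print(f"fraction within 2 units of 60: {fraction_near(60):.3f}")
```

In this invented case far more individuals sit near 60 than near the average, so the average describes the population without describing a typical individual.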

Summary

Measured = Real ⊕ Diversity ⊕ Instrument ⊕ Noise