Class 16: Gene expression analysis

Bioinformatics

Andrés Aravena

January 22, 2021

Measuring Gene Expression

More precisely, mRNA concentration

What is the question?

We want to know

Which genes are being expressed
How much of each gene is being expressed
How does expression change
- In time
- Under different conditions
- Between strains/mutants/cell lines

The Big Assumption

Measuring protein concentration is hard

We assume that protein concentration is proportional to mRNA concentration

Which genes are being transcribed
How much of each gene is being transcribed
How does transcription change
- In time
- Under different conditions
- Between strains/mutants/cell lines

qPCR

If you have primers for each gene

specific to each gene
thermodynamically stable
efficient

Raw data: CT value for each gene/condition
and CT value for calibration reference

Hybridization methods

Southern/Northern/Western blot can detect, but not quantify
(I think so. I’m not a biologist)

Instead, we have macro- and microarrays

Raw data: Light intensity (luminescence) in one or more wave length

This is measured in arbitrary units, and is a number between 0 and 65536
(that is, a 16-bits value)

Problem: it is hard to compare genes

Light intensity depends on hybridization rate
Hybridization depends on oligo thermodynamics
- GC content can have a big effect
secondary structure blocks hybridization

Solution

Compare each gene with itself in a different condition

In other words, evaluate differential expression

we cannot compare gene expression between genes
we can compare differential gene expression between genes

RNAseq

mRNA is retro-transcribed and fragmented.
Fragments are sequenced. Reads are aligned to reference genome

Raw data: SAM/BAM file with location of each read in the reference genome

Processed data: Number of reads per gene

Problem: it is hard to compare genes

longer genes get more reads
different conditions get different number of reads

Solution 1: Normalization

Divide by library size
Divide by gene size

Normalization: RPKM, FPKM

Reads per kilobase per million:

Count reads in a sample
- divide that number by 1E6 (“per million”)
Divide the read counts by the “per million” scaling factor
- This normalizes for sequencing depth
- We get reads per million (RPM)
Divide the RPM values by the length of the gene, in kilobases
This gives you RPKM.

FPKM is the same, but with fragments instead of reads

https://btep.ccr.cancer.gov/question/faq/what-is-the-difference-between-rpkm-fpkm-and-tpm/

Normalization: TPM

Transcripts per million: similar to RPKM and FPKM
Just a different order of operations

Divide the read counts by the length of each gene in kilobases
- This gives reads per kilobase (RPK)
Count up all the RPK values in a sample and divide this number by 1E6
- This is “per million” scaling factor.
Divide the RPK values by the “per million” scaling factor
- This gives TPM

Solution 2

Compare each gene with itself in a different condition

In other words, evaluate differential expression

we cannot compare gene expression between genes
we can compare differential gene expression between genes

Differential Expression

Let’s say we have two conditions:

wild-type cells
mutant cells

Let’s take wild type as the base condition.

Fold change

For a given gene G, we want to calculate \[\frac{\text{Expression}(M)}{\text{Expression}(WT)}\]

This would be a number between 0 and ∞

When the gene is under-expressed, we get something between 0 and 1

When it is over-expressed, we get something between 1 and ∞

Log fold change

It is better to have a symmetrical result \[\log_2 \left(\frac{\text{Expression}(M)}{\text{Expression}(WT)}\right)\]

We use base 2 to get fold-change

Positive numbers indicate over-expression
Negative numbers indicate under-expression
Zero indicates no-change

Data source: NCBI GEO

Gene Expression Omnibus

Platforms
Samples
Series
Data Set
Profile

NCBI GEO data structure

Source: https://www.ncbi.nlm.nih.gov/geo/info/overview.html

Application: gene expression analysis

Let’s analyze gene_expression.csv

Each column is a gene, each row is a sample, taken from NCBI GSE95670. The values are absolute expression, we want to know

The fold-change in expression
If that change is real, or just noise

Factors indicate the condition

The first three rows are mutants
The last three rows are wild-type, or control

We can add a column with a factor describing the condition

m$condition <- factor(
    rep(c("Mutant", "Control"), c(3, 3)),
    levels=c("Control","Mutant"))

Now we can do a model for each gene

We want to measure fold-change. That means, we want to calculate log_2_(mutant/wild type)

gene_model <- lm(log2(LOC100652730) ~ condition, data=m)
gene_model


Call:
lm(formula = log2(LOC100652730) ~ condition, data = m)

Coefficients:
    (Intercept)  conditionMutant  
         7.2767           0.2465

Statistical analysis

summary(gene_model)


Call:
lm(formula = log2(LOC100652730) ~ condition, data = m)

Residuals:
      1       2       3       4       5       6 
 0.5004 -0.2829 -0.2175  0.3705 -0.2239 -0.1466 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       7.2767     0.2211  32.912 5.08e-06 ***
conditionMutant   0.2465     0.3127   0.788    0.475    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.383 on 4 degrees of freedom
Multiple R-squared:  0.1344,    Adjusted R-squared:  -0.08195 
F-statistic: 0.6213 on 1 and 4 DF,  p-value: 0.4747

Confidence intervals

confint(gene_model)

                     2.5 %   97.5 %
(Intercept)      6.6628636 7.890598
conditionMutant -0.6216828 1.114595

Check if the confidence interval contains 0

Other cases

lm(log2(IZUMO4) ~ condition, data=m) %>% confint(parm = "conditionMutant")

                    2.5 %   97.5 %
conditionMutant 0.5694141 3.608096

lm(log2(RHOU) ~ condition, data=m) %>% confint(parm = "conditionMutant")

                    2.5 %     97.5 %
conditionMutant -1.433264 -0.2166465

lm(log2(STC1) ~ condition, data=m) %>% confint(parm = "conditionMutant")

                  2.5 %   97.5 %
conditionMutant 4.45331 8.712368

lm(log2(PDGFC) ~ condition, data=m) %>% confint(parm = "conditionMutant")

                    2.5 %    97.5 %
conditionMutant -4.362937 -2.184466