There are several tools used to map reads to a genome
bwa
bowtie
bowtie2
hisat2
Each one has several options, that you can read in their manual
All of them produces SAM/BAM files
Using featureCounts
(or similar) tool, we transform the SAM file into a matrix of counts
The element \(c_{ij}\) is the count of reads for gene \(i\) under library \(j\)
This value depends on
We care about the first one. The last one depends on technical issues
We can normalize the counts by the library size
The library size, i.e. the number of reads in the library, is \[N_j = \sum_i c_{ij}\]
So the relative number of reads is \[\frac{c_{ij}}{N_j}\]
RNAseq libraries are large. They have millions of read
Thus, each \[\frac{c_{ij}}{N_j}\] is a very small number.
We scale them by \(10^6\) and get Reads Per Million \[\frac{c_{ij}⋅10^6}{N_j}\]
(same as we scale \(0.000004m\) to \(4μm\))
If the reads are taken randomly from each mRNA, then longer genes have higher probability of being sequenced
To correct that we divide by the gene length \(l_i\) \[\frac{c_{ij}⋅10^6}{N_j l_i}\]
Again, this is a small number, so we scale by \(10^3\) and we get Reads Per Kilo per Million \[RPKM_{ij} = \frac{c_{ij}⋅10^9}{N_j l_i}\]
It is easy to see that \[RPKM_{ij} = \frac{c_{ij}⋅10^6}{N_j} \frac{10^3}{l_i} = \frac{c_{ij}⋅10^3}{l_i} \frac{10^6}{N_j}\] so we can divide by the gene length first, and the library size later
When we use paired-end sequencing, we typically get two reads for each fragment
But sometimes we get only one read
It makes sense to count fragments instead of reads
This is basically handled by the counting step
FPKM is the same formula as RPKM, but counting fragments instead of reads
We want to compare at least two libraries
The sum of all RPKM in library \(j_1\) is \[\sum_i RPKM_{i{j_1}} = \frac{10^9}{N_{j_1}}\sum_i \frac{c_{ij}}{l_i}\]
and, in general, this number is different from the sum for another library \(j_2\) \[\sum_i RPKM_{i{j_1}} ≠ \sum_i RPKM_{i{j_2}}\]
That is not good for statistical tests
To avoid this issue, we can normalize by the gene length before calculating totals
Let’s call \(Q_i\) to the sum of_gene-length-corrected_counts \[Q_j = \sum_i \frac{c_{ij}}{l_i}\]
Then Transcripts per Million for gene \(i\) in library \(j\) is \[TPM_{ij}=\frac{c_{ij}}{l_i⋅Q_j}\]
Now we have \[\sum_i TPM_{ij}=\frac{1}{Q_j}\sum_i \frac{c_{ij}}{l_i}=1\]
RPKM/FPKM are good for comparing between genes in the same library
TPM is better for comparing the same gene between libraries
This tool is based on a different statistical hypothesis
Instead of assuming log-normal distribution, it assumes a negative binomial distribution
Both distributions are similar, but it allows us to see the data from another point of view
In particular, it does not require normalization of counts
sw480_APC_1 sw480_APC_2 sw480_APC_3 sw480_CONTROL_1 sw480_CONTROL_2
NM_000014 1 1 3 1 2
NM_000015 1 0 0 0 2
NM_000016 1038 1126 1098 1285 1037
NM_000017 133 111 103 171 136
NM_000018 8699 8232 8364 8888 8426
NM_000019 2094 2070 1879 2645 2248
sw480_CONTROL_3 sw480.1 sw480.2 sw480.3 cds_length gene_name
NM_000014 2 3 2 2 4653 A2M
NM_000015 1 1 3 1 1317 NAT2
NM_000016 1306 1022 1059 1057 2603 ACADM
NM_000017 174 143 147 125 1917 ACADS
NM_000018 9188 10982 10844 11450 2292 ACADVL
NM_000019 2775 2132 2164 2087 2134 ACAT1
sample_info <- data.frame(
condition=factor(rep(c("APC","CONTROL", "sw480"), each=3),
levels = c("CONTROL", "APC", "sw480")),
row.names = colnames(read.counts[,1:9]))
sample_info
condition
sw480_APC_1 APC
sw480_APC_2 APC
sw480_APC_3 APC
sw480_CONTROL_1 CONTROL
sw480_CONTROL_2 CONTROL
sw480_CONTROL_3 CONTROL
sw480.1 sw480
sw480.2 sw480
sw480.3 sw480
DESeq.ds <- DESeqDataSetFromMatrix( countData = read.counts[,1:9],
colData = sample_info, design = ~ condition)
colData(DESeq.ds)
DataFrame with 9 rows and 1 column
condition
<factor>
sw480_APC_1 APC
sw480_APC_2 APC
sw480_APC_3 APC
sw480_CONTROL_1 CONTROL
sw480_CONTROL_2 CONTROL
sw480_CONTROL_3 CONTROL
sw480.1 sw480
sw480.2 sw480
sw480.3 sw480
DataFrame with 39276 rows and 0 columns
sw480_APC_1 sw480_APC_2 sw480_APC_3 sw480_CONTROL_1 sw480_CONTROL_2
NM_000014 1 1 3 1 2
NM_000015 1 0 0 0 2
NM_000016 1038 1126 1098 1285 1037
NM_000017 133 111 103 171 136
NM_000018 8699 8232 8364 8888 8426
NM_000019 2094 2070 1879 2645 2248
sw480_CONTROL_3 sw480.1 sw480.2 sw480.3
NM_000014 2 3 2 2
NM_000015 1 1 3 1
NM_000016 1306 1022 1059 1057
NM_000017 174 143 147 125
NM_000018 9188 10982 10844 11450
NM_000019 2775 2132 2164 2087
Now we verify that we did not lose any read
[1] TRUE
Estimate a factor to correct each library
sw480_APC_1 sw480_APC_2 sw480_APC_3 sw480_CONTROL_1 sw480_CONTROL_2
1.0059487 0.9570850 0.9454599 1.0364710 0.9067061
sw480_CONTROL_3 sw480.1 sw480.2 sw480.3
1.0709679 1.0344438 1.0359450 1.0370205
It is just a factor of scale
Let’s avoid log2(0)
by adding a pseudo-count
First, gene-wise dispersion estimates across all samples
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
Then a t-test (assuming different variances)
topTable
log2 fold change (MLE): condition sw480 vs CONTROL
Wald test p-value: condition sw480 vs CONTROL
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
NM_000014 1.889930 0.433904 1.1851497 0.366118 7.14277e-01
NM_000015 0.995613 0.655316 1.6196758 0.404597 6.85774e-01
NM_000016 1111.347112 -0.251112 0.0621873 -4.038000 5.39089e-05
NM_000017 137.250467 -0.254684 0.1441359 -1.766968 7.72335e-02
NM_000018 9407.540815 0.281006 0.0384621 7.306038 2.75134e-13
NM_000019 2224.066114 -0.307231 0.0464157 -6.619114 3.61357e-11
padj
<numeric>
NM_000014 NA
NM_000015 NA
NM_000016 2.67474e-04
NM_000017 1.56518e-01
NM_000018 4.40436e-12
NM_000019 4.61581e-10
results(DESeq.ds, contrast=c("condition", "APC", "CONTROL"),
independentFiltering = TRUE, alpha = 0.05) |> head()
log2 fold change (MLE): condition APC vs CONTROL
Wald test p-value: condition APC vs CONTROL
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
NM_000014 1.889930 0.0507329 1.2335629 0.0411271 9.67195e-01
NM_000015 0.995613 -1.4665320 1.8959505 -0.7735076 4.39222e-01
NM_000016 1111.347112 -0.0982781 0.0619814 -1.5856055 1.12829e-01
NM_000017 137.250467 -0.4190118 0.1475355 -2.8400750 4.51029e-03
NM_000018 9407.540815 -0.0189505 0.0387184 -0.4894445 6.24527e-01
NM_000019 2224.066114 -0.2909752 0.0466153 -6.2420551 4.31859e-10
padj
<numeric>
NM_000014 9.83238e-01
NM_000015 5.45796e-01
NM_000016 1.73850e-01
NM_000017 9.33849e-03
NM_000018 7.18658e-01
NM_000019 1.76809e-09