Blog of Andrés Aravena
Bioinfo:

Homework 3

06 October 2022, by Andrés Aravena, Ph.D. Deadline: Friday, 14 October, 9:00.

This week we have a few mandatory questions, some bonus questions, and optional mathematics for your amusement. Answers to bonus questions are optional, and give extra score if they are right. You do not lose score if they are wrong. There is nothing to lose, so it is worth trying.

Homework

  1. Write an Entrez query to get all 16S nucleotide sequences from E. coli with a length of at least 1400 base pairs.

  2. Write an Entrez query to get all complete Globin protein sequences. The sequence length should be between 200 and 1000 amino acids. The title should contain neither “partial” nor “domain-containing”.

  3. Make a Hamming distance calculator in Excel or Google Sheets.

  4. How many comparisons do you need to calculate the Hamming distance between all genetic codes?

  5. Prepare a DotPlot in Excel or Google Sheets. Use it to compare the following sequences.

    • ABCDEFGHIJKLMNOPQRSUTUVWXYZ
    • ABCDEAFGNIJKLOPQRSYTUVWXXZ
  6. Use the previous answer to find the Levenshtein distance between the two sequences.

Mathematical definition of “distance”

(This part is optional. It is useful in life, but it is not necessary for this course.)

Distance is a function taking pairs of objects and returning a number. It is the length of the shortest path between two points.

Notice that the “shortest path” depends on which movements are allowed. For instance, what is the distance between our campus and Taksim Square?
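To make this concrete, here is a minimal Python sketch. The coordinates are made up for illustration, and the function names `straight_line` and `city_block` are hypothetical. It compares the same two points under two sets of allowed movements: flying in a straight line, or moving only along a grid of streets.

```python
import math

def straight_line(p, q):
    """Shortest path length when any movement is allowed."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def city_block(p, q):
    """Shortest path length when only horizontal and vertical moves are allowed."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

# Made-up coordinates in kilometres, for illustration only
start = (0.0, 0.0)
end = (3.0, 4.0)

print(straight_line(start, end))  # 5.0
print(city_block(start, end))     # 7.0
```

The two points are the same in both cases, but the answer changes with the rules of movement.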

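To be a distance, a function \(d\) needs to obey the following rules for all objects \(x\), \(y\), and \(z\):

  1. It is never negative: \(d(x,y) \geq 0\)

  2. It is zero only when both objects are the same: \(d(x,y) = 0\) if and only if \(x = y\)

  3. It is symmetric: \(d(x,y) = d(y,x)\)

  4. It obeys the triangular inequality: \(d(x,z) \leq d(x,y) + d(y,z)\)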
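A candidate function can be tested against these rules on a small sample of objects. The sketch below assumes the usual four rules of a distance (never negative, zero only for identical objects, symmetric, triangular inequality). The name `looks_like_distance` is hypothetical. Passing the test is only evidence, not a proof, because the rules must hold for all objects and not only for the sample.

```python
from itertools import product

def looks_like_distance(d, points):
    """Check the four usual distance rules on a finite sample of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                    # never negative
            return False
        if (d(x, y) == 0) != (x == y):     # zero exactly when the objects are equal
            return False
        if d(x, y) != d(y, x):             # symmetric
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z):    # triangular inequality
            return False
    return True

sample = [-2, 0, 1, 5]
print(looks_like_distance(lambda a, b: abs(a - b), sample))    # True
print(looks_like_distance(lambda a, b: (a - b) ** 2, sample))  # False
```

The squared difference fails only the last rule: going from -2 to 5 directly costs 49, while passing through 0 costs 4 + 25 = 29.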

Bonus questions

  1. Can you see why the last property is called the triangular inequality?

  2. Can you prove that “Hamming distance” is indeed a “distance”, according to the definition given above?

  3. Write a hamming_dist(x,y) function in R, Python, or any other computer programming language.

  4. Use the previous answer to calculate the distance between all genetic codes. The answer should be a matrix with one row and one column for each genetic code.

  5. Write a script using edirect tools or any other library like rentrez to download the first 40 amino acids of each protein in question (2).

Deadline: Friday, 14 October, 9:00.

Originally published at https://anaraven.bitbucket.io/blog/2022/bioinfo/homework03.html