Blog of Andrés Aravena

Homework 2

22 September 2019. Deadline: Friday, 4 October, 14:00.

We will explore some methods to find which parts of a text are similar to a pattern. For instance, the text can be a genome, and the pattern can be a gene or a motif, but the same ideas apply to any text and any fixed pattern.

For this, we need a dissimilarity function \[ d: \mathcal S \times \mathcal S \to \mathbb R\] such that, for all \(x, y, z \in \mathcal S,\) we have

that is, the dissimilarity is never a negative number (i.e. it is always positive or zero), the dissimilarity if anything to itself is 0 (i.e everything is similar to itself), the dissimilarity is independent of the order, and the direct comparison of two elements is always better (or the same) than comparing to a third element.

In this case the set \(\mathcal S\) will be the set of character vectors, or strings. That is, we will compare words and parts of texts, such as genes, genomes, motifs and proteins.

  1. Write a function (in any formal language), taking two strings of the same size —named x and y—, and returning the Hamming distance of the two sequences. A good name for this function is Hamming.

    Remember that the Hamming distance is just the number of letters that are different between the two strings. It is the number of mismatches. If the strings represent DNA, the Hamming distance represents the number of mutations.

  2. Write another function that takes two strings of different size —named pattern and text—, and calculates the Hamming distance between pattern and the corresponding substring of text at each location. For example in R language, if pattern length is m, then the value at location k will be

    ans[k] <- Hamming(pattern, text[k:(k+m-1)])

    If you use Python or other similar language, the indices will be a little different. They key part of this question is to find the valid range for k.

  3. Write a third function, based on the previous one, that takes two strings of different size —named pattern and text— and a number named thr, and returns a list of all locations in text whose distance to pattern is less or equal that the threshold thr.

Deadline: Friday, 4 October, 14:00.

Originally published at