Blog of Andrés Aravena
Course Homepage:

# Bioinformatics

## Genomics and DNA analysis

17 September 2019

The main subject is “metagenomics”. We will learn how to handle the output of DNA sequencing machines, how to assemble the chromosome, how to find genes and how to determine the probable function of the proteins they encode. If time allows, we will also study phylogenetic trees and microarray analysis.

Classes are held each week on Tuesdays 13:00–17:00 (at the Physics Dept. Computer Lab) and on Fridays 14:00–17:00 (at the Astronomy Dept. Computer Lab). Most of the practice is done on Linux servers, so you may also be interested in other courses on that subject.

This page will be updated during the semester, so please check it regularly. Also, please register at the course forum, https://groups.google.com/d/forum/iu-bioinfo. You can also participate by email.

# Classes

Here you will find the slides used in class. They are usually not published immediately, so you had better take good notes. We recommend taking notes with pen and paper using the Cornell Method.

• Searching patterns in text (Sept 20)
  • Computational cost
  • Sets and logic
  • Relations
  • Distance, dissimilarity
• Edit distance. Searching with mismatches (Sept 24)
  • Probabilities. Joint probability, Bayes' theorem
• Needleman–Wunsch method for global alignment. Searching with gaps (Oct 4)
  • Global versus semi-global alignment. Some gaps are not bad
  • Bio+Tech+noise
  • Application of Bayes' theorem, solution of homework using a tree diagram
  • Log odds ratio
  • How to assign a score to each substitution
• Class 1: Why do we care about Bioinformatics? (Sep 17, 2019). A personal perspective on Metagenomics and Bioinformatics [Slides].

• Class 7: Scores and probabilities. (Oct 11, 2019). Statistics [Slides].

• Class 8: Understanding BLAST. (Oct 14, 2019). Using NCBI website [Slides].

• Class 12: DNA sequencing and assembly. (Nov 19, 2019). How can we know the DNA sequence? [Document].

• Class 13: Assembly Workshop. (Nov 22, 2019). Based on the extension activities of project Nucleo Milenio P01-005 “Information and Randomness” at Universidad de Chile. Original date October 14, 2005 [Document].

• Class 15: SAM, BAM & BWA. (Nov 26, 2019). Also, a summary of the homework answers [Slides].

• Class 17: Primer Design. (Dec 3, 2019). How to calculate Melting Temperature [Slides].

• Class 19: NCBI Entrez. (Dec 10, 2019). Using NCBI website [Slides].

• Class 21: Motif Finding and Identification. (Dec 17, 2019). Finding Motifs and Taxonomy Identification without alignment [Slides].

• Class 22: Alignment free methods. (Dec 20, 2019). Finding Motifs and Taxonomy Identification without alignment [Slides].

• Example of Position Specific Scoring Matrix on Google Sheets.
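The log odds ratio and Position Specific Scoring Matrix topics listed above can be illustrated with a few lines of Python. This is only a minimal sketch, not the spreadsheet used in class: the function names (`build_pssm`, `score`, `best_hit`), the example sites, the uniform background and the pseudocount of 1 are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def build_pssm(sites, background=None, pseudocount=1.0):
    """Build one {base: log2 odds score} dictionary per column of the aligned sites."""
    if background is None:
        # Assumption: every base is equally likely in the background.
        background = {base: 0.25 for base in ALPHABET}
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(ALPHABET)
    pssm = []
    for column in range(width):
        counts = Counter(site[column] for site in sites)
        scores = {}
        for base in ALPHABET:
            # The pseudocount avoids log(0) for bases never seen in a column.
            frequency = (counts[base] + pseudocount) / total
            scores[base] = math.log2(frequency / background[base])
        pssm.append(scores)
    return pssm


def score(pssm, word):
    """Sum the column scores of a word that has the same width as the matrix."""
    return sum(column[base] for column, base in zip(pssm, word))


def best_hit(pssm, sequence):
    """Return the 0-based start of the highest-scoring window in sequence."""
    width = len(pssm)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda i: score(pssm, sequence[i:i + width]))


# Hypothetical aligned binding sites, invented for this example.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pssm = build_pssm(sites)
print(score(pssm, "TATAAT"))   # positive: looks like the motif
print(score(pssm, "GCGCGC"))   # negative: looks like background
print(best_hit(pssm, "GGCCTATAATGGCC"))
```

A positive score means the word is more likely under the motif model than under the background, and a negative score means the opposite. Adding logarithms is equivalent to multiplying the per-column odds ratios.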

# Homework

All homework must be sent to andres.aravena+bioinfo@istanbul.edu.tr before the deadline to receive a grade. Use this address only for homework and exams: it is processed automatically, so I do not see any questions sent there. Send your questions to the forum instead.

• Homework 4 (Practical)
Preparation for final exam.
• Homework 3
Preparation for the midterm exam.
• Homework 2 (Deadline: Friday, October 4, at 14:00).
We will explore some methods to find which parts of a text are similar to a pattern. For instance, the text can be a genome, and the pattern can be a gene or a motif, but the same ideas apply to any text and any fixed pattern.
• Homework 1 (Deadline: Tuesday, September 24, at 13:00).
Write a function to find the location of a word in a large text.
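The ideas behind Homework 1 and Homework 2 can be sketched as follows. This is one possible illustration, not the expected answer: the function names are hypothetical, positions are 0-based, and "similar" is assumed here to mean a limited number of mismatches (Hamming distance), without gaps.

```python
def find_exact(text, word):
    """Return every 0-based position where word occurs in text (overlaps allowed)."""
    positions = []
    for start in range(len(text) - len(word) + 1):
        if text[start:start + len(word)] == word:
            positions.append(start)
    return positions


def hamming(a, b):
    """Count the positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(text, pattern, max_mismatches):
    """Return (position, mismatches) for every window within the allowed distance."""
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        window = text[start:start + len(pattern)]
        distance = hamming(window, pattern)
        if distance <= max_mismatches:
            hits.append((start, distance))
    return hits


print(find_exact("GATTACAGATTACA", "TTA"))          # [2, 9]
print(find_with_mismatches("ACGTACGA", "ACGT", 1))  # [(0, 0), (4, 1)]
```

Both functions compare the pattern against every window of the text, so the cost grows with the length of the text times the length of the pattern. That is acceptable for a small example but slow for a whole genome.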

# Bibliography

These are some of the papers we want to read and understand during this semester. The most important ones are marked in boldface. Start by reading those.

If you find that a web link is wrong, or you find one of the missing URLs, please let me know.

## Protein Clusters

• Tatusov, R L, M Y Galperin, D A Natale, and E V Koonin. “The COG Database: A Tool for Genome-Scale Analysis of Protein Functions and Evolution.” Nucleic Acids Research 28, no. 1 (January 1, 2000): 33–36.

• Tatusov, R L, D A Natale, I V Garkavtsev, and T A Tatusova. “The COG Database: New Developments in Phylogenetic Classification of Proteins from Complete Genomes.” Nucleic Acids Research, January 1, 2001. http://nar.oxfordjournals.org/cgi/content/abstract/29/1/22.

• Tatusov, R L, N D Fedorova, J D Jackson, A R Jacobs, B Kiryutin, E V Koonin, D M Krylov, et al. “The COG Database: An Updated Version Includes Eukaryotes.” BMC Bioinformatics 4 (September 11, 2003): 41. http://www.biomedcentral.com/1471-2105/4/41.

## Assembly

• Staden, R. “A Strategy of DNA Sequencing Employing Computer Programs.” Nucleic Acids Research 6, no. 7 (1979): 2601–10. https://doi.org/10.1093/nar/6.7.2601.

• Lander, E S, and M S Waterman. “Genomic Mapping by Fingerprinting Random Clones: A Mathematical Analysis.” Genomics 2, no. 3 (April 1, 1988): 231–39. https://doi.org/10.1016/0888-7543(88)90007-9.

• Pevzner, P A, H Tang, and M S Waterman. “An Eulerian Path Approach to DNA Fragment Assembly.” Proceedings of the National Academy of Sciences of the United States of America 98, no. 17 (August 14, 2001): 9748–53.

• Chaisson, M, D Brinza, and P Pevzner. “De Novo Fragment Assembly with Short Mate-Paired Reads: Does the Read Length Matter?” Genome Research, December 3, 2008, 25.

• Sims, David, Ian Sudbery, Nicholas E. Ilott, Andreas Heger, and Chris P. Ponting. “Sequencing Depth and Coverage: Key Considerations in Genomic Analyses.” Nature Reviews Genetics 15, no. 2 (2014): 121–32. https://doi.org/10.1038/nrg3642.

• Bankevich, Anton, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, et al. “SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing.” Journal of Computational Biology 19, no. 5 (2012): 455–77. https://doi.org/10.1089/cmb.2012.0021.

• Li, Zhenyu, Yanxiang Chen, Desheng Mu, Jianying Yuan, Yujian Shi, Hao Zhang, Jun Gan, et al. “Comparison of the Two Major Classes of Assembly Algorithms: Overlap-Layout-Consensus and de-Bruijn-Graph.” Briefings in Functional Genomics 11, no. 1 (2012): 25–37. https://doi.org/10.1093/bfgp/elr035.

• Nagarajan, Niranjan, and Mihai Pop. “Sequence Assembly Demystified.” Nature Reviews Genetics 14, no. 3 (2013): 157–67. https://doi.org/10.1038/nrg3367.

• Wick, Ryan R., Mark B. Schultz, Justin Zobel, and Kathryn E. Holt. “Bandage: Interactive Visualization of de Novo Genome Assemblies.” Bioinformatics 31, no. 20 (2015): 3350–52. https://doi.org/10.1093/bioinformatics/btv383.

• Phillippy, Adam M. “New Advances in Sequence Assembly.” Genome Research 27, no. 5 (May 1, 2017): xi–xiii. https://doi.org/10.1101/gr.223057.117.

## Metagenomics

• Dina Fine Maron. “Dirty Money.” Scientific American, 2017. https://www.scientificamerican.com/article/dirty-money/.

• Jeff Leach. “Going Feral: My One-Year Journey to Acquire the Healthiest Gut Microbiome in the World,” January 2014. http://humanfoodproject.com/going-feral-one-year-journey-acquire-healthiest-gut-microbiome-world-heard/.

• Tyson, Gene W, Jarrod Chapman, Philip Hugenholtz, Eric E Allen, Rachna J Ram, Paul M Richardson, Victor V Solovyev, Edward M Rubin, Daniel S Rokhsar, and Jillian F Banfield. “Community Structure and Metabolism through Reconstruction of Microbial Genomes from the Environment.” Nature 428, no. 6978 (2004): 37–43. https://doi.org/10.1038/nature02340.

• Qin, Junjie, Ruiqiang Li, Jeroen Raes, Manimozhiyan Arumugam, Kristoffer Solvsten Burgdorf, Chaysavanh Manichanh, Trine Nielsen, et al. “A Human Gut Microbial Gene Catalogue Established by Metagenomic Sequencing.” Nature 464, no. 7285 (March 4, 2010): 59–65. https://doi.org/10.1038/nature08821.

• Ünal, Burcu. “Phylogenetic Analysis of Bacterial Communities in Kefir by Metagenomics.” Izmir Institute of Technology, Turkey, 2008.

• Ünal, Burcu, and Alper Arslanoğlu. “Phylogenetic Identification of Bacteria within Kefir by Both Culture-Dependent and Culture-Independent Methods.” African Journal of Microbiology Research 7, no. 36 (2013): 4533–38. https://doi.org/10.5897/AJMR2013.6064.

• Handelsman, Jo. “Metagenomics: Application of Genomics to Uncultured Microorganisms.” Microbiology and Molecular Biology Reviews 68, no. 4 (2004): 669–85. https://doi.org/10.1128/MMBR.68.4.669-685.2004.

• Baker, Brett J., and Jillian F. Banfield. “Microbial Communities in Acid Mine Drainage.” FEMS Microbiology Ecology 44, no. 2 (2003): 139–52. https://doi.org/10.1016/S0168-6496(03)00028-X.

• Wooley, John C., and Yuzhen Ye. “Metagenomics: Facts and Artifacts, and Computational Challenges.” Journal of Computer Science and Technology 25, no. 1 (2009): 71–81. https://doi.org/10.1007/s11390-010-9306-4.

• Sharpton, Thomas J. “An Introduction to the Analysis of Shotgun Metagenomic Data.” Frontiers in Plant Science 5 (June 16, 2014): 209. https://doi.org/10.3389/fpls.2014.00209.

• Hunter, Chris I, Alex Mitchell, Philip Jones, Craig McAnulla, Sebastien Pesseat, Maxim Scheremetjew, and Sarah Hunter. “Metagenomic Analysis: The Challenge of the Data Bonanza.” Briefings in Bioinformatics 13, no. 6 (November 1, 2012): 743–46. https://doi.org/10.1093/bib/bbs020.

• Teeling, Hanno, and Frank Oliver Glöckner. “Current Opportunities and Challenges in Microbial Metagenome Analysis–a Bioinformatic Perspective.” Briefings in Bioinformatics 13, no. 6 (December 1, 2012): 728–42. https://doi.org/10.1093/bib/bbs039.

• Mande, Sharmila S, Monzoorul Haque Mohammed, and Tarini Shankar Ghosh. “Classification of Metagenomic Sequences: Methods and Challenges.” Briefings in Bioinformatics 13, no. 6 (November 1, 2012): 669–81. https://doi.org/10.1093/bib/bbs054.

## Motifs

• Bailey, T. L., and C. Elkan. “Fitting a Mixture Model by Expectation Maximization to Discover Motifs in Biopolymers.” Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, 1994, 28–36.

• Eskin, E, M S Gelfand, and P Pevzner. “Genome Wide Analysis of Bacterial Promoter Regions.” Pacific Symposium on Biocomputing 2003: Kauai, Hawaii, 3-7 January 2003, 2002, 29.

## Others

• Searls, David B. “The Computational Linguistics of Biological Sequences.” In Artificial Intelligence and Molecular Biology, 47–121, 2002.

• Subramanian, A, P Tamayo, V K Mootha, S Mukherjee, B L Ebert, M A Gillette, A Paulovich, et al. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences of the United States of America 102, no. 43 (October 25, 2005): 15545–50.

• Reshef, D. N., Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti. “Detecting Novel Associations in Large Data Sets.” Science 334, no. 6062 (2011): 1518–24. https://doi.org/10.1126/science.1205438.

• Yates, Andrew, Kathryn Beal, Stephen Keenan, William McLaren, Miguel Pignatelli, Graham R.S. Ritchie, Magali Ruffier, Kieron Taylor, Alessandro Vullo, and Paul Flicek. “The Ensembl REST API: Ensembl Data for Any Language.” Bioinformatics 31, no. 1 (2015): 143–45. https://doi.org/10.1093/bioinformatics/btu613.

• Zerbino, Daniel R., Premanand Achuthan, Wasiu Akanni, M. Ridwan Amode, Daniel Barrell, Jyothish Bhai, Konstantinos Billis, et al. “Ensembl 2018.” Nucleic Acids Research 46, no. D1 (2018): D754–61. https://doi.org/10.1093/nar/gkx1098.