LECTURE NOTES
COURSE INFORMATION
OBJECTIVE
The goal of this course is to provide comprehensive coverage of the
principles of bioinformatics, including the algorithms behind
well-known applications. The use of signal processing tools will be
emphasized.
INSTRUCTOR
Prof. Yucel Altunbasak
Office: GCATT-370
Phone: 404 385 1341
E-mail: yucel@ece.gatech.edu
Office hours: MW (10:05-10:55) in BH-308
TIME & LOCATION
MWF, 11:05-11:55AM
E261 Van Leer-Elec Eng
TEXTBOOK
Required Text: Biological sequence analysis: Probabilistic
models of proteins and nucleic acids, R. Durbin, S. Eddy, A. Krogh, and
G. Mitchison
Text-2: Algorithms on strings, trees and sequences, Dan
Gusfield
Text-3: Computational molecular biology: An algorithmic
approach, Pavel A. Pevzner
Text-4: Bioinformatics: The machine learning approach (second
edition), Pierre Baldi and Søren Brunak
Text-5: Introduction to protein structure (second edition), Carl
Branden & John Tooze
HONOR CODE
Please uphold the academic honor code (see
http://www.gatech.edu/honor/). Violations will be reported to the
office of the Vice-President for Student Services.
GRADING
| | Track #1 | Track #2 |
| --- | --- | --- |
| Midterms | 40 | 60 |
| Finals | 30 | 30 |
| Homework | 20 | 0 |
| Pop-Quiz | 10 | 10 |
| Instructor | 10 | 10 |
After the first homework, you need to declare which of the two tracks
you will be taking.
HOMEWORK
There will be homework assignments approximately every two weeks. Only
selected homework questions will be graded.
POP QUIZZES
We will have pop quizzes from time to time. They will not be announced
beforehand, and they may be given at the beginning of a lecture.
EXPECTATIONS
You should be familiar with introductory probability and statistics
concepts. You are expected to have taken courses on linear algebra,
probability, random processes, and signal processing.
TENTATIVE OUTLINE
1. INTRODUCTION
2. PAIRWISE ALIGNMENT of BIOMOLECULAR SEQUENCES
2.1. The scoring model
2.2. Global alignment of two sequences: Needleman-Wunsch algorithm
2.3. Local alignment: Smith-Waterman algorithm
2.4. Significance of scores
2.5. Deriving score parameters from alignment data: PAM matrices,
BLOCKS database
3. PROBABILISTIC MODELS of DNA SEQUENCES
3.1. Multinomial models
3.2. Models for protein-coding and non-coding regions
3.3. Bayesian inference
3.4. Estimation of model parameters
3.5. Supervised and unsupervised classification
4. MARKOV CHAINS and HIDDEN MARKOV MODELS
4.1. Homogeneous and inhomogeneous Markov models
4.2. Hidden Markov models
4.3. Parameter estimation for hidden Markov models
4.4. HMM model structure
4.5. Numerical stability of HMM algorithms
4.6. Pattern recognition with hidden Markov models: CpG islands in
genomic DNA
5. STATISTICAL MODELS of PROTEIN DOMAINS
5.1. Profile HMMs for sequence families
5.2. Parameter estimation for profile HMM
5.3. Protein function prediction by profile HMM
5.4. PSI-BLAST, PFAM and SMART local similarity search methods
6. MULTIPLE SEQUENCE ALIGNMENT METHODS
6.1. What a multiple alignment means
6.2. Gibbs sampling algorithm for multiple sequence alignment
6.3. Progressive alignment methods
6.4. Multiple alignment by profile HMM training
7. PROTEIN SECONDARY STRUCTURE PREDICTION
7.1. Single sequence vs. multiple alignment methods
7.2. Bayesian segmentation of protein secondary structure (BSPSS method)
7.3. Profile based neural network approach (PHD method)
7.4. Information theory and Bayesian approach (GOR method)
7.5. Protein function prediction using secondary structure information
8. BUILDING PHYLOGENETIC TREES
8.1. Construction of the tree by using pairwise distances
8.2. UPGMA and neighbor-joining clustering algorithms
8.3. Parsimony
8.4. Probabilistic approaches to phylogeny
9. STUDY of GENE EXPRESSION with DNA MICROARRAYS
9.1. Detecting patterns in expression of multiple genes
9.2. Clustering methods for gene expression data: k-means clustering,
self-organizing maps