SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING
GEORGIA INSTITUTE OF TECHNOLOGY
ECE 8823A: BIOINFORMATICS AND BIO-SIGNAL PROCESSING
FALL 2004

 

LECTURE NOTES

Note1
Note2

COURSE INFORMATION

OBJECTIVE

The goal of this course is to provide comprehensive coverage of the principles of bioinformatics, including the main algorithms for well-known applications. The use of signal processing tools will be emphasized.

INSTRUCTOR
Prof. Yucel Altunbasak
Office: GCATT-370
Phone: 404 385 1341
E-mail: yucel@ece.gatech.edu
Office hours: MW (10:05-10:55) in BH-308

TIME & LOCATION
MWF, 11:05-11:55AM
E261 Van Leer-Elec Eng

TEXTBOOK
Required Text: Biological sequence analysis: Probabilistic models of proteins and nucleic acids, R. Durbin, S. Eddy, A. Krogh, and G. Mitchison
 

Text-2: Algorithms on strings, trees and sequences, Dan Gusfield

Text-3: Computational molecular biology: An algorithmic approach, Pavel A. Pevzner

Text-4: Bioinformatics: The machine learning approach (second edition), Pierre Baldi and Søren Brunak

Text-5: Introduction to protein structure (second edition), Carl Branden & John Tooze

HONOR CODE
Please uphold the academic honor code (see http://www.gatech.edu/honor/). Violations will be reported to the Office of the Vice President for Student Services.

GRADING

 

              Track #1   Track #2
Midterms         40         60
Finals           30         30
Homework         20          0
Pop-Quiz         10         10
Instructor       10         10


After the first homework, you need to declare which of the two tracks you will be taking.

HOMEWORK
There will be (approximately) bi-weekly homework assignments. Only selected homework questions will be graded.

POP QUIZZES
We will have pop quizzes from time to time. They will not be announced beforehand, and they may be given at the beginning of a lecture.

EXPECTATIONS
You should be familiar with introductory probability and statistics concepts. You are expected to have taken courses on linear algebra, probability, random processes, and signal processing.

TENTATIVE OUTLINE

1. INTRODUCTION

2. PAIRWISE ALIGNMENT of BIOMOLECULAR SEQUENCES
2.1. The scoring model
2.2. Global alignment of two sequences: Needleman-Wunsch algorithm
2.3. Local alignment: Smith-Waterman algorithm
2.4. Significance of scores
2.5. Deriving score parameters from alignment data: PAM matrices, BLOCKS database

3. PROBABILISTIC MODELS of DNA SEQUENCES
3.1. Multinomial models
3.2. Models for protein-coding and non-coding regions
3.3. Bayesian inference
3.4. Estimation of model parameters
3.5. Supervised and unsupervised classification

4. MARKOV CHAINS and HIDDEN MARKOV MODELS
4.1. Homogeneous and inhomogeneous Markov models
4.2. Hidden Markov models
4.3. Parameter estimation for hidden Markov models
4.4. HMM model structure
4.5. Numerical stability of HMM algorithms
4.6. Pattern recognition with hidden Markov models: CpG islands in genomic DNA

5. STATISTICAL MODELS of PROTEIN DOMAINS
5.1. Profile HMMs for sequence families
5.2. Parameter estimation for profile HMMs
5.3. Protein function prediction by profile HMM
5.4. PSI-BLAST, PFAM and SMART local similarity search methods

6. MULTIPLE SEQUENCE ALIGNMENT METHODS
6.1. What a multiple alignment means
6.2. Gibbs sampling algorithm for multiple sequence alignment
6.3. Progressive alignment methods
6.4. Multiple alignment by profile HMM training

7. PROTEIN SECONDARY STRUCTURE PREDICTION
7.1. Single sequence vs. multiple alignment methods
7.2. Bayesian segmentation of protein secondary structure (BSPSS method)
7.3. Profile based neural network approach (PHD method)
7.4. Information theory and Bayesian approach (GOR method)
7.5. Protein function prediction using secondary structure information

8. BUILDING PHYLOGENETIC TREES
8.1. Construction of the tree by using pairwise distances
8.2. UPGMA clustering and neighbor-joining clustering algorithms
8.3. Parsimony
8.4. Probabilistic approaches to phylogeny

9. STUDY of GENE EXPRESSION with DNA MICROARRAYS
9.1. Detecting patterns in expression of multiple genes
9.2. Clustering methods for gene expression data: k-means clustering, self-organizing maps