GTG: Organisation of protein sequence space for efficient database searching
Principal Investigator(s): Liisa Holm
Protein sequences can be grouped into families based on sequence similarity. Members of a family share important biological properties, so identifying family membership is used to infer complex functions. For families where sequence similarity is strong, well-established database searching techniques may be sufficient. Diverse families, however, contain sequences with low similarity, and more sophisticated searching techniques are required.
A new paradigm for protein sequence database searching has been introduced. It is based on preprocessing the sequence data so that high-order transitive searches proceed in the most promising directions. The resulting database searches are faster than, and superior to, existing sequence-structure threading methods at recognising distantly related sequences, and they are important to many bioinformatics applications, including functional annotation and template identification.
The key structure is an underlying graph that links residues from all known proteins. Conceptually, it represents the columns of a hypothetical multiple sequence alignment of all proteins, with edges representing alignment scores. In the most recent calculation, the graph contained almost 40 billion edges and took almost a month to compute on an 8-processor machine. However, the size of the sequence databases is expected to quadruple in the next few months as new shotgun sequencing techniques are applied to multiple genomes, making the next update intractable on our existing architecture.
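To make the idea of a residue graph with transitive searches more concrete, the following is a minimal illustrative sketch in Python, not the GTG implementation. It assumes that a residue is identified by a (protein, position) pair, that edges are undirected and carry a single alignment score, and that a multi-step path is scored by its weakest edge. The names ResidueGraph, transitive_neighbours, max_hops and min_score are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Assumption: a residue is identified by (protein identifier, position).
Residue = tuple[str, int]


class ResidueGraph:
    """Toy undirected graph: vertices are residues, edge weights are alignment scores."""

    def __init__(self) -> None:
        self.edges: dict[Residue, dict[Residue, float]] = defaultdict(dict)

    def add_edge(self, a: Residue, b: Residue, score: float) -> None:
        # Store the edge in both directions so lookups are symmetric.
        self.edges[a][b] = score
        self.edges[b][a] = score

    def transitive_neighbours(
        self, start: Residue, max_hops: int, min_score: float
    ) -> dict[Residue, float]:
        """Return residues reachable from `start` in at most `max_hops` steps.

        Assumption: a path is scored by its weakest edge, and paths whose
        score falls below `min_score` are not extended any further.
        """
        best: dict[Residue, float] = {start: float("inf")}
        frontier: dict[Residue, float] = {start: float("inf")}
        for _ in range(max_hops):
            next_frontier: dict[Residue, float] = {}
            for residue, path_score in frontier.items():
                for neighbour, edge_score in self.edges.get(residue, {}).items():
                    score = min(path_score, edge_score)
                    if score < min_score:
                        continue  # prune unpromising directions
                    if score > best.get(neighbour, float("-inf")):
                        best[neighbour] = score
                        next_frontier[neighbour] = score
            frontier = next_frontier
        del best[start]  # the query residue is not its own neighbour
        return best


# Hypothetical example: A and C are not linked directly, only through B.
graph = ResidueGraph()
graph.add_edge(("A", 10), ("B", 12), 50.0)
graph.add_edge(("B", 12), ("C", 7), 30.0)
graph.add_edge(("C", 7), ("D", 3), 5.0)

near = graph.transitive_neighbours(("A", 10), max_hops=2, min_score=10.0)
print(near)
```

In this toy example, residue 7 of protein C is found from residue 10 of protein A only through the intermediate protein B, which is the kind of indirect link a transitive search is meant to recover. Residue 3 of protein D is left out because it lies three steps away and its connecting edge scores below the threshold.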
The goal of this project is to update the GTG data structure, which drives bioinformatics applications that we continue to develop in-house. Porting the code to a supercomputer architecture and parallelising it would be performed with the assistance of CSC, the home organisation for DEISA in Finland.