Max-Planck-Institut für Informatik

Protein Structure Prediction

Our fully automated protein structure prediction server Arby combines the results of several fold recognition methods to find suitable templates in a database of structural representatives of protein domains.

The method starts by constructing a set of subsequences from the query sequence, each subsequence representing a hypothesis for a possible protein domain. This is done by scanning against the InterPro database and using hits as domain hypotheses [1]. Additional hypotheses are constructed using a secondary structure prediction from PSIPRED [2]. Segments of predicted loops are used as potential domain boundaries. Finally, the set of subsequences is reduced to a reasonable size by removing subsequences that are highly similar or short.
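The loop-based part of this step can be illustrated with a minimal sketch. It is not the server's implementation: the three-state string encoding (H, E, C), the use of loop midpoints as cut points, positional overlap as a stand-in for "highly similar", and all thresholds and function names are assumptions made for illustration.

```python
from itertools import combinations


def loop_segments(ss, min_loop=3):
    """Return half-open (start, end) ranges of loop runs ('C') with >= min_loop residues."""
    segments, start = [], None
    for i, state in enumerate(ss + "#"):  # the sentinel closes a trailing run
        if state == "C" and start is None:
            start = i
        elif state != "C" and start is not None:
            if i - start >= min_loop:
                segments.append((start, i))
            start = None
    return segments


def jaccard(r1, r2):
    """Positional overlap of two half-open ranges (intersection over union)."""
    inter = max(0, min(r1[1], r2[1]) - max(r1[0], r2[0]))
    union = (r1[1] - r1[0]) + (r2[1] - r2[0]) - inter
    return inter / union


def domain_hypotheses(ss, min_loop=3, min_len=30, max_overlap=0.9):
    """Enumerate subsequences between candidate boundaries, then prune the set."""
    # Candidate boundaries: both sequence ends plus the midpoint of each long loop.
    cuts = [0] + [(s + e) // 2 for s, e in loop_segments(ss, min_loop)] + [len(ss)]
    cuts = sorted(set(cuts))
    # Drop short subsequences right away.
    candidates = [(a, b) for a, b in combinations(cuts, 2) if b - a >= min_len]
    # Longest first; skip a candidate that nearly coincides with one already kept.
    kept = []
    for a, b in sorted(candidates, key=lambda r: (r[0] - r[1], r[0])):
        if all(jaccard((a, b), k) < max_overlap for k in kept):
            kept.append((a, b))
    return sorted(kept)
```

For a predicted string with a single long loop between two structured regions, the sketch yields the full sequence and the two halves on either side of the loop as hypotheses.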

For each subsequence a multiple alignment is constructed by searching the NR database using PSI-BLAST [3]. A frequency profile is calculated from this multiple alignment using a slightly modified version of the Henikoff-Henikoff sequence-weighting algorithm [4].
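As an illustration, the sketch below implements the standard position-based weighting of [4] and derives a weighted frequency profile from it. It does not reproduce the slight modification mentioned above. The function names, the plain list-of-strings alignment format, and the treatment of gaps (counted as an ordinary symbol when weighting, skipped when accumulating frequencies) are assumptions.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based sequence weights for equal-length aligned sequences.

    In every column a total weight of 1 is shared equally among the k distinct
    symbols, and each symbol's share is split equally among the sequences
    carrying it. The weights are normalized to sum to 1.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)
        for idx, residue in enumerate(column):
            weights[idx] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


def frequency_profile(alignment, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Return one dict of weighted residue frequencies per alignment column."""
    weights = henikoff_weights(alignment)
    profile = []
    for column in zip(*alignment):
        freqs = dict.fromkeys(alphabet, 0.0)
        for w, residue in zip(weights, column):
            if residue in freqs:  # gap characters are skipped (assumption)
                freqs[residue] += w
        mass = sum(freqs.values())
        profile.append({a: (f / mass if mass else 0.0) for a, f in freqs.items()})
    return profile
```

The weighting reduces the influence of groups of near-identical sequences, so that a divergent homolog contributes more to the profile than each member of a redundant cluster.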

Each of the potential domains is then subjected to four different fold recognition methods. Each method searches for an optimal structure in our template database. The template database is a representative subset of the SCOP domains with pairwise sequence identity lower than 40% [5, 6]. For each of these template domains, a frequency profile was constructed as described above for the targets. The first fold recognition method is PSI-BLAST, which is used to search through our set of template domains (augmented by the NR sequence database). The second one is the 123D threading program. It uses frequency profiles on the target side and 3D structural information on the template side [7, 8]. The third one is the JProp profile-profile alignment method recently developed in our group [9, 10]. It compares frequency profiles on the target side with profiles on the template side using the log average scoring approach. The fourth method is again the JProp profile-profile alignment program, but in this version it makes use of additional secondary structure information on the target and template side (publication in preparation).
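The column score behind log average scoring can be sketched as follows, assuming the score of two profile columns is the logarithm of the substitution odds ratio averaged over both column distributions. The two-letter alphabet and all probabilities are toy values chosen for illustration and are not parameters of JProp.

```python
import math


def log_average_score(alpha, beta, odds):
    """Score two profile columns: log of sum_i sum_j alpha[i] * beta[j] * odds[i][j].

    alpha and beta map residues to frequencies; odds[i][j] is the ratio
    p_rel(i, j) / (p(i) * p(j)) underlying a log-odds substitution matrix.
    """
    expected = sum(alpha[i] * beta[j] * odds[i][j] for i in alpha for j in beta)
    return math.log(expected)


# Toy two-letter alphabet (illustrative values only).
BACKGROUND = {"H": 0.5, "P": 0.5}
PAIR = {("H", "H"): 0.4, ("H", "P"): 0.1, ("P", "H"): 0.1, ("P", "P"): 0.4}
ODDS = {
    i: {j: PAIR[i, j] / (BACKGROUND[i] * BACKGROUND[j]) for j in BACKGROUND}
    for i in BACKGROUND
}
```

When both columns contain a single residue with frequency 1, the score reduces to the ordinary log-odds substitution score of that residue pair, so sequence-sequence alignment is a special case of this scheme.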

The quality of each of these search results is assessed using confidence measures. For PSI-BLAST, these are readily available [11]; for the other methods, they were developed in a recent study [12].

The target sequence is then annotated with all the produced quadruplets (subsequence, fold recognition method, search result, confidence value). Finally, we select a set of non-overlapping annotations along the sequence, by performing combinatorial optimization of a heuristic score based on the confidence values. For each of these selected annotations, a separate protein domain is predicted. The structure of this domain prediction is computed by aligning the subsequence against the template structure using JProp.
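One simple way to pick non-overlapping annotations is weighted interval scheduling, sketched below. This is a swapped-in textbook technique rather than the optimization used by the server: the heuristic score is assumed here to be the plain sum of confidence values, and the tuple layout (start, end, confidence, label) and the function name are hypothetical.

```python
from bisect import bisect_right


def select_annotations(annotations):
    """Choose non-overlapping annotations with the maximum summed confidence.

    annotations: list of (start, end, confidence, label) with half-open ranges.
    Returns (best_total, chosen annotations ordered along the sequence).
    """
    items = sorted(annotations, key=lambda a: a[1])  # order by end position
    ends = [a[1] for a in items]
    best = [0.0] * (len(items) + 1)  # best[i]: optimum over the first i items
    take = [False] * len(items)
    prev = [0] * len(items)
    for i, (start, _end, conf, _label) in enumerate(items):
        # Number of earlier items that end at or before this item's start.
        prev[i] = bisect_right(ends, start, 0, i)
        with_item = best[prev[i]] + conf
        if with_item > best[i]:
            best[i + 1], take[i] = with_item, True
        else:
            best[i + 1] = best[i]
    # Walk back through the table to recover the chosen annotations.
    chosen, i = [], len(items)
    while i > 0:
        if take[i - 1]:
            chosen.append(items[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return best[-1], chosen[::-1]
```

With this formulation, a single confident hit covering a long region competes against a chain of shorter hits covering the same region, and whichever has the higher total score is selected.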

The underlying machinery is a Java-based data flow engine designed for stability. Since it is general and independent of any specific pipeline (such as the one described above), it can be used as infrastructure for other projects as well. We developed a component framework in which all algorithms and programs are encapsulated in small Java classes. Each of these components specifies an algorithm to be executed along with its input parameters, the output that it produces, and possible error conditions. The accompanying engine provides a number of features for the components. First, it resolves the input/output dependencies of the components: as soon as all inputs for a specific algorithm have been determined, the algorithm is scheduled for execution. Second, the components are executed in parallel on any number of CPUs, in our case 10 CPUs of a SunFire 4800 server. Third, a frequent problem in fully automated systems is reliable error handling; we address it by catching potential error conditions and adaptively pruning the data flow tree. Finally, persistence of the computed results is accomplished by using a relational database, which offers convenient and fast access to previously computed results for identical input parameters.
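The engine's behavior can be illustrated with a small sketch, written in Python for brevity although the engine itself is implemented in Java. All names (Component, run_pipeline) are hypothetical, a thread pool stands in for the multi-CPU scheduler, and an in-memory dict stands in for the relational database, so input values must be hashable here.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


class Component:
    """A named algorithm together with the names of the inputs it consumes."""

    def __init__(self, name, inputs, func):
        self.name, self.inputs, self.func = name, tuple(inputs), func


def run_pipeline(components, initial, cache=None, workers=4):
    """Run components as their inputs become available; return (results, pruned)."""
    cache = {} if cache is None else cache  # stand-in for the result database
    results = dict(initial)
    pending = {c.name: c for c in components}
    pruned, running = set(), {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending or running:
            progressed = False
            for name, comp in list(pending.items()):
                if any(i in pruned for i in comp.inputs):
                    pruned.add(name)  # an upstream failure prunes this branch
                    del pending[name]
                    progressed = True
                elif all(i in results for i in comp.inputs):
                    args = tuple(results[i] for i in comp.inputs)
                    del pending[name]
                    progressed = True
                    key = (name, args)
                    if key in cache:  # reuse a result stored for identical inputs
                        results[name] = cache[key]
                    else:
                        running[pool.submit(comp.func, *args)] = (name, key)
            if progressed:
                continue  # cached or pruned outputs may unlock further components
            if not running:
                pruned.update(pending)  # inputs that can never be produced
                break
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name, key = running.pop(future)
                if future.exception() is not None:
                    pruned.add(name)  # catch the error instead of aborting the run
                else:
                    results[name] = cache[key] = future.result()
    return results, pruned
```

A failing component removes only the branch that depends on it, while independent branches run to completion. Passing the same cache to a second run returns stored results without executing the corresponding components again.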

The power of the structure prediction server is based on the use of modern profile-profile algorithms for fold recognition, the quality assessment using confidence measures, and the stable and powerful Java data flow engine. In future work, we will use the latter technology as a basis for our bioinformatics computing environment.

We thank Daniel Hanisch for his contributions to the Java implementation. Part of this research has been supported by BMBF grant no. 01 SF 9984/3 (Helmholtz Network for Bioinformatics).


  1. Apweiler, R., et al. (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29 (1), 37-40.
  2. Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 292 (2), 195-202.
  3. Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-402.
  4. Henikoff, S. and Henikoff, J.G. (1994) Position-based sequence weights. J Mol Biol. 243 (4), 574-8.
  5. Chandonia, J.M., et al. (2002) ASTRAL compendium enhancements. Nucleic Acids Res. 30 (1), 260-3.
  6. Brenner, S.E., Koehl, P., and Levitt, M. (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res. 28 (1), 254-6.
  7. Zien, A., Zimmer, R., and Lengauer, T. (2000) A simple iterative approach to parameter optimization. J Comput Biol. 7 (3-4), 483-501.
  8. Alexandrov, N.N., Nussinov, R., and Zimmer, R. (1996) Fast protein fold recognition via sequence to structure alignment and contact capacity potentials. Pac Symp Biocomput, 53-72.
  9. Von Öhsen, N., Sommer, I., and Zimmer, R. (2003) Profile-profile alignment: a powerful tool for protein structure prediction. Pac Symp Biocomput.
  10. Von Öhsen, N. and Zimmer, R. (2001) Improving profile-profile alignment via log average scoring. Lecture Notes in Computer Science. 2149, 11-26.
  11. Karlin, S. and Altschul, S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A. 87 (6), 2264-8.
  12. Sommer, I., et al. (2002) Confidence measures for protein fold recognition. Bioinformatics. 18 (6), 802-12.