Reducing Protein Index Size for Similarity Search
Author Information
Author(s): Peterlongo Pierre, Noé Laurent, Lavenier Dominique, Nguyen Van Hoa, Kucherov Gregory, Giraud Mathieu
Primary Institution: IRISA INRIA, Rennes, France
Hypothesis
Can reduced alphabets allow one to decrease the factor αL while preserving the quality of similarity search results?
Conclusion
We propose a practical way to reduce the index size of the neighborhood data that does not negatively affect the performance of large-scale search in protein sequences.
Supporting Evidence
- The proposed method reduces the index size by 35% without sacrificing the quality of results.
- New substitution score matrices were developed for comparing amino acid groups from different alphabets.
- The study provides a practical application of reduced alphabets to real biological data.
Takeaway
This study shows that grouping amino acids into smaller sets lets a search for similar proteins use less memory without lowering the quality of the results.
Methodology
The study compared protein sequences using a new indexing method in which neighborhood data is encoded with reduced amino acid alphabets, shrinking the index while maintaining search quality.
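As a rough illustration of why a reduced alphabet shrinks stored data, the sketch below maps the 20 amino acids onto 7 groups and compares the bits needed per residue. The grouping in GROUPS and the helper names are assumptions made for this example; they are not the alphabets, the index layout, or the score matrices used in the study, and the saving computed here is unrelated to the 35% figure reported above.

```python
import math

# Hypothetical grouping of the 20 standard amino acids into 7 classes by
# rough physico-chemical similarity. This is NOT the paper's alphabet.
GROUPS = ["AGST", "C", "DENQ", "FWY", "HKR", "ILMV", "P"]


def build_encoder(groups):
    """Map each amino acid letter to the index of the group containing it."""
    return {aa: idx for idx, group in enumerate(groups) for aa in group}


def encode(seq, encoder):
    """Rewrite a protein sequence as a list of group indices."""
    return [encoder[aa] for aa in seq]


def bits_per_residue(alphabet_size):
    """Smallest whole number of bits that can distinguish alphabet_size symbols."""
    return math.ceil(math.log2(alphabet_size))


ENCODER = build_encoder(GROUPS)

if __name__ == "__main__":
    full_bits = bits_per_residue(20)              # full amino acid alphabet
    reduced_bits = bits_per_residue(len(GROUPS))  # reduced alphabet
    print("bits per residue, full alphabet:   ", full_bits)
    print("bits per residue, reduced alphabet:", reduced_bits)
    # Two different fragments that fall into the same groups encode identically,
    # which is the information given up in exchange for the smaller encoding.
    print(encode("MKV", ENCODER), encode("LRI", ENCODER))
```

The trade-off is visible in the last line: residues in the same group become indistinguishable, which is why comparing reduced sequences calls for score matrices defined over groups rather than over single amino acids.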
Limitations
The method assumes ungapped alignments, which cannot represent the insertions and deletions that often separate related biological sequences.