Reducing Protein Index Size for Similarity Search

Sample size: 12700507; publication; 10 minutes; Evidence: high

Author Information

Author(s): Peterlongo Pierre, Noé Laurent, Lavenier Dominique, Nguyen Van Hoa, Kucherov Gregory, Giraud Mathieu

Primary Institution: IRISA INRIA, Rennes, France

Hypothesis

Can reduced alphabets allow one to decrease the factor αL while preserving the quality of similarity search results?
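
As a rough illustration of why a smaller alphabet might lower this factor, the sketch below assumes that α is the number of bits used to store one residue and L is the neighborhood length, so that each stored neighborhood costs αL bits. This reading of αL, the alphabet sizes, and the value of L are assumptions made for illustration, not values taken from the publication.

```python
def bits_per_residue(alphabet_size: int) -> int:
    """Smallest whole number of bits that can distinguish alphabet_size symbols."""
    return (alphabet_size - 1).bit_length()


def neighborhood_bits(alphabet_size: int, length: int) -> int:
    """Assumed storage cost alpha * L, in bits, of one neighborhood."""
    return bits_per_residue(alphabet_size) * length


L = 8  # hypothetical neighborhood length
for size in (20, 16, 8):  # 20 = full amino acid alphabet; 16 and 8 are hypothetical reductions
    print(size, bits_per_residue(size), neighborhood_bits(size, L))
```

Under these assumptions, a smaller alphabet needs fewer bits per residue, so the same neighborhood length costs fewer bits per indexed position.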

Conclusion

We propose a practical reduction of the neighborhood index size that does not negatively affect the performance of large-scale search in protein sequences.

Supporting Evidence

The proposed method reduces the index size by 35% without sacrificing the quality of results.
New substitution score matrices were developed for comparing amino acid groups from different alphabets.
The study provides a practical application of reduced alphabets to real biological data.
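
The idea of scoring a full-alphabet residue against a group from a reduced alphabet can be sketched as follows. The grouping, the toy pairwise scores, and the unweighted averaging rule are all hypothetical placeholders chosen for illustration; the publication's actual matrices and the method used to derive them are not reproduced here.

```python
# Hypothetical grouping of a few amino acids (illustrative only).
GROUPS = {
    "hydrophobic": "ILVM",
    "aromatic": "FWY",
    "charged": "DEKR",
}
RESIDUES = "".join(GROUPS.values())


def toy_score(a: str, b: str) -> int:
    """Placeholder pairwise score: reward identity, penalize everything else."""
    return 4 if a == b else -1


def residue_vs_group(residue: str, members: str) -> float:
    """Assumed rule: average the pairwise score over the members of the group."""
    return sum(toy_score(residue, m) for m in members) / len(members)


# Rows are single residues, columns are groups, so the matrix is rectangular.
rect = {
    (residue, name): residue_vs_group(residue, members)
    for residue in RESIDUES
    for name, members in GROUPS.items()
}

print(len(RESIDUES), "rows x", len(GROUPS), "columns")
print(rect[("I", "hydrophobic")], rect[("I", "aromatic")])
```

A realistic matrix would presumably weight each group member by its background frequency rather than averaging uniformly; the uniform average is used here only to keep the sketch short.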

Takeaway

This study shows how to search for similar proteins using less memory, without losing speed or result quality, by grouping amino acids into smaller sets.

Methodology

The study compared protein sequences using a new neighborhood indexing method, based on a reduced amino acid alphabet, that shrinks the index while maintaining search quality.

Limitations

The method assumes ungapped alignments, which may not always reflect biological sequences accurately.
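
To make the ungapped assumption concrete, the sketch below scores two equal-length fragments position by position, with no insertions or deletions allowed. The function name, the match and mismatch values, and the example fragments are hypothetical and are not taken from the publication.

```python
def ungapped_score(query: str, subject: str, match: int = 1, mismatch: int = -1) -> int:
    """Sum per-position scores for two aligned fragments of equal length."""
    if len(query) != len(subject):
        raise ValueError("ungapped comparison needs fragments of equal length")
    return sum(match if q == s else mismatch for q, s in zip(query, subject))


# Three identical positions and one substitution.
print(ungapped_score("MKVL", "MKIL"))
```

Because positions are compared strictly one to one, a single inserted or deleted residue shifts every following position, which is the kind of case an ungapped model cannot represent.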

Digital Object Identifier (DOI)

10.1186/1471-2105-9-534
