Interpol: An R package for preprocessing of protein sequences
2011

Sample size: 1351 publication

Evidence: moderate

Author Information

Author(s): Dominik Heider, Daniel Hoffmann

Primary Institution: Department of Bioinformatics, Center for Medical Biotechnology, University of Duisburg-Essen

Hypothesis

The study aims to broaden the range of machine learning techniques that can be applied to biological sequences by developing a preprocessing approach that encodes protein sequences numerically and normalizes them to a uniform length.

Conclusion

The functionality of Interpol widens the spectrum of machine learning methods that can be applied to biological sequences, improving their performance in classification and regression.

Supporting Evidence

  • Interpol encodes amino acid sequences as numerical descriptor vectors using a database of 532 descriptors.
  • The software normalizes sequences to uniform length with five interpolation algorithms.
  • Interpol is distributed as an open-source R package and is available on CRAN.

Takeaway

The Interpol package helps scientists prepare protein sequences for analysis by turning them into numbers and making them all the same length, which makes it easier to apply machine learning methods to them.

Methodology

The study developed an R package called Interpol that encodes amino acid sequences using numerical descriptors and normalizes them to a fixed length using interpolation methods.
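
The two steps described above (descriptor encoding, then length normalization) can be illustrated with a minimal sketch. It is written in plain Python rather than R and does not use the Interpol API. The Kyte-Doolittle hydropathy scale stands in for one descriptor, and linear interpolation stands in for one of the five algorithms; both choices, and the function names `encode` and `interpolate_linear`, are assumptions made for illustration only.

```python
# Kyte-Doolittle hydropathy values, used here as a single example descriptor.
# Assumption: chosen for illustration; not necessarily the package's default.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map each amino acid letter to its numerical descriptor value."""
    return [HYDROPATHY[aa] for aa in sequence.upper()]


def interpolate_linear(values, target_len):
    """Resample `values` to `target_len` evenly spaced points (linear)."""
    if len(values) < 2 or target_len < 2:
        raise ValueError("need at least two input values and two output points")
    # Distance between output points, measured in input-index units.
    step = (len(values) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        # Clamp so that left + 1 is always a valid index.
        left = min(int(pos), len(values) - 2)
        frac = pos - left
        out.append(values[left] * (1 - frac) + values[left + 1] * frac)
    return out


# Two sequences of different length end up as vectors of the same length.
short = interpolate_linear(encode("ACDK"), 7)            # stretched from 4 to 7
long_ = interpolate_linear(encode("ACDEFGHIKLMNP"), 7)   # compressed from 13 to 7
```

After this step both vectors have seven entries, so they could be passed to any learner that expects fixed-length input.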

Potential Biases

The normalization process may lose important information related to sequence length, which can be critical for certain classifications.

Limitations

Normalizing sequences to less than 50% of their original length may lead to loss of information, and stretching short sequences can increase the risk of overfitting.
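
As a rough illustration of these two limitations, a pre-check could flag target lengths that fall into either regime before normalization is applied. The sketch below is plain Python; the helper name `length_warnings` is hypothetical and is not part of Interpol. Treating any stretching beyond the original length as a risk is a simplifying assumption.

```python
def length_warnings(original_len, target_len):
    """Return warnings for a planned normalization from original_len to target_len."""
    warnings = []
    # Compression below half the original length may discard information.
    if target_len < 0.5 * original_len:
        warnings.append("compressed below 50% of original length: possible information loss")
    # Stretching adds interpolated points that carry no new information.
    if target_len > original_len:
        warnings.append("stretched beyond original length: possible overfitting risk")
    return warnings


compressed = length_warnings(100, 40)   # below 50%: one warning
moderate = length_warnings(100, 60)     # within range: no warnings
stretched = length_warnings(20, 60)     # stretched: one warning
```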

Digital Object Identifier (DOI)

10.1186/1756-0500-4-94
