Interpol: An R package for preprocessing of protein sequences
Author Information
Author(s): Dominik Heider, Daniel Hoffmann
Primary Institution: Department of Bioinformatics, Center for Medical Biotechnology, University of Duisburg-Essen
Hypothesis
The study aims to make more machine learning techniques applicable to biological sequences by developing a preprocessing approach that encodes protein sequences numerically and normalizes them to a uniform length.
Conclusion
Interpol widens the spectrum of machine learning methods that can be applied to biological sequences, which can improve the performance of those methods in classification and regression.
Supporting Evidence
- Interpol encodes amino acid sequences as numerical descriptor vectors using a database of 532 descriptors.
- The software normalizes sequences to uniform length with five interpolation algorithms.
- Interpol is distributed as an open-source R package and is available on CRAN.
Takeaway
The Interpol package helps scientists prepare protein sequences for analysis by making them all the same length, which makes it easier to use computer programs to understand them.
Methodology
The study developed an R package called Interpol that encodes amino acid sequences using numerical descriptors and normalizes them to a fixed length using interpolation methods.
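The two steps described above (descriptor encoding, then length normalization) can be sketched as follows. This is a minimal Python illustration, not the package's R code: the function names `encode` and `interpolate_linear` are hypothetical, the Kyte-Doolittle hydropathy scale is chosen only as an example descriptor, and plain linear interpolation stands in for whichever of the five algorithms is used in practice.

```python
# Kyte-Doolittle hydropathy values, used here as one example descriptor.
# Assumption: any per-residue numerical scale could be substituted.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, descriptor=HYDROPATHY):
    """Map each residue of an amino acid sequence to its descriptor value."""
    return [descriptor[aa] for aa in sequence.upper()]


def interpolate_linear(values, target_length):
    """Resample a numeric profile to target_length points by linear interpolation."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    if target_length < 2:
        raise ValueError("target_length must be at least 2")
    if n == 1:
        # A single value can only be repeated.
        return [values[0]] * target_length

    # Spread target_length sample positions evenly over [0, n - 1].
    step = (n - 1) / (target_length - 1)
    result = []
    for i in range(target_length):
        pos = i * step
        left = min(int(pos), n - 2)   # index of the left neighbour
        frac = pos - left             # fractional distance to the right neighbour
        result.append(values[left] * (1 - frac) + values[left + 1] * frac)
    return result


# Two sequences of different length end up as vectors of the same length.
short_profile = interpolate_linear(encode("ACD"), 5)
long_profile = interpolate_linear(encode("MKTAYIAKQR"), 5)
```

After this step every sequence is represented by a vector of equal length, which is the input format most classifiers and regression models expect.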
Potential Biases
The normalization process may lose important information related to sequence length, which can be critical for certain classifications.
Limitations
Normalizing sequences to less than 50% of their original length may discard information, and stretching short sequences can increase the risk of overfitting.
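A toy example makes the compression limitation concrete. The sketch below is an illustration only: the helper `compress_by_averaging` is hypothetical, and block averaging is used because it is easy to follow, not because it is claimed to match any of the package's algorithms. An alternating profile of length 8 is reduced to 2 points (25% of the original length), and the alternation disappears entirely.

```python
def compress_by_averaging(values, target_length):
    """Shorten a profile by averaging consecutive, roughly equal-sized blocks."""
    n = len(values)
    if not 1 <= target_length <= n:
        raise ValueError("target_length must be between 1 and len(values)")
    result = []
    for i in range(target_length):
        # Integer block boundaries; every block holds at least one value.
        start = i * n // target_length
        end = (i + 1) * n // target_length
        block = values[start:end]
        result.append(sum(block) / len(block))
    return result


# A profile that alternates between +1 and -1 at every position.
alternating = [1.0, -1.0] * 4

# Compressing 8 points to 2 averages the alternation away.
compressed = compress_by_averaging(alternating, 2)
print(compressed)  # [0.0, 0.0]
```

A constant profile of zeros would compress to the same result, so a downstream model could no longer tell the two apart. The opposite case, stretching a short sequence, adds interpolated values that carry no new measurements, so the model receives more features without more information.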