Samples
Input data
DeepPI converts a database of variable-length protein sequences into images and classifies them by protein class using deep learning. It requires a protein database as input and accepts FASTA-formatted files by default. Given a single large file (e.g. Pfam-A.fasta), the software automatically splits it and groups the sequences by protein family.
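The per-family grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not DeepPI's actual implementation: the helper names `read_fasta`, `family_of`, and `group_by_family` are hypothetical, and the header layout (`>name/start-end accession PFxxxxx.version;family_id;`) is an assumption about Pfam-A.fasta that should be checked against the downloaded file.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def family_of(header: str) -> str:
    """Extract a Pfam accession such as 'PF00001' from a header.

    Assumption: the third whitespace-separated field looks like
    'PF00001.1;family_name;'. Headers without it map to 'unknown'.
    """
    fields = header.split()
    if len(fields) < 3:
        return "unknown"
    return fields[2].split(";")[0].split(".")[0]


def group_by_family(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group sequences by the family accession found in their headers."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for header, sequence in read_fasta(lines):
        groups[family_of(header)].append(sequence)
    return dict(groups)


# Made-up records for illustration only; they are not real Pfam entries.
example = [
    ">A0A000_TEST/1-5 A0A000.1 PF00001.1;fam_one;",
    "MKV",
    "LA",
    ">B0B000_TEST/2-6 B0B000.1 PF00002.3;fam_two;",
    "GGSTP",
]
for family, sequences in group_by_family(example).items():
    print(family, len(sequences))
```

Because `group_by_family` accepts any iterable of lines, an open file handle can be passed directly, although this sketch keeps every sequence in memory and would need a streaming variant for a file as large as Pfam-A.fasta.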
Datasets Collection
The Pfam protein families database (Pfam)
Pfam is one of the most comprehensive databases of protein families. It uses hidden Markov models to group multiple sequence alignments into families and consists of two databases, Pfam-A and Pfam-B.
DeepPI uses version 35.0 of the Pfam-A sequence dataset from EMBL-EBI (http://ftp.ebi.ac.uk/pub/databases/Pfam/).
Download and use the Pfam-A datasets.
- Example of running the entire process
cd DeepPI
python main.py --mode <all> --input <Pfam-A.fasta>
- Example of running the image generator
cd DeepPI
python main.py --mode <image> --input <Pfam-A.fasta>
Sample train & test dataset
The user's local hardware may not be sufficient to process the full protein database: there must be enough disk space to store both the large protein database and the generated image database. For convenience, we provide sample training and test sets in .npy format.
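Before downloading the full database, it may help to check the free disk space. The sketch below uses only the Python standard library. The helper name `has_enough_space` and the `REQUIRED_GIB` threshold are hypothetical placeholders, since the actual size of the sequence and image databases is not stated here.

```python
import shutil


def has_enough_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


# Placeholder threshold: replace it with the real size of the databases.
REQUIRED_GIB = 100

if not has_enough_space(".", REQUIRED_GIB * 1024**3):
    print(f"Less than {REQUIRED_GIB} GiB free in the current directory.")
```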
Download and use the sample datasets.
| Sample | Families | Sequences | Filtered proteins | Trainset | Testset |
|---|---|---|---|---|---|
| Top2-30000 | 30 | 2,777,419 | 2,071,406 | 1,553,565 | 517,841 |
| Top2-20000 | 54 | 3,694,879 | 2,633,179 | 1,974,901 | 658,278 |
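As a quick consistency check on the table, the training and test sets of each sample add up to the filtered protein count, and the split is roughly three to one. The short script below reproduces that arithmetic from the figures above. The names `samples` and `train_fraction` are illustrative only.

```python
# Figures copied from the sample dataset table.
samples = {
    "Top2-30000": {"filtered": 2_071_406, "train": 1_553_565, "test": 517_841},
    "Top2-20000": {"filtered": 2_633_179, "train": 1_974_901, "test": 658_278},
}


def train_fraction(row: dict) -> float:
    """Fraction of the filtered proteins that ended up in the training set."""
    return row["train"] / row["filtered"]


for name, row in samples.items():
    # Every filtered protein is in exactly one of the two sets.
    assert row["train"] + row["test"] == row["filtered"]
    print(f"{name}: train fraction = {train_fraction(row):.4f}")
```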
- Example of running the deep learning model
cd DeepPI
python main.py --mode <model> --dataset <dataset-directory>
(e.g., python main.py --mode model --dataset ~/Sample/30000_2/)
After the algorithm finishes, users can check the models saved at each epoch. For instructions on downloading and setting up the local Linux environment, read the documentation.