DeepPI: alignment-free analysis of flexible length proteins based on image generator and deep learning

Samples

Input data

DeepPI converts protein sequences of flexible length into images and classifies them into protein classes with deep learning. It requires a protein database as input and supports FASTA-formatted files by default. The software automatically splits a large single file (e.g. Pfam-A.fasta) into separate groups, one per protein family.
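The per-family split described above can be illustrated with a short sketch. This is not DeepPI's own code: the function names `read_fasta`, `family_of`, and `group_by_family` are hypothetical, and the header layout (a last field such as `PF00001.24;7tm_1;`) is an assumption about how Pfam-A.fasta headers are formatted, so check it against the file you download.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may span several lines; collect them all.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def family_of(header: str) -> str:
    """Return the family accession from a header (assumed layout)."""
    fields = header.split()
    if not fields:
        return "unknown"
    # Assumption: the last field looks like "PF00001.24;7tm_1;".
    accession = fields[-1].split(";")[0]
    return accession.split(".")[0]  # drop the version suffix


def group_by_family(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Group FASTA records by their family accession."""
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for header, sequence in read_fasta(lines):
        groups[family_of(header)].append((header, sequence))
    return dict(groups)


# Made-up records for illustration only; not real Pfam entries.
example = [
    ">SEQ1/1-10 SEQ1.1 PF00001.24;7tm_1;",
    "MKTAYIAKQR",
    ">SEQ2/5-18 SEQ2.1 PF00002.27;7tm_2;",
    "GAVLIMCFYW",
    "HKRQ",
    ">SEQ3/1-8 SEQ3.1 PF00001.24;7tm_1;",
    "MSTNPKPQ",
]
families = group_by_family(example)
for accession, records in sorted(families.items()):
    print(accession, len(records))
```

For a real file, passing an open file handle (`group_by_family(open(path))`) works the same way, although a database the size of Pfam-A would more realistically be streamed to one output file per family than held in memory.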

Dataset Collection

The Pfam protein families database (Pfam)

Pfam is one of the most comprehensive databases of protein families. Each family is defined by multiple sequence alignments and a hidden Markov model (HMM) built from them. Pfam consists of two databases, Pfam-A and Pfam-B.

DeepPI uses the Pfam-A sequence dataset, version 35.0, from EMBL-EBI (http://ftp.ebi.ac.uk/pub/databases/Pfam/).

Download and use the Pfam-A datasets.

Example of running the entire process

cd DeepPI 
python main.py --mode <all> --input <Pfam-A.fasta>

Example of running the image generator

cd DeepPI 
python main.py --mode <image> --input <Pfam-A.fasta>

Sample train & test dataset

Your local hardware may not be sufficient to process the full protein database: you need enough disk space to store both the large protein database and the image database generated from it. For convenience, we provide sample training and test sets in .npy format.
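If memory is the limiting resource, one option is to memory-map a .npy file rather than load it whole. The sketch below uses NumPy's `np.load` with `mmap_mode="r"`. The file name, array shape, and the helper `iter_batches` are hypothetical stand-ins, since the layout of the sample files is not described here, and this is not necessarily how DeepPI reads its data.

```python
import os
import tempfile

import numpy as np

# Stand-in for a downloaded sample file; the name and shape are hypothetical.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "x_train_example.npy")
np.save(path, np.zeros((1000, 8, 8), dtype=np.uint8))

# mmap_mode="r" maps the file read-only instead of reading it all into RAM.
x_train = np.load(path, mmap_mode="r")


def iter_batches(array, batch_size):
    """Yield consecutive batches, copying only one batch into memory at a time."""
    for start in range(0, array.shape[0], batch_size):
        # np.array(...) makes an in-memory copy of just this slice.
        yield np.array(array[start:start + batch_size])


batch_shapes = [batch.shape for batch in iter_batches(x_train, 256)]
print(x_train.shape, len(batch_shapes))
```

Only the slices that are actually accessed get read from disk, so peak memory is bounded by the batch size rather than by the size of the dataset.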

Download and use the sample datasets.

Sample	Families	Sequences	Filtered proteins	Trainset	Testset
Top2-30000	30	2,777,419	2,071,406	1,553,565	517,841
Top2-20000	54	3,694,879	2,633,179	1,974,901	658,278
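The counts in the table can be cross-checked with a few lines of arithmetic. The sketch below confirms that the train and test sets add up to the filtered proteins and computes the train fraction. The roughly 75/25 ratio is inferred from the numbers; the split rule itself is not stated here.

```python
# Counts copied from the table above.
samples = {
    "Top2-30000": {"filtered": 2_071_406, "train": 1_553_565, "test": 517_841},
    "Top2-20000": {"filtered": 2_633_179, "train": 1_974_901, "test": 658_278},
}


def train_fraction(counts):
    """Return the share of filtered proteins that ended up in the train set."""
    # Every filtered protein should be in exactly one of the two sets.
    if counts["train"] + counts["test"] != counts["filtered"]:
        raise ValueError("train + test does not equal the filtered count")
    return counts["train"] / counts["filtered"]


fractions = {name: round(train_fraction(c), 3) for name, c in samples.items()}
for name, fraction in fractions.items():
    print(name, fraction)
```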

Example of running the deep learning model

cd DeepPI 
python main.py --mode <model> --dataset <dataset-directory> 
(e.g., python main.py --mode model --dataset ~/Sample/30000_2/)

After the run completes, you can inspect the models saved at each epoch. For instructions on downloading and setting up the local Linux environment, read the documentation.