SECLAF is a framework for classifying protein or DNA sequences. It lets you design, train and test deep neural networks for biological sequence classification.
You can find examples of using SECLAF in the examples/ subfolder.
If you use SECLAF in your research, you are advised to cite the following publication:
B. Szalkai and V. Grolmusz: Near Perfect Protein Multi-Label Classification with Deep Neural Networks. https://arxiv.org/abs/1703.10663
SECLAF is released under the GNU General Public License version 3.0.
Copyright (c) 2016-2017 Balazs Szalkai

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
The names given above will be used in this tutorial to refer to the respective files, but you can give different names to these files when using SECLAF.
The tree file must contain a list of the possible sequence classes and the "is_a" relation (terminology borrowed from Gene Ontology). The "is_a" relation describes which classes are superclasses/subclasses of each other: if A is_a B, then A is a subclass (child, refinement) of B, while B is a parent (superclass) of A, and each sequence in A also belongs to B.
Example:
    // Tree file for classification hierarchy
    GO:1990413 GO:0043229 eyespot apparatus
    GO:1900309 GO:0006109 regulation of maltoheptaose metabolic process
    GO:1900308 GO:1900306&GO:1900296 positive regulation of maltoheptaose transport
    ...
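To make the "is_a" relation concrete, here is a minimal Python sketch that reads lines shaped like the example and collects every superclass of a class. The function names `parse_tree_lines` and `ancestors` are hypothetical and not part of SECLAF. The sketch also assumes that fields are separated by whitespace and that the second field is the &-separated parent list, which is inferred from the example rather than from a format specification.

```python
from typing import Dict, List, Set


def parse_tree_lines(lines: List[str]) -> Dict[str, List[str]]:
    """Map each class ID to its direct parents (hypothetical helper)."""
    parents: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, // comments and the "..." elision marker.
        if not line or line.startswith("//") or line == "...":
            continue
        # Assumed layout: class ID, parent list, free-text description.
        fields = line.split(None, 2)
        class_id = fields[0]
        parent_field = fields[1] if len(fields) > 1 else ""
        parents[class_id] = [p for p in parent_field.split("&") if p]
    return parents


def ancestors(class_id: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return all direct and indirect superclasses of class_id."""
    seen: Set[str] = set()
    stack = list(parents.get(class_id, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


example = [
    "// Tree file for classification hierarchy",
    "GO:1990413 GO:0043229 eyespot apparatus",
    "GO:1900308 GO:1900306&GO:1900296 positive regulation of maltoheptaose transport",
]
tree = parse_tree_lines(example)
print(sorted(ancestors("GO:1900308", tree)))
```

Because each sequence in a class also belongs to all of its superclasses, a sequence annotated with GO:1900308 implicitly belongs to everything `ancestors` returns for it.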
Each line can be one of these:
This file must contain the list of sequences on which the network should be trained, along with their annotations (assigned classes). Either all sequences must be DNA sequences, or all sequences must be protein sequences. It may be either uncompressed or gzipped (train_set.ann.gz). Two file formats are acceptable:
Example:
    // Annotation file (sequence, classes)
    MTNKNTSKDMHKNAPKGHNPGQPEPLSGSKKVKNRNHTRQKHNSSHDM GO:0030435
    MANSAQAKKRARQNEKRELHNASQRSAVRTAVKKILKSLQANDSSAAQSAYQHAVQILDKAAGRRIIHPNKAARLKSRLSQKIKNLSSSQ GO:0003735&GO:0005840&GO:0006412&GO:0019843
    MSQRSAVRTAVKKILKSLQANDSSAAQSAYQHAVQILDKAAGRRI -
    ...
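A short Python sketch of how lines like these could be read outside SECLAF, for instance to check a data set before training. The helper names are hypothetical. The whitespace separator and the use of "-" to mean "no classes" are assumptions read off the example, and the `.gz` check simply follows the file naming mentioned above.

```python
import gzip
from typing import IO, Iterable, List, Tuple


def open_annotation_file(path: str) -> IO[str]:
    """Open a plain or gzipped annotation file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_annotations(lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Return (sequence, classes) pairs; classes is empty for '-'."""
    records = []
    for raw in lines:
        fields = raw.split()
        # Skip blank lines, // comments and the "..." elision marker.
        if not fields or fields[0].startswith("//") or fields[0] == "...":
            continue
        sequence = fields[0]
        class_field = fields[1] if len(fields) > 1 else "-"
        classes = [] if class_field == "-" else class_field.split("&")
        records.append((sequence, classes))
    return records


sample = [
    "// Annotation file (sequence, classes)",
    "MTNKNTSKDMHKNAPKGHNPGQPEPLSGSKKVKNRNHTRQKHNSSHDM GO:0030435",
    "MSQRSAVRTAVKKILKSLQANDSSAAQSAYQHAVQILDKAAGRRI -",
]
records = read_annotations(sample)
for sequence, classes in records:
    print(len(sequence), classes)
```

With a real file, the same function can be fed a file object: `read_annotations(open_annotation_file("train_set.ann.gz"))`.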
Each line can be one of these:
The above example in FASTA format:
    >Sequence_1 classes=GO:0030435
    MTNKNTSKDMHKNAPKGHNPGQPEPLSGSKKVKNRNHTRQKHNSSHDM
    >Sequence_2 classes=GO:0003735&GO:0005840&GO:0006412&GO:0019843
    MANSAQAKKRARQNEKRELHNASQRSAVRTAVKKILKSLQANDSSA
    AQSAYQHAVQILDKAAGRRIIHPNKAARLKSRLSQKIKNLSSSQ
    >Sequence_3
    MSQRSAVRTAVKKILKSLQANDSSAAQSAYQHAVQILDKAAGRRI
    ...
For each sequence, its classification must also be defined. SECLAF searches for the string classes= in the header string, and the remaining part of the header line is interpreted as the class list. The classes should be separated by & (ampersand). If the sequence belongs to no classes, the classes= string should be omitted.
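The header rule described above can be sketched in Python as follows. This is an illustration of the rule, not SECLAF's own parser, and `read_fasta_with_classes` is a hypothetical name. It looks for `classes=` in each header, treats the rest of the header line as an &-separated class list, and joins sequence lines that are wrapped over several lines.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str, List[str]]  # (name, sequence, classes)


def read_fasta_with_classes(lines: Iterable[str]) -> List[Record]:
    marker = "classes="
    records: List[Record] = []
    name = None
    classes: List[str] = []
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line or line == "...":
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks), classes))
            header = line[1:]
            pos = header.find(marker)
            if pos == -1:
                # No classes= string: the sequence belongs to no classes.
                name, classes = header.strip(), []
            else:
                name = header[:pos].strip()
                rest = header[pos + len(marker):].strip()
                classes = [c for c in rest.split("&") if c]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks), classes))
    return records


fasta = [
    ">Sequence_1 classes=GO:0030435",
    "MTNKNTSKDMHKNAPKGHNPGQPEPLSGSKKVKNRNHTRQKHNSSHDM",
    ">Sequence_2 classes=GO:0003735&GO:0005840",
    "MANSAQAKKRARQNEKRELHNASQ",
    "RSAVRTAVKKILKSLQANDSSA",
    ">Sequence_3",
    "MSQRSAVRTAVKKILKSLQANDSSAAQSAYQHAVQILDKAAGRRI",
]
for record in read_fasta_with_classes(fasta):
    print(record[0], len(record[1]), record[2])
```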
This file contains the sequences of the test set. It must have the same format as train_set.ann.
This file contains the network description and various parameters.
These are the possible keys for config.json. Configuration parameters without a default value are required; the rest are optional:
To train your model, run seclaf_train.py, for example with the following command:
python seclaf_train.py config.json
To use your model to infer the classes of some sequences, use the following command:
python seclaf_infer.py config.json input_seq_file [output_file]
Here input_seq_file can be a text file containing one sequence per line, or a FASTA file. output_file is a filename for the output (sequences + inferred classes). If omitted, the classification obtained with the model described in config.json will be printed on the console (standard output).
The network must be described in config.json, under the key 'network'.
Example:
    "network": [
        {"layer":"conv", "k":6, "d":128, "act":"prelu"},
        {"layer":"bn", "scope":"seq_batch_norm_0"},
        {"layer":"conv", "k":6, "d":128, "act":"prelu"},
        {"layer":"maxpool", "k":2},
        {"layer":"bn", "scope":"seq_batch_norm_1"},
        {"layer":"conv", "k":5, "d":256, "act":"prelu"},
        {"layer":"maxpool", "k":2},
        {"layer":"bn", "scope":"seq_batch_norm_2"},
        {"layer":"conv", "k":5, "d":256, "act":"prelu"},
        {"layer":"maxpool", "k":2},
        {"layer":"bn", "scope":"seq_batch_norm_3"},
        {"layer":"conv", "k":5, "d":512, "act":"prelu"},
        {"layer":"maxpool", "k":2},
        {"layer":"bn", "scope":"seq_batch_norm_4"},
        {"layer":"conv", "k":5, "d":512, "act":"prelu"},
        {"layer":"maxpool", "k":2},
        {"layer":"bn", "scope":"seq_batch_norm_5"},
        {"layer":"spp", "levels":3, "div":4},
        {"layer":"fc", "n":512, "act":"prelu"},
        {"layer":"dropout", "p":0.5, "scope":"fc0_dropout"},
        {"layer":"bn", "scale":1, "scope":"fc0_batch_norm"},
        {"layer":"fc", "n":"#out", "scope":"fcClassify"}
    ]
The network consists of several layers, which must be listed in order, ending with the output layer. These are the available layer types, along with their possible configuration parameters (the available activation functions are none, prelu, relu, sigmoid, abs and tanh):
In addition, each layer can have a scope parameter which specifies a name for the layer. If two conv or fc layers have the same scope specified, then they will share the same weights.
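Since a repeated scope silently ties the weights of two layers, it can be worth checking a network description for shared scopes before training. The following Python sketch does that for a small network fragment. The fragment itself and the helper `shared_scopes` are made up for illustration and are not part of SECLAF; the two layers sharing `shared_conv` were chosen so that both take 128 input channels and produce 128 output channels.

```python
import json
from collections import defaultdict
from typing import Dict, List

# Hypothetical fragment in the style of the "network" key of config.json.
network_json = """
[
  {"layer": "conv", "k": 6, "d": 128, "act": "prelu"},
  {"layer": "conv", "k": 6, "d": 128, "act": "prelu", "scope": "shared_conv"},
  {"layer": "conv", "k": 6, "d": 128, "act": "prelu", "scope": "shared_conv"},
  {"layer": "spp", "levels": 3, "div": 4},
  {"layer": "fc", "n": "#out", "scope": "fcClassify"}
]
"""


def shared_scopes(network: List[dict]) -> Dict[str, List[int]]:
    """Return scopes used by more than one conv or fc layer, with layer indices."""
    by_scope: Dict[str, List[int]] = defaultdict(list)
    for index, layer in enumerate(network):
        if layer.get("layer") in ("conv", "fc") and "scope" in layer:
            by_scope[layer["scope"]].append(index)
    return {scope: idx for scope, idx in by_scope.items() if len(idx) > 1}


network = json.loads(network_json)
shared = shared_scopes(network)
print(shared)
```

An empty result means that no two conv or fc layers share weights; any entry lists the positions of the layers that do.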
The network model of SECLAF can be thought of as a pipeline having two stages which must follow each other in this order:
Technical information: In the variable stage, the data is a matrix shaped [variable_length, channel_count] for each sequence, where variable_length can be different for each sequence, but channel_count is constant. In the fixed stage, the data is a fixed-length vector.
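The change of shape between the two stages can be illustrated with a few lines of plain Python. The sketch below uses global max pooling, which is the simplest reduction from a [variable_length, channel_count] matrix to a fixed-length vector. It is a stand-in chosen for illustration and does not reproduce the spp layer shown in the network example; `global_max_pool` is a hypothetical name.

```python
from typing import List


def global_max_pool(matrix: List[List[float]]) -> List[float]:
    """Reduce a [variable_length, channel_count] matrix to [channel_count]."""
    if not matrix:
        raise ValueError("the sequence must contain at least one position")
    # zip(*matrix) iterates over channels; take the maximum over positions.
    return [max(channel) for channel in zip(*matrix)]


# Two "sequences" of different length, both with channel_count = 2.
short_seq = [[0.1, 0.9], [0.4, 0.2]]
long_seq = [[0.3, 0.0], [0.8, 0.5], [0.2, 0.7]]

fixed_short = global_max_pool(short_seq)
fixed_long = global_max_pool(long_seq)
print(len(fixed_short), len(fixed_long))
```

Both results have the same length regardless of the input length, which is what allows fixed-size layers to follow.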
The input sequence must be encoded as an array of numbers for the neural network. The sequence encoder assigns a numeric vector to each nucleotide or amino acid in the sequence.
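As an illustration of what a sequence encoder does, the Python sketch below implements one-hot encoding for DNA, a common choice in which each nucleotide becomes a vector with a single 1. It is shown as an example of the idea only and makes no claim about which encoders SECLAF provides; the function name, the ACGT ordering and the all-zero row for unknown symbols are assumptions.

```python
from typing import List

DNA_ALPHABET = "ACGT"  # assumed ordering, for illustration only


def one_hot_encode(sequence: str, alphabet: str = DNA_ALPHABET) -> List[List[float]]:
    """Return a [len(sequence), len(alphabet)] matrix of 0.0/1.0 values."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = []
    for symbol in sequence.upper():
        row = [0.0] * len(alphabet)
        # Symbols outside the alphabet (e.g. N) stay as an all-zero row.
        if symbol in index:
            row[index[symbol]] = 1.0
        encoded.append(row)
    return encoded


matrix = one_hot_encode("ACGN")
for row in matrix:
    print(row)
```

The result has one row per position and one column per symbol, matching the [variable_length, channel_count] layout of the variable stage.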
Available encoders for protein sequences:
Available encoders for DNA sequences: