InGAP-CDG
Description
Currently, most gene prediction methods detect coding sequences (CDSs) from transcriptome assemblies when closely related reference genomes are lacking. However, these methods have limited applicability because transcripts are often highly fragmented and assemblies contain extensive errors, which may lead to redundant or false CDS predictions. Here we present a novel algorithm, inGAP-CDG, for effective construction of full-length and non-redundant CDSs from unassembled transcriptomes. inGAP-CDG achieves this by combining a newly developed codon-based de Bruijn graph, which simplifies the assembly process, with a machine-learning-based approach that filters false positives. Compared with other methods, inGAP-CDG exhibits significantly increased predicted CDS length and robustness to sequencing errors and varied read length.
License
GNU General Public License
Usage
Upcoming module system change alert!
Due to the large number of applications and their versions, it is not practical to keep them explicitly listed on our wiki pages. Therefore, an upgrade of the modulefiles is underway. One feature of this upgrade is a default module for every application. The default module requires no version number and loads some (usually the latest) version.
You can test the new version now by adding the line
source /cvmfs/software.metacentrum.cz/modulefiles/5.1.0/loadmodules
to your script before loading a module. Then you can list all versions of ingap-cdg and load the default version of ingap-cdg with
module avail ingap-cdg/ # list available modules
module load ingap-cdg # load (default) module
If you prefer to stay with the current system, that is still possible. Simply list all modules with
module avail ingap-cdg
and choose the explicit version you want to use.
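As a minimal sketch, assuming a POSIX shell job script, the steps for the new module system could be combined as follows. The guard around the init file and the `loaded` variable are illustrative additions and not part of the instructions above; `.` is the portable spelling of `source`.

```shell
#!/bin/sh
# Path taken from the instructions above.
MODULES_INIT=/cvmfs/software.metacentrum.cz/modulefiles/5.1.0/loadmodules

# Hypothetical guard: only source the init file if it is readable on this node.
if [ -r "$MODULES_INIT" ]; then
    . "$MODULES_INIT"          # enable the upgraded modulefiles
    module load ingap-cdg      # load the default version
    loaded=yes
else
    echo "module init file not found: $MODULES_INIT" >&2
    loaded=no
fi
```

On a machine without the /cvmfs mount the script only prints a warning, which makes it safe to try outside the cluster.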
Documentation
(1) inGAP-CDG_readToCDS
./inGAP-CDG_readToCDS [options]
Options:
  -i, --input_file             Please enter your input filename (in fasta format). [required]
  -o, --output_dir             Please enter your output directory filename. If not exists, the program will create it.
  -n, --threads_num            The thread number supported by openmp. [default: 1]
  -L, --train_seq_len          The minimal length of CDSs in positive data set for SVM. [default: 1000]
  -l, --potential_ORFs_cutoff  The potential ORFs that are larger than --potential_ORFs_cutoff*read_length will be kept. [default: 0.8]
  -d, --svm_dev                The SVM classification vector value to filter false positive ORFs. [default: 0] [-0.1, 0.1]
  -k, --kmer_length            The kmer size (a triple number) used to construct codon-based de Bruijn graph. [default: 27]
  -p, --subgraph_size          The minimal number of subgraph size used to traverse. [default: 300]
  -t, --tips_length            The cutoff length of tips to be trimmed in de Bruijn graph. [default: 2*kmer_length]
  -h, --help                   Display the help information for options.

(2) inGAP-CDG_transcriptToCDS
./inGAP-CDG_transcriptToCDS [options]
Options:
  -i, --input_file             Please enter your input filename (in fasta format). [required]
  -o, --output_dir             Please enter your output directory filename. If not exists, the program will create it.
  -n, --threads_num            The thread number supported by openmp. [default: 1]
  -L, --train_seq_len          The minimal length of CDSs in positive data set for SVM. [default: 1500]
  -l, --test_seq_len           The length of test sequence for SVM. [default: 100]
  -d, --svm_dev                The SVM classification vector value to filter false positive ORFs. [default: 0] [-0.1, 0.1]
  -k, --kmer_length            The kmer size (a triple number) used to construct codon-based de Bruijn graph. [default: 27]
  -p, --subgraph_size          The minimal number of subgraph size used to traverse. [default: 300]
  -t, --tips_length            The cutoff length of tips to be trimmed in de Bruijn graph. [default: 2*kmer_length]
  -h, --help                   Display the help information for options.
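To illustrate how the options listed above fit together, here is a hedged dry-run sketch in POSIX shell. It only prints a candidate inGAP-CDG_readToCDS command line and does not run the tool. The file names reads.fasta and cds_out, both helper functions, and the reading of "a triple number" as "a multiple of three" are assumptions made for illustration.

```shell
#!/bin/sh
# Dry-run sketch: assemble an inGAP-CDG_readToCDS command from the documented options.

# Succeed only if the k-mer size is a positive multiple of three
# (assumed meaning of "a triple number" in the help text).
is_codon_kmer() {
    k=$1
    [ "$k" -gt 0 ] && [ $((k % 3)) -eq 0 ]
}

# Print the command that would be run; nothing is executed.
# Arguments: input FASTA, output directory, thread count, k-mer length.
build_readtocds_cmd() {
    input=$1
    outdir=$2
    threads=$3
    kmer=$4
    if ! is_codon_kmer "$kmer"; then
        echo "error: kmer length $kmer is not a multiple of 3" >&2
        return 1
    fi
    tips=$((2 * kmer))   # documented default for -t is 2*kmer_length
    echo "./inGAP-CDG_readToCDS -i $input -o $outdir -n $threads -k $kmer -t $tips"
}

# Hypothetical input and output names.
build_readtocds_cmd reads.fasta cds_out 4 27
```

Passing -t explicitly is optional here, since the value equals the documented default; it is shown only to make the 2*kmer_length relationship concrete. The same pattern would apply to inGAP-CDG_transcriptToCDS, whose -l option is --test_seq_len rather than --potential_ORFs_cutoff.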