CEGMA
Description
CEGMA (Core Eukaryotic Genes Mapping Approach) is a pipeline for building a highly reliable set of gene annotations in virtually any eukaryotic genome. The strategy relies on a simple fact: some highly conserved proteins are encoded in essentially all eukaryotic genomes. The authors use the KOGs database to build a set of these highly conserved, ubiquitous proteins. They define a set of 458 core proteins and a protocol, CEGMA, to find orthologs of the core proteins in new genomes and to determine their exon-intron structures.
The procedure uses information from the core genes of six model organisms, first using TBLASTN to identify candidate regions in a new genome. It then proposes and refines gene structures using a combination of GeneWise, HMMER and geneid. The system uses a profile for each core protein to ensure the reliability of the gene structure.
License
GNU GENERAL PUBLIC LICENSE
Usage
Upcoming module system change alert!
Because of the large number of applications and their versions, it is not practical to list them all explicitly on our wiki pages. An upgrade of the modulefiles is therefore underway. One feature of this upgrade is a default module for every application. The default module does not need a version number and loads one particular version (usually the latest).
You can test the new version now by adding the line
source /cvmfs/software.metacentrum.cz/modulefiles/5.1.0/loadmodules
to your script before loading a module. You can then list all versions of cegma and load the default version with
module avail cegma/ # list available modules
module load cegma # load (default) module
If you prefer to stay with the current system, that is still possible. Simply list all modules with
module avail cegma
and choose the explicit version you want to use.
CEGMA requires access to BLAST+. List the available BLAST modules with
module avail blast
Afterwards, you can run the following command
cegma
You will see a list of parameters that you can use to run the application.
If you use multiple cores, specify the number of threads with the option --threads <number>. You can use the variable $PBS_NUM_PPN to get the number of reserved CPUs:
--threads $PBS_NUM_PPN
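As a minimal sketch of how the thread option might be assembled in a job script: the snippet below falls back to a single thread when $PBS_NUM_PPN is not set (for example, when testing outside a PBS job) and prints the resulting command instead of running it. The input file name genome.fasta is a hypothetical placeholder, and the --genome option is assumed from CEGMA's own help output; check the parameter list printed by cegma before relying on it.

```shell
#!/bin/sh
# Sketch only: genome.fasta is a hypothetical input file name, and
# --genome is assumed from the parameter list that cegma prints.

# Use the number of CPUs reserved by PBS; fall back to 1 if the variable is unset.
threads="${PBS_NUM_PPN:-1}"

# Assemble the command line so it can be checked before it is run.
cegma_cmd="cegma --genome genome.fasta --threads $threads"

# Print the command; replace echo with eval (or paste the line) to run it for real.
echo "$cegma_cmd"
```

Printing the command first makes it easy to confirm that the thread count matches the number of CPUs requested from the scheduler.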
Documentation
Online documentation is available at http://korflab.ucdavis.edu/Datasets/cegma
The original article is available at http://bioinformatics.oxfordjournals.org/content/23/9/1061.full.pdf+html