GenomeAnalysisTK (GATK)
Description
The Genome Analysis Toolkit (GATK) is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as a strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
License
Apache License 2.0 (licensing of versions 3 and 4 differs)
Usage
Upcoming module system change alert!
Due to the large number of applications and their versions, it is not practical to keep them explicitly listed on our wiki pages. Therefore, an upgrade of the modulefiles is underway. One feature of this upgrade is a default module for every application. The default module does not require a version number and loads one version, usually the latest.
You can test the new version now by adding the line
source /cvmfs/software.metacentrum.cz/modulefiles/5.1.0/loadmodules
to your script before loading a module. You can then list all versions of gatk and load the default version with
module avail gatk/ # list available modules
module load gatk # load (default) module
If you wish to keep using the current system, that is still possible. Simply list all modules with
module avail gatk
and choose the explicit version you want to use.
Version 4
First, you have to prepare your environment by executing:
module add gatk-4.1.6.0
GATK 4 comes with a wrapper script, gatk, which significantly simplifies commands. You can now just run
gatk --help # to print help
gatk --list # to list all available tools inside the toolkit
to get help or to list all available tools in the toolkit. The basic structure for invoking a tool named ToolName is:
gatk [--java-options "jvm args like -Xmx4G go here"] ToolName [GATK args go here]
This is what a command might look like in practice:
gatk --java-options "-Xmx8G" HaplotypeCaller -R reference.fasta -I input.bam -O output.vcf
If you are not familiar with this syntax, please see the official Getting started with GATK4.
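The invocation pattern above can be wrapped in a small shell function, which may be convenient when a job script calls several tools with the same Java heap size. This is only a sketch: the function name run_gatk and its argument order are hypothetical and not part of GATK. It assumes the gatk wrapper is already on the PATH after loading the module.

```shell
# Hypothetical helper (not part of GATK): call a GATK 4 tool with a chosen
# Java heap size, following the pattern
#   gatk [--java-options "..."] ToolName [GATK args]
# Usage: run_gatk HEAP TOOL [tool args...]
run_gatk() {
    if [ "$#" -lt 2 ]; then
        echo "usage: run_gatk HEAP TOOL [tool args...]" >&2
        return 1
    fi
    heap="$1"
    tool="$2"
    shift 2
    # "$@" passes the remaining arguments through unchanged, so file names
    # containing spaces stay intact.
    gatk --java-options "-Xmx${heap}" "$tool" "$@"
}
```

With the module loaded, run_gatk 8G HaplotypeCaller -R reference.fasta -I input.bam -O output.vcf would be equivalent to the HaplotypeCaller command shown above.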
Version 3.8-0 and older
Example of environment initialization:
module add gatk-2.7.2
or
module add gatk-3.7
or
module add gatk-3.8-0
Initialization also makes Java 7 available (Java 8 for versions 3.7 and 3.8) and sets the environment variable $GATK, which points to the GATK installation directory. Example usage of one of the tools with sample data (not for version 3.8-0):
java -Xmx2g -jar "$GATK"/GenomeAnalysisTK.jar -T CountReads -R "$GATK"/resources/exampleFASTA.fasta -I "$GATK"/resources/exampleBAM.bam
When processing large data, the tmp directory may run out of space, which can terminate the job or slow it down significantly. In this case, add the parameter -Djava.io.tmpdir="${SCRATCHDIR}"/tmp to the java command.
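As a minimal sketch of how this parameter might be used in a job script, the snippet below creates the tmp directory inside the scratch directory and defines a wrapper around the java command. The function name run_gatk3 is hypothetical, and the fallback to a mktemp directory when SCRATCHDIR is unset is an assumption added only so the snippet can be tried outside a batch job.

```shell
# SCRATCHDIR is normally set by the batch system inside a job; the fallback
# to a fresh mktemp directory is an assumption for trying this outside a job.
SCRATCHDIR="${SCRATCHDIR:-$(mktemp -d)}"

# Create the directory up front; this sketch does not rely on Java creating it.
tmp_dir="${SCRATCHDIR}/tmp"
mkdir -p "$tmp_dir"

# Hypothetical wrapper: run the GATK 3 jar with Java's temporary directory
# redirected to scratch. $GATK is set by the module (e.g. module add gatk-3.7).
# JVM options such as -Xmx and -D must come before -jar.
run_gatk3() {
    java -Xmx2g -Djava.io.tmpdir="$tmp_dir" -jar "$GATK"/GenomeAnalysisTK.jar "$@"
}
```

After loading one of the modules above, a call such as run_gatk3 -T CountReads -R "$GATK"/resources/exampleFASTA.fasta -I "$GATK"/resources/exampleBAM.bam would then write its temporary files under the scratch directory.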
List of tools and version check:
java -Xmx2g -jar "$GATK"/GenomeAnalysisTK.jar --help
java -Xmx2g -jar "$GATK"/GenomeAnalysisTK.jar --version
Documentation
https://gatk.broadinstitute.org/hc/en-us