GenomeAnalysisTK (GATK)

From MetaCentrum


The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.


Apache License 2.0 (licensing of versions 3 and 4 differs)


Upcoming module system change alert!

Because of the large number of applications and their versions, it is not practical to list them all explicitly on our wiki pages. An upgrade of the modulefiles is therefore underway. One feature of this upgrade is a default module for every application. The default module does not need a version number and loads some (usually the latest) version.

You can test the new version now by adding a line

source /cvmfs/

to your script before loading a module. You can then list all versions of gatk and load the default version with

module avail gatk/ # list available modules
module load gatk   # load (default) module

If you prefer to stay with the current system, that is still possible. Simply list all modules with

module avail gatk

and choose the explicit version you want to use.

Version 4

First, you have to prepare your environment by executing:

 module add gatk-

GATK 4 comes with a wrapper script, gatk, which significantly simplifies commands. You can simply run

  gatk --help    # print help
  gatk --list    # list all available tools in the toolkit

The basic structure for invoking a tool named ToolName is:

  gatk [--java-options "jvm args like -Xmx4G go here"] ToolName [GATK args go here]

This is what a command might look like in practice:

  gatk --java-options "-Xmx8G" HaplotypeCaller -R reference.fasta -I input.bam -O output.vcf

If you are not familiar with this syntax, please see the official Getting started with GATK4.
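The invocation structure above can be wrapped in a small shell function, which keeps the JVM options in one place when a job calls several tools. This is only a sketch: the function name run_gatk and the variables GATK_JAVA_OPTS and DRY_RUN are hypothetical helpers introduced here, not part of GATK or of the module. In dry-run mode the function only prints the command it would execute.

```shell
#!/bin/sh
# Sketch only: wrapper following the structure
#   gatk [--java-options "..."] ToolName [GATK args]
# run_gatk, GATK_JAVA_OPTS and DRY_RUN are hypothetical names used for
# illustration; they are not provided by GATK or by the gatk module.
run_gatk() {
    tool=$1
    shift
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only show what would be executed.
        printf '%s\n' "gatk --java-options \"${GATK_JAVA_OPTS:--Xmx4G}\" $tool $*"
    else
        # Real call; requires the gatk module to be loaded first.
        gatk --java-options "${GATK_JAVA_OPTS:--Xmx4G}" "$tool" "$@"
    fi
}

# Dry run of the HaplotypeCaller example from the text.
DRY_RUN=1
GATK_JAVA_OPTS="-Xmx8G"
run_gatk HaplotypeCaller -R reference.fasta -I input.bam -O output.vcf
```

Setting DRY_RUN=0 (or leaving it unset) after loading the module runs the tool instead of printing the command.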

Version 3.8-0 and older

Examples of environment initialization:

module add gatk-2.7.2
module add gatk-3.7
module add gatk-3.8-0

Initialization also makes Java 7 available (Java 8 for versions 3.7 and 3.8), together with the environment variable $GATK, which points to the GATK installation directory. Example use of one of the tools with the sample data (not for version 3.8-0):

java -Xmx2g -jar "$GATK"/GenomeAnalysisTK.jar -T CountReads -R "$GATK"/resources/exampleFASTA.fasta -I "$GATK"/resources/exampleBAM.bam

When processing large data sets, the tmp directory can run out of space, which may terminate the job or slow it down significantly. In that case, redirect Java's temporary directory by adding the parameter -Djava.io.tmpdir="${SCRATCHDIR}"/tmp to the java command.
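As a minimal sketch of the temporary directory setup, the snippet below creates a tmp directory in the scratch space and passes it to the JVM through the standard java.io.tmpdir system property. The variable names JAVA_TMP and JAVA_TMP_OPT are hypothetical, and the fallback to the system temporary directory when SCRATCHDIR is unset is an assumption of this sketch so that it can be tried outside a job. The java command runs only if a GATK 3 module has been loaded.

```shell
#!/bin/sh
# Sketch only: place the JVM temporary directory in the job's scratch space.
# JAVA_TMP and JAVA_TMP_OPT are hypothetical names. The fallback to
# TMPDIR or /tmp when SCRATCHDIR is unset is an assumption of this sketch.
JAVA_TMP="${SCRATCHDIR:-${TMPDIR:-/tmp}}/tmp"
mkdir -p "$JAVA_TMP"

# java.io.tmpdir is the standard JVM system property for temporary files.
JAVA_TMP_OPT="-Djava.io.tmpdir=$JAVA_TMP"

if [ -n "${GATK:-}" ] && command -v java >/dev/null 2>&1; then
    # Same CountReads example as in the text (sample data: not for 3.8-0).
    java -Xmx2g "$JAVA_TMP_OPT" -jar "$GATK"/GenomeAnalysisTK.jar -T CountReads \
        -R "$GATK"/resources/exampleFASTA.fasta -I "$GATK"/resources/exampleBAM.bam
else
    echo "GATK module not loaded; the JVM option would be: $JAVA_TMP_OPT"
fi
```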

List of tools and version check:

java -Xmx2g -jar "$GATK"/GenomeAnalysisTK.jar --help
java -Xmx2g -jar "$GATK"/GenomeAnalysisTK.jar --version