Blast
Description
The NCBI Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
BlastDB
We maintain a local copy of the BLAST databases in the /storage/projects/BlastDB directory. The databases are ready to use.
- For short or single-query jobs, you can use the databases directly from storage and refer to them in the batch script by their full path, i.e. /storage/projects/BlastDB/DB_NAME_PREFIX.
- If you run a longer job, multiple queries, or multiple jobs with a particular DB, it is more efficient to copy the database to the scratch directory.
In both cases, refer to the database (-db option) within your blastn/blastp/tblastx job by its name prefix only, without any file extension (e.g. nt, nr, wgs, refseq_genomic), preceded by the directory that holds the database files. For example: -db /storage/projects/BlastDB/nt
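The two ways of pointing -db at a database can be sketched as a small helper. This is only an illustration: the function name db_arg and the BLASTDB_ROOT variable are hypothetical, and the check for a local copy (any file named DB_NAME_PREFIX.* in the given directory) is an assumption about how the copied files look.

```shell
#!/bin/sh
# Hypothetical helper: print the value to pass to -db.
# BLASTDB_ROOT is an illustrative variable; its default is the path from this page.
BLASTDB_ROOT="${BLASTDB_ROOT:-/storage/projects/BlastDB}"

db_arg() {
    name="$1"        # database name prefix, e.g. nt or nr
    dir="${2:-}"     # optional directory that may hold a local copy
    # Assumption: a local copy shows up as files named "<prefix>.<something>"
    if [ -n "$dir" ] && ls "$dir/$name".* >/dev/null 2>&1; then
        printf '%s\n' "$dir/$name"
    else
        printf '%s\n' "$BLASTDB_ROOT/$name"
    fi
}

# Possible use inside a batch script (not executed here):
#   blastn -db "$(db_arg nt "$SCRATCHDIR")" -query INPUT_FASTA -out OUTPUT_NAME
```

With no copy in the scratch directory the helper falls back to the storage path, so the same script line would work for both short jobs and jobs that copied the database first.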
All available databases are described on the NCBI website. We mirror all of them. The offline copy was last updated in May 2022. If you need a database updated or a new one added, please contact user support at meta@cesnet.cz.
A new DB release contains very large GI numbers (GenInfo Identifiers), which are incompatible with older versions of the blast modules. Use the latest version of the blast module to prevent potential incompatibilities.
Usage
Blast and Blast+: upcoming module system change
Because of the large number of applications and their versions, it is not practical to list them all explicitly on our wiki pages, so an upgrade of the modulefiles is underway. After the upgrade, every application will have a default module. The default module is loaded without a version number and resolves to one (usually the latest) version.
You can test the new system now by adding the line
source /cvmfs/software.metacentrum.cz/modulefiles/5.1.0/loadmodules
to your script before loading a module. You can then list all versions of blast or blast+ and load the default version:
module avail blast/ # list available blast modules
module load blast # load the default blast module
module avail blast+/ # list available blast+ modules
module load blast+ # load the default blast+ module
If you prefer to stay with the current system, that is still possible. List all modules with
module avail blast
module avail blast+
and choose the explicit version you want to use.
BLAST programs:
- blastp - compares an amino acid query sequence against a protein sequence database
- blastn - compares a nucleotide query sequence against a nucleotide sequence database
- blastx - compares a nucleotide query sequence translated in all reading frames against a protein sequence database
- tblastn - compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
- tblastx - compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
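The list above amounts to a lookup from the query type and database type to a program name. A minimal sketch of that lookup follows; the function name pick_blast_program and the "translated" keyword are hypothetical, and the nucl/prot labels are borrowed from the makeblastdb -dbtype values.

```shell
#!/bin/sh
# Hypothetical helper: map (query type, database type) to a BLAST program.
# The optional third argument "translated" only distinguishes tblastx from
# blastn, since both take a nucleotide query and a nucleotide database.
pick_blast_program() {
    query_type="$1"
    db_type="$2"
    mode="${3:-direct}"
    case "$query_type:$db_type:$mode" in
        prot:prot:direct)     echo blastp ;;
        nucl:nucl:direct)     echo blastn ;;
        nucl:prot:direct)     echo blastx ;;
        prot:nucl:direct)     echo tblastn ;;
        nucl:nucl:translated) echo tblastx ;;
        *)
            echo "unsupported combination: $query_type vs $db_type ($mode)" >&2
            return 1
            ;;
    esac
}
```

For example, pick_blast_program nucl prot prints blastx, matching the description of a nucleotide query compared against a protein database.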
BLAST command line - manual and examples:
- https://www.ncbi.nlm.nih.gov/books/NBK279675/
- blast database: makeblastdb -input_type fasta -in FASTA_FILE -dbtype nucl -title NAME -out NAME
- blastn: blastn -db DATABASE_NAME -query INPUT_FASTA -out OUTPUT_NAME -max_target_seqs 1 -evalue 1e-5 -num_threads 8
- blastp: blastp -db DATABASE_NAME -query INPUT_FASTA -out OUTPUT_NAME -max_target_seqs 1 -evalue 1e-5 -num_threads 8
- tblastx: tblastx -db DATABASE_NAME -query INPUT_FASTA -out OUTPUT_NAME -max_target_seqs 1 -evalue 1e-5 -num_threads 8
- blastx: blastx -db DATABASE_NAME -query INPUT_FASTA -out OUTPUT_NAME -max_target_seqs 1 -evalue 1e-5 -num_threads 8
- tblastn: tblastn -db DATABASE_NAME -query INPUT_FASTA -out OUTPUT_NAME -max_target_seqs 1 -evalue 1e-5 -num_threads 8
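Since the five search commands above share the same options, they can be wrapped in one function. The sketch below is an illustration only: run_blast and the DRY_RUN variable are hypothetical names, and the fixed options are simply those shown in the examples above. With DRY_RUN set, the command is printed instead of executed, which may be handy for checking a batch script before submitting it.

```shell
#!/bin/sh
# Hypothetical wrapper around the command pattern listed above.
# Arguments: program, database, query FASTA, output file, thread count.
run_blast() {
    prog="$1"
    db="$2"
    query="$3"
    out="$4"
    threads="${5:-1}"
    # Rebuild the argument list so that each value stays a single word
    set -- "$prog" -db "$db" -query "$query" -out "$out" \
        -max_target_seqs 1 -evalue 1e-5 -num_threads "$threads"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"    # print the command only
    else
        "$@"         # run the BLAST program
    fi
}

# Possible use (not executed here):
#   run_blast blastn /storage/projects/BlastDB/nt INPUT_FASTA OUTPUT_NAME 8
```

The thread count passed to -num_threads should presumably not exceed the number of CPUs requested for the job.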
Optimize network load when running multiple jobs with the same DB
If you need to run several BLAST jobs with the same database, please reduce the network load by copying the database only once and reusing it for all jobs running on the same node. (This requires that you do not clean the content of your scratch directory after the first job has finished!)
This can be done by inserting the following construction into the batch script:
DB="nt" # name of the database you need to use
TIMEOUT=120 # maximum number of seconds to wait for a copy in progress
TIMEWAIT=0
LINKDB=false
...
# Enter the scratch dir
cd "$SCRATCHDIR" || exit 4
...
# search the content of all your other scratch directories on that node
# and look for a file called ${DB}.db_here
LOCAL_DB=$(find .. -name "${DB}.db_here" -print -quit) # LOCAL_DB contains a path as well, contrary to DB
# if the file exists, do...
if [ -n "$LOCAL_DB" ]; then
    LINKDB=true
    LOCAL_DB="${LOCAL_DB%%.db_here}" # cut off the ".db_here" suffix
    # if the file "${LOCAL_DB}.db_is_ready" does NOT yet exist in the scratch dir where LOCAL_DB resides, wait for it
    while ! test -f "${LOCAL_DB}.db_is_ready"; do
        sleep 5
        TIMEWAIT=$((TIMEWAIT+5))
        if [ "$TIMEWAIT" -gt "$TIMEOUT" ]; then
            echo "timed out"
            LINKDB=false # must be set before break, otherwise it is never reached
            break
        fi
    done
fi
if $LINKDB; then
    # the DB exists somewhere on this machine and is complete, so we can link it
    ln "${LOCAL_DB}"* . || exit 5 # link everything into current scratch directory
else
    # the DB either does not exist on this machine or is not complete, so copy it from /storage/projects
    touch "${DB}.db_here"
    cp -p /storage/projects/BlastDB/"${DB}"* . && touch "${DB}.db_is_ready" || exit 6
    # ${DB}.db_is_ready is an empty file just telling your other future jobs on this machine that the cp operation has finished
    export CLEAN_SCRATCH=false # do not remove content of this scratch
fi
....
# then run the calculation
blastp -db "./${DB}" -query INPUT_FASTA -out OUTPUT_NAME ...
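The waiting step in the script above (poll for the marker file, give up after a timeout) can also be written as a reusable function. This is a sketch rather than part of the recommended script: the name wait_for_file and its argument order are hypothetical, and the defaults mirror the 120-second timeout and 5-second sleep used above.

```shell
#!/bin/sh
# Hypothetical helper: wait until a marker file exists.
# Arguments: file path, timeout in seconds (default 120), poll interval (default 5).
# Returns 0 once the file exists, 1 if the timeout expires first.
wait_for_file() {
    file="$1"
    timeout="${2:-120}"
    step="${3:-5}"
    waited=0
    while [ ! -f "$file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$step"
        waited=$((waited + step))
    done
    return 0
}

# Possible use in the batch script (not executed here):
#   if wait_for_file "${LOCAL_DB}.db_is_ready" "$TIMEOUT"; then LINKDB=true; else LINKDB=false; fi
```

Returning a status instead of setting a flag inside the loop keeps the decision between linking and copying in one place.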
Documentation
http://www.ncbi.nlm.nih.gov/books/NBK1762/
Licence
This software/database is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the author's official duties as a United States Government employee and thus cannot be copyrighted. This software/database is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.