CD-HIT is a widely used program for clustering and comparing protein or nucleotide sequences. It is fast and can handle very large databases. CD-HIT significantly reduces the computational and manual effort in many sequence analysis tasks, and it aids in understanding the data structure and correcting the bias within a dataset.
Version 4.6.1. Freely available to users. Modules:
Initialize the environment with the command:
module add cdhit-4.6.1
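Once the module is loaded, a typical protein clustering run might look like the sketch below. The file names proteins.fasta and proteins_nr90 are hypothetical placeholders, and the 90% identity threshold is only an illustrative choice. The script prints the command instead of running it when cd-hit or the input file is not found.

```shell
#!/bin/sh
# Hypothetical file names; replace with your own data.
input="proteins.fasta"
output="proteins_nr90"

# -i: input FASTA file
# -o: output file name
# -c 0.9: cluster at 90% sequence identity (illustrative choice)
# -n 5: word length, suitable for identity thresholds of 0.7 and above
cmd="cd-hit -i $input -o $output -c 0.9 -n 5"

if command -v cd-hit >/dev/null 2>&1 && [ -f "$input" ]; then
    # Writes representative sequences to $output and
    # cluster membership to $output.clstr
    $cmd
else
    # Program or input not available: just show the command that would run.
    echo "$cmd"
fi
```

For nucleotide sequences the package provides the separate program cd-hit-est, which takes the same -i, -o and -c options.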
Documentation is at http://weizhong-lab.ucsd.edu/cd-hit/ref.php .