Phylogenetic sequence data sets across 54,671 eukaryotic genera
GenBank release 184: clusters, alignments, and trees now available
This database provides a snapshot of the current taxonomic distribution of nucleotide sequences in GenBank.
Its purpose is to convey information about the potential phylogenetic data sets (clusters, or sets of homologous sequences) that can be constructed from the database for taxa of interest. It mirrors the NCBI taxonomy tree.
The number of clusters is estimated by all-against-all BLAST searches followed by sequence clustering algorithms. Clustering is done for all nodes with fewer than 35,000 sequences, and sequences longer than 7,500 nt are excluded.
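The size limits quoted above can be read as a simple pre-filter applied before clustering. The sketch below is illustrative only: the function name, the (identifier, sequence) pair layout, the choice to count a node's sequences before removing long ones, and the choice to return nothing for an oversized node are all assumptions, not a description of the actual pipeline.

```python
# Thresholds taken from the text above.
MAX_SEQUENCES_PER_NODE = 35_000  # a node must have fewer sequences than this
MAX_SEQUENCE_LENGTH_NT = 7_500   # sequences longer than this are excluded


def sequences_to_cluster(node_sequences):
    """Return the sequences of one taxonomy node that would enter clustering.

    `node_sequences` is a list of (identifier, sequence) pairs; this layout
    is hypothetical and chosen only for the example.
    """
    # Assumption: oversized nodes are simply skipped by this sketch.
    if len(node_sequences) >= MAX_SEQUENCES_PER_NODE:
        return []
    # Drop individual sequences that exceed the length limit.
    return [
        (seq_id, seq)
        for seq_id, seq in node_sequences
        if len(seq) <= MAX_SEQUENCE_LENGTH_NT
    ]
```

For example, a node holding one 4 nt sequence and one 7,501 nt sequence would contribute only the first to the all-against-all BLAST step.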
Model organisms are defined as any node (not subtree) having more than 100 clusters or more than 10,000 sequences. By default, sequence tallies for model organisms propagate upward in the tree along with those for nonmodel organisms, but this information can be excluded so that users can get a sense of the taxonomic breadth of the sequence diversity in the database. Note, however, that the bulk of "genomic" data for model organisms is not entered in the database at all (see below for types of sequences included).
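The model-organism rule and the optional exclusion of model-organism tallies can be sketched as follows. The `Node` class, its field names, and both function names are hypothetical; only the two thresholds come from the definition above, and the test is applied to a node's own counts rather than to its subtree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    n_sequences: int = 0  # sequences assigned directly to this node
    n_clusters: int = 0   # clusters built for this node alone
    children: list = field(default_factory=list)


def is_model_organism(node):
    """Apply the thresholds to the node itself, not to its subtree."""
    return node.n_clusters > 100 or node.n_sequences > 10_000


def subtree_sequence_tally(node, include_model=True):
    """Sum sequence counts up the tree, optionally skipping model organisms."""
    own = node.n_sequences
    if not include_model and is_model_organism(node):
        own = 0
    return own + sum(
        subtree_sequence_tally(child, include_model) for child in node.children
    )


# Made-up counts for illustration only.
genus = Node("genus", children=[Node("sp1", 50_000, 300), Node("sp2", 120, 8)])
print(subtree_sequence_tally(genus))                       # 50120
print(subtree_sequence_tally(genus, include_model=False))  # 120
```

With the model organism excluded, the tally for the genus reflects only the sparsely sampled species, which is the "taxonomic breadth" view described above.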
Cluster tallies are linked to a view of the data availability matrix for that node in the taxonomy tree, which can provide useful guidance for supermatrix and supertree construction. Sequences for each cluster can be downloaded as an unaligned FASTA file for further analysis. Provisional alignments and phylogenetic trees are also provided.
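As a rough illustration of what a data availability matrix conveys, the sketch below builds a taxon-by-cluster presence/absence table from a mapping of clusters to the taxa they contain. The input layout, the function name, and the 0/1 encoding are assumptions made for the example and may differ from the view the database actually presents.

```python
def availability_matrix(cluster_taxa):
    """Build a presence/absence table of taxa (rows) by clusters (columns).

    `cluster_taxa` maps a cluster id to the set of taxon names that have at
    least one sequence in that cluster (hypothetical input layout).
    """
    clusters = sorted(cluster_taxa)
    taxa = sorted(set().union(*cluster_taxa.values()))
    matrix = [
        [1 if taxon in cluster_taxa[cluster] else 0 for cluster in clusters]
        for taxon in taxa
    ]
    return taxa, clusters, matrix


# Made-up clusters for illustration only.
taxa, clusters, matrix = availability_matrix(
    {"c1": {"A", "B", "C"}, "c2": {"B", "C"}, "c3": {"A"}}
)
for taxon, row in zip(taxa, matrix):
    print(taxon, row)
```

Rows with many 1s mark well-sampled taxa and columns with many 1s mark widely sampled loci, which is the kind of guidance that helps when choosing clusters for a supermatrix or supertree.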
To see a list of "biodiversity research hotspots" (families with the largest increase in species since the last release), click here (New!).
For a list of model organisms, click here.
For more information on how the clustering was implemented, click here.
For downloads of this or previous releases of the entire database, or downloads of trees only (new!), click here.
Finally, for more information about the developers, how to cite, etc., click here.
Types of sequences included: Only "core" nucleotide data are included, which excludes ESTs, STSs, and other kinds of bulk or high-throughput sequences.
Taxonomic coverage: At present the database contains sequences from eukaryotes. These represent the PLN, MAM, PRI, ROD, VRT, and INV divisions of GenBank.
GenBank release:184 (June 15, 2011)
Number of sequences in this database:5798234
Number of nodes in our subtree(s) of the NCBI taxonomy tree:498220
Number of terminal nodes:415135
Number of nodes clustered (usually terminal taxa):372324
Number of subtrees clustered (always internal nodes):81382
Number of nodes with sequences that can be clustered:446604
Clusters:
Total number of clusters:2840487
Number of phylogenetically informative clusters (TIs >= 4):160972
Number of singleton clusters (GIs = 1):2039069
Number of large clusters (GIs >= 100):26530
Number of large clusters (TIs >= 100):6636
Size of largest cluster (w.r.t. GIs):20125
Size of largest cluster (w.r.t. TIs):6222
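The cluster categories in this list can be written as simple threshold tests. The sketch below reads GIs as the number of sequences in a cluster and TIs as the number of distinct taxa, which is how the abbreviations appear to be used here; the function name and label strings are made up for the example.

```python
def classify_cluster(n_gis, n_tis):
    """Return the category labels that apply to one cluster.

    n_gis: number of sequences in the cluster (assumed meaning of "GIs")
    n_tis: number of distinct taxa in the cluster (assumed meaning of "TIs")
    """
    labels = set()
    if n_tis >= 4:
        labels.add("phylogenetically informative")
    if n_gis == 1:
        labels.add("singleton")
    if n_gis >= 100:
        labels.add("large (GIs)")
    if n_tis >= 100:
        labels.add("large (TIs)")
    return labels


# Shares of all clusters, computed from the totals listed above.
total = 2840487
print(round(100 * 160972 / total, 1))   # informative clusters, percent
print(round(100 * 2039069 / total, 1))  # singleton clusters, percent
```

On these totals, roughly 5.7% of clusters are phylogenetically informative and roughly 71.8% are singletons.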
Supported by a grant from the US NSF Assembling the Tree of Life Program.
Questions or comments? Contact Mike Sanderson (sanderm@email.arizona.edu).