Phylogenetic sequence data sets across 61,108 eukaryotic genera
GenBank rel. 194 clusters now available (alignments and trees under construction)
This database provides a snapshot of the current taxonomic distribution of nucleotide sequences in GenBank. Its purpose is to convey information about the potential phylogenetic data sets (clusters, or sets of homologous sequences) that can be constructed from the database for taxa of interest. Its structure mirrors the NCBI taxonomy tree. The number of clusters is estimated by all-against-all BLAST searches and sequence clustering algorithms (for all nodes with < 35,000 sequences, and excluding sequences > 7,500 nt in length). Model organisms are defined as any node (not subtree) having more than 100 clusters or more than 10,000 sequences. By default, sequence tallies for model organisms propagate upward in the tree along with those for nonmodel organisms, but the model-organism tallies can be excluded so that users can get a sense of the taxonomic breadth of the sequence diversity in the database. Note, however, that the bulk of “genomic” data for model organisms is not entered in the database at all (see below for types of sequences included). Cluster tallies are linked to a view of the data availability matrix for that node in the taxonomy tree, which can provide useful guidance for supermatrix and supertree construction. Sequences for each cluster can be downloaded as an unaligned FASTA file for further analysis. Provisional alignments and phylogenetic trees are also provided.
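The model-organism rule and the upward propagation of sequence tallies described above can be illustrated with a short Python sketch. This is not the database's actual implementation, which is not described here. The `TaxonNode` class, its field names, and the functions `is_model` and `subtree_seq_tally` are hypothetical, and the example counts are invented for illustration. Only the two thresholds come from the text.

```python
from dataclasses import dataclass, field

# Thresholds taken from the description above.
MODEL_MIN_CLUSTERS = 100    # a model organism has more than 100 clusters ...
MODEL_MIN_SEQS = 10_000     # ... or more than 10,000 sequences


@dataclass
class TaxonNode:
    """Hypothetical taxonomy node; counts refer to the node itself, not its subtree."""
    name: str
    n_seqs: int
    n_clusters: int
    children: list["TaxonNode"] = field(default_factory=list)


def is_model(node: TaxonNode) -> bool:
    """Apply the model-organism rule to the node's own counts (not the subtree)."""
    return node.n_clusters > MODEL_MIN_CLUSTERS or node.n_seqs > MODEL_MIN_SEQS


def subtree_seq_tally(node: TaxonNode, include_model: bool = True) -> int:
    """Sum sequence counts upward through the tree.

    With include_model=False, the sequences of model-organism nodes are left
    out of the tally, but their descendants are still visited.
    """
    own = node.n_seqs if (include_model or not is_model(node)) else 0
    return own + sum(subtree_seq_tally(c, include_model) for c in node.children)


# Invented example: one genus with a model and a nonmodel species.
genus = TaxonNode("Examplegenus", 50, 3, [
    TaxonNode("species_a", 12_000, 150),   # model: exceeds both thresholds
    TaxonNode("species_b", 200, 8),        # nonmodel
])
default_tally = subtree_seq_tally(genus)                        # 12250
nonmodel_tally = subtree_seq_tally(genus, include_model=False)  # 250
```

Comparing the two tallies shows how much of a node's apparent sequence count comes from a few heavily sampled taxa rather than from broad taxonomic coverage.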