PhyLoTA Browser

Phylogenetic sequence data sets across 61,108 eukaryotic genera

GenBank rel. 194 clusters now available (alignments and trees under construction)



This database provides a snapshot of the current taxonomic distribution of nucleotide sequences in GenBank. Its purpose is to convey information about the potential phylogenetic data sets (clusters, or sets of homologous sequences) that can be constructed from the database for taxa of interest. It mirrors the NCBI taxonomy tree. The number of clusters is estimated by all-against-all BLAST searches and sequence clustering algorithms (for all nodes with < 35,000 sequences, and excluding sequences > 7,500 nt in length). A model organism is defined as any node (not subtree) having more than 100 clusters or more than 10,000 sequences. By default, sequence tallies for model organisms propagate upward in the tree along with those for nonmodel organisms, but model organisms can be excluded from the tallies so that users can get a sense of the taxonomic breadth of the sequence diversity in the database. Note, however, that the bulk of "genomic" data for model organisms is not entered in the database at all (see below for types of sequences included). Cluster tallies are linked to a view of the data availability matrix for that node in the taxonomy tree, which can provide useful guidance for supermatrix and supertree construction. Sequences for each cluster can be downloaded as an unaligned FASTA file for further analysis. Provisional alignments and phylogenetic trees are also provided.
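The thresholds and the tally propagation described above can be illustrated with a small sketch. This is not the database's actual code: the `TaxonNode` class, the function names, and the example counts are all hypothetical, and only the numeric cutoffs come from the description above.

```python
from dataclasses import dataclass, field
from typing import List

# Cutoffs taken from the description above.
MAX_NODE_SEQUENCES = 35_000    # nodes are clustered only below this many sequences
MAX_SEQUENCE_LENGTH = 7_500    # sequences longer than this (in nt) are excluded
MODEL_MIN_CLUSTERS = 100       # a model organism has more than this many clusters...
MODEL_MIN_SEQUENCES = 10_000   # ...or more than this many sequences


@dataclass
class TaxonNode:
    """Hypothetical stand-in for one node of the taxonomy tree."""
    name: str
    n_sequences: int = 0       # sequences assigned directly to this node
    n_clusters: int = 0        # clusters built for this node
    children: List["TaxonNode"] = field(default_factory=list)


def is_model_organism(node: TaxonNode) -> bool:
    """Apply the node-level (not subtree-level) model organism definition."""
    return (node.n_clusters > MODEL_MIN_CLUSTERS
            or node.n_sequences > MODEL_MIN_SEQUENCES)


def sequence_is_eligible(length_nt: int) -> bool:
    """Sequences longer than 7,500 nt are left out of clustering."""
    return length_nt <= MAX_SEQUENCE_LENGTH


def node_is_clusterable(total_sequences: int) -> bool:
    """Only nodes with fewer than 35,000 sequences are clustered."""
    return total_sequences < MAX_NODE_SEQUENCES


def subtree_tally(node: TaxonNode, include_model: bool = True) -> int:
    """Sum sequence counts upward, optionally skipping model organisms."""
    own = node.n_sequences
    if not include_model and is_model_organism(node):
        own = 0
    return own + sum(subtree_tally(c, include_model) for c in node.children)


# Made-up example: one genus with a model species and two nonmodel species.
genus = TaxonNode("Genus", children=[
    TaxonNode("species A", n_sequences=50_000, n_clusters=300),
    TaxonNode("species B", n_sequences=120, n_clusters=8),
    TaxonNode("species C", n_sequences=40, n_clusters=3),
])
tally_with_models = subtree_tally(genus)
tally_without_models = subtree_tally(genus, include_model=False)
```

With the model species included, the genus tally is dominated by a single taxon; excluding it leaves only the counts from the two nonmodel species, which gives a better sense of taxonomic breadth.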

To see a list of "biodiversity research hotspots" (families with the largest increase in species since the last release) click here (New!). For a list of model organisms click here. For more information on how the clustering was implemented click here.

For downloads of this or previous releases of the entire database, or downloads of trees only (new!), click here. Finally, for more information about the developers, how to cite, etc., click here.


Query with a taxon name or id number:
   Examples: Amorpha or Amor* or Amorpha * or 48130
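One way to read the example queries is sketched below. It assumes that an all-digit query is a taxon id and that `*` behaves like a shell-style wildcard matched case-insensitively; the real search may interpret patterns differently. The function names and the small taxon table are hypothetical, and the ids in it are placeholders (the pairing of 48130 with Amorpha is assumed for illustration only).

```python
from fnmatch import fnmatchcase


def parse_query(query):
    """Classify a query as a numeric taxon id or a (possibly wildcarded) name."""
    q = query.strip()
    if q.isdigit():
        return ("id", int(q))
    return ("name", q)


def match_taxa(query, taxa):
    """Return the names in `taxa` (id -> name) selected by the query."""
    kind, value = parse_query(query)
    if kind == "id":
        return [name for tid, name in taxa.items() if tid == value]
    pattern = value.lower()
    # Assumption: '*' matches any run of characters, including spaces.
    return [name for name in taxa.values()
            if fnmatchcase(name.lower(), pattern)]


# Placeholder table; ids other than 48130 are invented for this example.
TAXA = {
    48130: "Amorpha",
    900001: "Amorpha fruticosa",
    900002: "Amorpha canescens",
    900003: "Amorphophallus",
}

exact = match_taxa("Amorpha", TAXA)        # the genus name only
prefix = match_taxa("Amor*", TAXA)         # every name starting with "Amor"
species = match_taxa("Amorpha *", TAXA)    # "Amorpha" followed by a space and more text
by_id = match_taxa("48130", TAXA)          # lookup by id number
```

Under these assumptions, `Amor*` is the broadest of the four queries, while `Amorpha *` selects names below the genus without returning the genus itself.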


Types of sequences included: Only "core" nucleotide data are included, which excludes ESTs, STSs, and other kinds of bulk or high-throughput sequences.
Taxonomic coverage: At present the database contains sequences from eukaryotes. These represent the PLN, MAM, PRI, ROD, VRT, and INV divisions of GenBank.

GenBank release: 194 (Feb. 15, 2013)
Number of sequences in this database: 7,306,016
Number of nodes in our subtree(s) of the NCBI taxonomy tree: 601,559
Number of terminal nodes: 508,456
Number of nodes clustered (usually terminal taxa): 423,425
Number of subtrees clustered (always internal nodes): 90,477
Number of nodes with sequences that can be clustered: 504,921


Supported by a grant from the US NSF Assembling the Tree of Life Program. Questions or comments? Contact Mike Sanderson (sanderm@email.arizona.edu).