Notes on the sequence clustering pipeline

Every node in the NCBI taxonomy tree has an associated collection of sequences: those assigned directly to the node (if any) plus those of all its descendants. For each node, these sequences are "clustered", provided the number of sequences is not too large (presently the limit is 35,000 sequences). The purpose of the clustering pipeline is to assemble sets of sequences that share at least local homologies (i.e., matching or nearly matching subsequences). These sets can form the basis for constructing individual phylogenetic data sets. Subsequently, clusters can be aligned and then combined using supermatrix, supertree, or other approaches.
We use BLAST at the core of this pipeline to identify all local homologies ("hits") between every pair of sequences. The hits can then be filtered in a variety of ways. For example, for phylogenetic purposes it is ideal to have sequences of nearly the same length, so that alignment programs work well and there is little missing data; the list of hits can therefore be filtered to exclude small regions of local homology. Presently, we keep only hits that have greater than 51% coverage in both directions (at a stringent BLAST e-value cutoff of 10^-10). Once a final list of hits is obtained, a set of clusters of sequences is built from it.
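
The coverage filter described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the Hit record, its field names, and the function names are hypothetical, coverage is assumed to mean the aligned span divided by the full sequence length, and the default e-value of 1e-10 is our reading of the cutoff quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAST hit between two sequences (hypothetical record layout)."""
    query: str
    subject: str
    query_len: int          # full length of the query sequence
    subject_len: int        # full length of the subject sequence
    query_aligned: int      # number of query positions covered by the hit
    subject_aligned: int    # number of subject positions covered by the hit
    evalue: float


def passes_filter(hit, min_coverage=0.51, max_evalue=1e-10):
    """Keep a hit only if it covers more than min_coverage of BOTH sequences.

    max_evalue=1e-10 is an assumed value for the e-value cutoff.
    """
    query_cov = hit.query_aligned / hit.query_len
    subject_cov = hit.subject_aligned / hit.subject_len
    return (
        hit.evalue <= max_evalue
        and query_cov > min_coverage
        and subject_cov > min_coverage
    )


def filter_hits(hits, min_coverage=0.51, max_evalue=1e-10):
    """Return the hits that survive the coverage and e-value filter."""
    return [h for h in hits if passes_filter(h, min_coverage, max_evalue)]
```

Requiring coverage in both directions is what excludes a short sequence that matches only a small region of a much longer one, even when the short sequence itself is almost fully aligned.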

The filtered hit list is then turned into a set of clusters via "single-linkage clustering": to be a member of a cluster, a sequence merely has to have a hit with any other member of that cluster. Stricter, smaller clusters can be obtained at this stage by other clustering methods, such as complete-linkage clustering. Clustering algorithms are an active area of research and many alternatives are available. The output of the pipeline is thus a set of clusters, each of which contains one or more sequences.
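
Single-linkage clustering of a hit list amounts to finding the connected components of a graph whose nodes are sequences and whose edges are hits. The following Python sketch uses a union-find structure for this; the function name and the input layout (a list of sequence identifiers plus a list of identifier pairs) are assumptions made for illustration, not a description of the pipeline's own implementation.

```python
def single_linkage(seq_ids, hit_pairs):
    """Group sequences into clusters: two sequences share a cluster
    whenever a chain of hits connects them.

    seq_ids   -- iterable of all sequence identifiers
    hit_pairs -- iterable of (id_a, id_b) pairs, one per retained hit;
                 every identifier must also appear in seq_ids
    """
    seq_ids = list(seq_ids)
    parent = {s: s for s in seq_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hit_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for s in seq_ids:
        clusters.setdefault(find(s), set()).add(s)
    return list(clusters.values())
```

For example, with sequences s1 to s5 and hits (s1, s2) and (s2, s3), the sequences s1, s2 and s3 end up in one cluster even though s1 and s3 have no direct hit, while s4 and s5 each form a singleton cluster.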

Caveats
  • The clusters have been constructed using a large set of specific BLAST parameters and filtering criteria. The composition of the clusters is sensitive to some of these parameters to varying degrees!
  • The sequences in each cluster are unaligned. They are merely guaranteed to have some subregions of homology to other sequences in the cluster.
  • Clusters can be quite heterogeneous in length. The "maximum alignment density" (MAD) reported on the individual cluster pages is helpful in identifying such cases. Multiple alignment programs have difficulty handling data sets for which this MAD value is low (including multiple local alignment programs with which we have experimented).
  • On the other hand, clusters may sometimes seem puzzling when they split apart sequences that appear homologous based on their annotations. This often occurs at moderately high divergence values, when the length of the regions that are locally homologous according to BLAST shrinks below the 51% threshold. The user may want to consider putting all of these sequences together in one cluster if alignment programs can handle the heterogeneity. Otherwise, it may be preferable to align the clusters separately and then combine them in a supermatrix or supertree procedure.


Treatment of model organisms

A model organism is defined as any node in the NCBI tree that either has >10,000 sequences (in which case no clusters are constructed for that node) or has <10,000 sequences but yields >100 clusters under the clustering procedure outlined above. Thus, organisms that have many sequences from one locus (e.g., from population genetic studies) will have only a small number of clusters and will not be considered models. At present, several hundred model organisms are recognized according to these criteria; most have <10,000 sequences but >100 clusters.
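
The two criteria can be written as a small predicate. This is a minimal sketch: the function and parameter names are hypothetical, and passing None for the cluster count is our way of representing a node that was never clustered.

```python
def is_model_organism(n_sequences, n_clusters=None,
                      max_sequences=10_000, max_clusters=100):
    """Apply the two model-organism criteria described above.

    n_sequences -- number of sequences at the node
    n_clusters  -- number of clusters produced for the node, or None
                   if the node was not clustered
    """
    if n_sequences > max_sequences:
        # Too many sequences: a model organism, and not clustered at all.
        return True
    # Otherwise a model organism only if clustering yields many clusters.
    return n_clusters is not None and n_clusters > max_clusters
```

A node with 8,000 sequences from a single locus would typically yield only a handful of clusters, so is_model_organism(8000, 3) is False, whereas a node with 5,000 sequences spread over 250 clusters is treated as a model organism.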

Model organisms receive special handling in the construction of the Phylota Browser database. The user can select whether or not sequence tallies for higher taxa include sequences from model organisms within their group. More fundamentally, construction of the higher taxon clusters treats model organisms differently. Since each tends to contain a large to very large number of sequences, most of which are phylogenetic singletons, it is computationally expensive and a bit wasteful to include them in all-against-all BLAST searches at each higher taxonomic level. Instead, we initially exclude them from clustering, build all clusters in the database without them, and then use BLAST to find sequences in the model organisms that are homologous to these already-constructed clusters. This can be done quite efficiently, at the cost of considerable dependence on the representation of clusters in the phylogenetic neighborhood of the model organisms. In other words, this procedure does not add new clusters to the database; it merely adds sequences to those clusters that would have been present anyway in the relatives of the model organisms. Nonetheless, in practice it seems to convert many phylogenetically uninformative clusters to informative clusters around the model organisms, and thereby increases the density of the data availability matrices for informative data in these regions of the tree.
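
The key property of this second pass, that sequences are added to existing clusters but no new clusters are created, can be illustrated with a short Python sketch. The function name and data layout are hypothetical. In particular, assigning each model-organism sequence to the cluster of its lowest e-value hit is an assumption made for the example; the notes above do not say how a sequence that hits several clusters is handled.

```python
def add_model_sequences(clusters, hits):
    """Add model-organism sequences to already-built clusters.

    clusters -- dict mapping cluster id -> set of member sequence ids
    hits     -- iterable of (model_seq, member_seq, evalue) tuples from
                searching model-organism sequences against cluster members
                (assumed to be filtered already)

    Returns a new dict with the same cluster ids. Model sequences with no
    hit to any cluster member are left out; no new clusters are created.
    """
    member_to_cluster = {
        member: cid for cid, members in clusters.items() for member in members
    }

    # Assumed rule: each model sequence joins the cluster of its best hit.
    best = {}
    for model_seq, member_seq, evalue in hits:
        cid = member_to_cluster.get(member_seq)
        if cid is None:
            continue  # hit to a sequence that is not in any cluster
        if model_seq not in best or evalue < best[model_seq][0]:
            best[model_seq] = (evalue, cid)

    result = {cid: set(members) for cid, members in clusters.items()}
    for model_seq, (_, cid) in best.items():
        result[cid].add(model_seq)
    return result
```

Because the set of cluster ids is fixed before the model organisms are examined, a locus that is sequenced only in a model organism never appears as a cluster, which is the dependence on the phylogenetic neighborhood noted above.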