Estimating gene, site, and quartet concordance vectors

Last update: Aug 2, 2024, Contributors: Minh Bui

Estimating gene, site, and quartet concordance vectors

Introduction

This recipe provides a worked example of estimating gene, site, and quartet concordance vectors using IQ-TREE2 and ASTRAL-III, beginning with a set of individual locus alignments. A concordance vector consists of four numbers, which include the concordance factor and three other numbers describing all the discordant trees:

  • Ψ1 (the concordance factor for the branch of interest in the species tree)
  • Ψ2 (the largest of the two discordance factors from a single NNI rearrangement of the branch of interest)
  • Ψ3 (the smallest of the two discordance factors from a single NNI rearrangement of the branch of interest)
  • Ψ4 (the sum of the discordance factors that do not make up Ψ2 and Ψ3; note that site and quartet concordance factors always assume that this number is zero)

Citation: this recipe accompanies the paper “The meaning and measure of concordance factors in phylogenomics” by Rob Lanfear and Matt Hahn. Please cite that paper if you use this recipe. This article also describes concordance vectors in a lot more detail.

What you need

Software

First you need the following software:

  • The latest stable version of IQ-TREE2 for your system: http://www.iqtree.org/
  • ASTRAL-III: https://github.com/smirarab/ASTRAL/releases/latest
  • R, and the tidyverse and boot packages
  • A few R scripts to process lots of output files and produce concordance vectors, tree files, and tables from: https://github.com/roblanf/concordance_vectors

I use conda to install all of these, and suggest you do too. If you want to do that, here’s one way to do it:

# set up a fresh environment and activate it
conda create --name concordance
conda activate concordance

# install what we need for this recipe
conda install -c bioconda iqtree astral-tree
conda install -c conda-forge r-base r-tidyverse r-boot r-ape r-ggtext

# get the R script
wget https://raw.githubusercontent.com/roblanf/concordance_vectors/main/concordance_vector.R
wget https://raw.githubusercontent.com/roblanf/concordance_vectors/main/concordance_table.R
wget https://raw.githubusercontent.com/roblanf/concordance_vectors/main/change_labels.R

After installation, double check that you have the latest versions of both pieces of software. You will need IQ-TREE version 2.3 or above for this.

Note that the R script is itself on GitHub here: https://github.com/roblanf/concordance_vectors/blob/main/concordance_vector.R

Data

Next you need the data. The data for this recipe is 400 randomly selected alignments of intergenic regions from the paper “Complexity of avian evolution revealed by family-level genomes” by Stiller et al. in 2024. Each locus has up to 363 species represented from across the diversity of birds, and to the best of the author’s ability to carefully sequence and filter the data, each locus is also a single-copy orthologue. This means that we can go ahead and use these alignments to estimate gene trees, species trees, and concordance vectors.

You can download these 400 alignments from here: bird_400.tar.gz. You will need to decompress this data using the following command

tar -xzf bird_400.tar.gz

For the sake of reproducibility, you can also create your own set of 400 randomly selected loci from the intergenic regions sequenced for this paper using the following commands:

# Get the data from the paper's supplementary data repository
wget https://erda.ku.dk/archives/341f72708302f1d0c461ad616e783b86/B10K/data_upload/01_alignments_and_gene_trees/intergenic_regions/63430.alns.tar.gz

# decompress it
tar -xvzf 63430.alns.tar.gz

# select 400 random loci, then compress them
mkdir -p bird_400
find 63k_alns/ -type f ! -name '.*' | shuf -n 400 | xargs -I {} mv {} bird_400/ # avoid files that start with '.'
tar -czf bird_400.tar.gz -C bird_400 .

The last set of commands will produce a file just like the one you can download above, with 400 randomly selected loci. Note that you should expect to get a slightly different species tree and concordance factors, because there’s a lot of discordance along the backbone of the species tree of birds, so different groups of 400 loci are highly likely to give different species trees.

Estimating the gene trees

To estimate the gene trees, we’ll use IQ-TREE2. Just set -T to the highest number of threads you have available. This step might take some time (about 3.5 hours with my 128 threads). If you prefer to skip it then you can download the key output files from this analysis here: loci.zip

iqtree2 -S bird_400 --prefix loci -T 128

This analysis will produce output files with lots of information, these include (all in the zip file loci.zip linked above):

  • loci.best_model.nex: the substitution models in nexus format - these have every parameter value for every estimated model, e.g. for one locus, the GTR+F+R5 model has the following entry: GTR{1.07109,5.24905,0.776093,1.28398,3.93731}+F{0.212586,0.260909,0.256934,0.269571}+R5{0.0670095,0.202443,0.44871,0.605428,0.379555,1.13675,0.0791399,2.21831,0.0255857,4.21162}: chr10_1260000_1270000.1k.start299.fasta{17.1377}
  • loci.iqtree: a summary file of the entire analysis with tons of useful information neatly summarised
  • loci.log: the full log file from the run (i.e. everything that was printed to the screen during the run)
  • loci.treefile: the Maximum Likelihood single-locus trees estimated using the best-fit models (these trees are what we really want)

Estimating the species tree

You should estimate your species tree using whatever the best approach is for your data, for example a joint Bayesian analysis using BEAST or *BEAST, a two-step analysis e.g. using ASTRAL, or a concatentated analysis using IQ-TREE or RAxML. You may also have a species tree that has already been estimated elsewhere, and just want to map the concordance vectors onto that. In that case, you can skip this step.

For the purposes of this tutorial, we’ll follow the original paper on bird phylogenomics and use ASTRAL to estimate the species tree from the gene trees we just estimated. Note that in the original paper they collapse some branches in the single-locus trees that have low aLRT (approximate Likelihood Ratio Test: a way of asking whether a branch has a length that differs significantly from zero) scores, but we skip that here for simplicity. This analysis will take just a few minutes.

Here we just calculate the species tree, we’ll calculate concordance vectors and branch support values later

astral -i loci.treefile -o astral_species.tree 2> astral_species.log

This analysis will produce two files. For convenience you can download these here: astral.zip

  • astral_species.tree: the species tree estimated from ASTRAL (this might be quite different to the tree in the paper, because we used only 400 genes, not the full set of more than 63000!)
  • astral_species.log: the log file from ASTRAL

Estimating concordance vectors and support values

Now we want to calculate gene, site, and quartet concordance vectors, and posterior probabilities (support values calculated by ASTRAL) for every branch in our species tree. To do that, we need our species tree (of course); our gene trees (gene and quartet concordance vectors are calculated from these); our alignments (site concordance vectors are calculated from these).

Note that concordance factors and support values apply to branches in trees, not nodes.

Estimate the support and quartet concordance vectors in ASTRAL

We use ASTRAL to calculate quartet concordance vectors and posterior support values (which are calculated from the quartet support values, see below for an explanation of both).

  • -q tells ASTRAL to use a fixed tree topology, we use the species tree we calculated above
  • -t 2 tells ASTRAL to calculate all of the things we need and annotate the tree with them
astral -q astral_species.tree -i loci.treefile -t 2 -o astral_species_annotated.tree 2> astral_species_annotated.log

There are two output files here, which you can download here: astral_annotated.zip

  • astral_species_annotated.tree: the species tree with annotations on every branch
  • astral_species_annotated.log: the log file for ASTRAL

The annotated tree contains a lot of extra information on every branch, e.g.:

[q1=0.9130236794171221;q2=0.04753773093937029;q3=0.03943858964350768;f1=334.1666666666667;f2=17.398809523809526;f3=14.43452380952381;pp1
=1.0;pp2=0.0;pp3=0.0;QC=200178;EN=366.0]

These are explained in detail in the ASTRAL tutorial, but for our purposes we are interested in:

  • q1, q2, and q3: form the quartet concordance vector (ASTRAL calls these ‘quartet frequencies’, ‘normalised quartet frequencies’, and sometimes ‘quartet support values’; we argue in our paper that they are very much not support values)
  • pp1: the ASTRAL posterior probability for a branch (roughly, the probability that q1 is the highest of the three q values)

Estimate the gene and site concordance vectors in IQ-TREE

We use IQ-TREE to calculate gene and site concordance vectors (for more details see the concordance factor page).

In the following command lines:

  • -te tells IQ-TREE to use a fixed input tree (note that the tree we pass with -te differs in the two commands: the latter command uses the tree output by the former command, which sequentially adds to the labels on the tree for convenience)
  • --gcf is the command to calculate the gCF using the gene trees we estimated above
  • -prefix is the prefix for the output files
  • -T is the number of threads (change this to suit your machine)
  • --scfl 100 is the command to calculate the likelihood-based sCF with 100 replicates
  • -p loci.best_model.nex tells IQ-TREE to use the loci from bird_400 and the models we estimated previously when calculating the gene trees (this saves a huge amount of time)
# first calculate the site concordance vectors
iqtree2 -te astral_species_annotated.tree -p loci.best_model.nex --scfl 100 --prefix scfl -T 128

# next calculate the gene concordance vectors
iqtree2 -te scfl.cf.tree --gcf loci.treefile --prefix gcf -T 128

# finally we do a dummy analysis in IQ-TREE. The only point of this is to get the branch lengths in coalescent units 
# from the ASTRAL analysis, in a format that is output by IQ-TREE in a convenient table with IQ-TREE branch ID's 
# note the -blfix option, which keeps the original branch lengths - this makes the scfs meaningless, but is here 
# simply to allow us to extract branch lengths in coalescent units frmo the ASTRAL tree in a convenient table
# we set scfl to 1, which saves time given the scfs are already meaningless, never use the sCFs from this analysis!!!
iqtree2 -te astral_species_annotated.tree -blfix -p loci.best_model.nex --scfl 1 --prefix coalescent_bl -T 128

These three command lines will produce a lot of output files, but the key files are:

  • gcf.cf.stat: a table with the gCF values, as well as gDF1, gDF2, gDFP, and many other things (including all the ASTRAL labels)
  • scfl.cf.stat: the equivalent table for scfl values (including all the ASTRAL labels)
  • coalescent_bl.cf.stat: the dummy table from which we’ll get our coalescent branch lengths
  • gcf.cf.tree: the tree file with lots of annotations about concordance factors (plus all the ASTRAL annotations)
  • gcf.cf.branch: the tree file annotated with branch IDs that match those in the .stat files

You can download these files here: stat_and_tree_files.zip

Each internal branch in gcf.cf.tree will be annotated like this:

'[q1=0.570241231975882;q2=0.17602481596980715;q3=0.25373395205431093;f1=207.567808439221;f2=64.0730330130098;f3=92.3591585477691 7;pp1=1.0;pp2=1.575119358351017E-20;pp3=2.952157354351003E-20;QC=8496;EN=364.0]'/25.4/47.0:0.0055956557

This doesn’t contain all the information for the concordance vectors (see below for that) but it’s still very useful:

  • q1=0.570241231975882: this is the quartet concordance factor
  • 25.4: this is the site concordance factor
  • 47.0: this is the gene concordance factor
  • 0.0055956557 this is the branch length in substitutions per site calculated when we calculated the site concordance factors

View the tree file

One useful thing to do is to look at these labels in the context of your species tree. To do this, you can open the file gcf.cf.tree in a tree viewer like DendroScope. Just load the tree in Dendroscope, specify that the labels are edge labels when you are asked, and that’s it. You can then re-root the tree, change the layout, and zoom in and out to see the edge labels you are interested in. However, the edge labels so far don’t contain the full concordance vectors, so we’ll get those next.

Generate the concordance vectors for each branch

The final step of this tutorial is to get the full gene, site, and quartet concordance vectors.

The information we need to calculate these is in two files: gcf.cf.stat and scfl.cf.stat. These are described above, and you can download them above or here: stat_and_tree_files.zip

We’ll use the R script you downloaded to organise these files into concordance vectors:

Rscript concordance_vector.R

This will produce a file called concordance_vectors.csv (you can download a copy here: concordance_vectors.csv), which has gene, site, and quartet concordance vectors, along with branch lengths in units of substitutions per site and coalescent units, and branch IDs which correspond to the gcf.cf.branch tree file.

The first five rows of your csv file should look something like this:

IDgene_psi1gene_psi2gene_psi3gene_psi4gene_psi1_Ngene_psi2_Ngene_psi3_Ngene_psi4_Ngene_Nsite_psi1site_psi2site_psi3site_psi4site_psi1_Nsite_psi2_Nsite_psi3_Nsite_psi4_Nsite_Nquartet_psi1quartet_psi2quartet_psi3quartet_psi1_Nquartet_psi2_Nquartet_psi3_Nquartet_Nquartet_psi1_ppquartet_psi2_ppquartet_psi3_pplength_subs_per_sitelength_coalescent
364838.57.371.132933026435351.5124.7723.7201867.35898859.8603625.210.840.090.07296.4830.2526.273531000.01373410.0531013
36599.46000.5437000237242.5333.6223.850445.8352.4625001048.26100371.990.0103721000.03765580.201927
36679.0310.226.993.7629438261437218.0767.1714.770103.48384.9684.530572.970.810.110.08300.8641.5129.633721000.01000110.0460157
36797.590.2702.1436410837337.8642.9219.220280.21317.69142.280740.180.990.010.01368.142.742.113731000.02306370.138847
36883.871.881.3412.9312754837230.6143.4625.940227.76323.6193.020744.380.890.060.05332.7221.0418.233721000.01167690.0672212

This table has a lot of columns. For easy reference, here’s a description of every column:

ColumnDescriptionCalculated from
IDBranch IDID gcf.cf.branch file
gene_psi1Ψ1 for genes (%)gCF from gcf.cf.stat
gene_psi2Ψ2 for genes (%)Larger gDF1 and gDF2 from gcf.cf.stat
gene_psi3Ψ3 for genes (%)Smaller gDF1 and gDF2 from gcf.cf.stat
gene_psi4Ψ4 for genes (%)gDFP from gcf.cf.stat
gene_psi1_NΨ1 for genes (count)gCF_N from gcf.cf.stat
gene_psi2_NΨ2 for genes (count)Larger gDF1_N and gDF2_N from gcf.cf.stat
gene_psi3_NΨ3 for genes (count)Smaller gDF1_N and gDF2_N from gcf.cf.stat
gene_psi4_NΨ4 for genes (count)gDFP_N from gcf.cf.stat
gene_NNumber of decisive gene treesgN from gcf.cf.stat
site_psi1Ψ1 for sites (%)sCF from scfl.cf.stat
site_psi2Ψ2 for sites (%)Larger sDF1 and sDF2 from scfl.cf.stat
site_psi3Ψ3 for sites (%)Smaller sDF1 and sDF2 from scfl.cf.stat
site_psi4Ψ4 for sites (%)Always zero by assumption
site_psi1_NΨ1 for sites (count)sCF_N from scfl.cf.stat
site_psi2_NΨ2 for sites (count)Larger sDF1_N and sDF2_N from scfl.cf.stat
site_psi3_NΨ3 for sites (count)Smaller sDF1_N and sDF2_N from scfl.cf.stat
site_psi4_NΨ4 for sites (count)Always zero by assumption
site_NNumber of decisive sitessN from scfl.cf.stat
quartet_psi1Ψ1 for quartets (%)q1 from scfl.cf.stat ‘Label’ column (calculated in ASTRAL)
quartet_psi2Ψ2 for quartets (%)Larger q2 and q3 from scfl.cf.stat ‘Label’ column (calculated in ASTRAL)
quartet_psi3Ψ3 for quartets (%)Smaller q2 and q3 from scfl.cf.stat ‘Label’ column (calculated in ASTRAL)
quartet_psi4Ψ4 for quartets (%)Always zero by assumption
quartet_psi1_NΨ1 for quartets (count)f1 from scfl.cf.stat ‘Label’ column (calculated in ASTRAL)
quartet_psi2_NΨ2 for quartets (count)Larger f2 and f3 from scfl.cf.stat ‘Label’ column (calculated in ASTRAL)
quartet_psi3_NΨ3 for quartets (count)Smaller f2 and f3 from scfl.cf.stat ‘Label’ column (calculated in ASTRAL)
quartet_psi4_NΨ4 for quartets (count)Always zero by assumption
quartet_NEffective number of gene treesEN from scfl.cf.stat ‘Label’ column (calculated in ASTRAL)
quartet_psi1_ppASTRAL posterior probability for Ψ1pp1 from scfl.cf.stat ‘Label’ column (calculated in ASTRAL)
quartet_psi2_ppASTRAL posterior probability for Ψ2Larger pp2 and pp3 from scfl.cf.stat ‘Label’ column (calculated in ASTRAL)
quartet_psi3_ppASTRAL posterior probability for Ψ3Smaller pp2 and pp3 from scfl.cf.stat ‘Label’ column (calculated in ASTRAL)
length_subs_per_sitebranch length in substitutions per siteCalculated in IQ-TREE using a concatenated analysis
length_coalescentbranch length in coalescent unitsCalcualted in ASTRAL from the quartet concordance vector

Put concordance factors (or other numbers!) on a tree

A common aim is to annotate your tree with the statistics you are interested in. The output tree above has rather unwieldy labels on each branch like this:

'[q1=0.570241231975882;q2=0.17602481596980715;q3=0.25373395205431093;f1=207.567808439221;f2=64.0730330130098;f3=92.3591585477691 7;pp1=1.0;pp2=1.575119358351017E-20;pp3=2.952157354351003E-20;QC=8496;EN=364.0]'/25.4/47.0:0.0055956557

But we can use the tree with branch IDs to put any label on a tree. An example is in the change_labels.R script. As written, this script just updates the branch ID labels in the gcf.cf.branch tree to show the ID and the three concordance factors (the Ψ1 values), each labelled with the first letter of the input data (i.e. g for genes, s for sites, and q for quartets). You can run this script like so:

Rscript change_labels.R

This will output a nexus-formatted tree file called id_gcf_scf_qcf.nex. Each branch on this tree is labelled as follows:

391-g98.54-s84.09-q98.54

The first number is the branch ID, and the next three are the three concordance factors. This can be useful for exploring your data. For example, if you look at the part of the species tree we inferred in this recipe that groups the kiwis (genus Apteryx), you can see that there is a lot of concordance in this part of the tree:

kiwis

The concordance factors tell you a certain amount, but to understand things better, you really need to examine the concordance vectors.

If you want to put different labels on your tree, that is relatively simple to do by editing the change_labels.R script, which you can get from GitHub here: https://github.com/roblanf/concordance_vectors/blob/main/change_labels.R

Generate concordance tables for branches of interest

A concordance table is just a table of the three concordance vectors, as shown in the Lanfear and Hahn paper. The concordance_table.R script lets you generate a concordance table for any branch, based on the branch ID. Here we’ll do that for two branches that were recovered in the original Nature paper, discussed in Lanfear and Hahn, and also recovered in the ASTRAL tree we estimated here from 400 loci (I found the branch IDs for these branches by studying the tree labelled with branch IDs that I made above):

  • Branch 598: the Palaeognathae (kiwis and other cool birds)
  • Branch 545: the Telluraves (passerines and other closely related groups)

The concordance_table.R script takes two input variables:

  • the concordance_vectors.csv file we generated above
  • the branch ID

So to get the tables for our two branches, we run it once for each as follows:

Rscript concordance_table.R concordance_vectors.csv 598
Rscript concordance_table.R concordance_vectors.csv 545

The output includes 2 files for each run:

  • a PDF of the table, e.g. concordance_table_598.pdf
  • a CSV file of the table, e.g. concordance_table_598.csv

The CSV looks like this (using Telluraves as an example):

typepsivaluelower_CIupper_CI
gene15.603.5623417.888041
gene20.250.0000000.769720
gene30.000.0000000.000000
gene494.1591.85750696.183206
site145.6342.51889148.434787
site218.7016.49021020.934787
site316.0013.71063018.320910
site419.6717.38302521.849787
site50.000.0000000.000000
site60.000.0000000.000000
site70.000.0000000.000000
site80.000.0000000.000000
site90.000.0000000.000000
site100.000.0000000.000000

And the PDF looks like this:

concordance_table_545_telluraves

Clearly there is substantial discordance around this branch! Compare this to the palaeognathae, which have far less discordance:

concordance_table_598_palaeognathae

These tables allow you to quickly dig into the concordance vectors on any given branch in your tree.

Confidence intervals on the concordance vectors

You’ll notice that the tables include 95% confidence intervals for the concordance and discordance factors. These are calculated using 1000 bootstraps of the count data, and provide useful context for interpreting the values, and particularly for interpreting potential differences in the values.

These bootstrap confidence intervals are calculated by resampling from the counts for each concordance vector. The total sample size for each category (genes, sites, and quartets) is shown on the table underneath the y axis label. Note that for sites and quartets, the counts are not always whole numbers because of how they are calculated. This can also mean that the bootstrap confidence intervals can be a little off for very low counts, because the numbers have to be rounded to integers in order to calculate them.