Ktreedist

Calculation of the minimum branch length distance (K tree score) between phylogenetic trees

Version 1.0 || June 2007
Copyright © Víctor Soria-Carrasco & Jose Castresana

Institut de Biologia Evolutiva (CSIC-UPF)
Passeig Marítim de la Barceloneta 37, 08003 Barcelona, Spain


Ktreedist is a computer program written in Perl scripting language that calculates the minimum branch length distance (or K tree score) from one phylogenetic tree to another. The K tree score provides a measure of the difference in both topology and branch lengths between two trees after scaling one of them to have a global divergence as similar as possible to the other tree.

The program should run in any UNIX / UNIX-like platform (Linux, Mac OS X, etc.) or Windows with a Perl interpreter installed.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation. This program is distributed in the hope that it will be useful, but without guarantee of support or maintenance.


Contents


Installation

The only requirement is to have a Perl 5.8.x interpreter installed in your operating system.

This program is distributed as a compressed tar file (.tar.gz), so you will need to unpack it and uncompress it. On UNIX / UNIX-like environments execute:

tar xvfz Ktreedist_v1.tar.gz
On Windows you can use a tool like 7-zip to do it.

You will get the program Ktreedist.pl, a folder with documentation and a folder with several example files.

Quick start

You can calculate the K tree score from one tree to another with the command:

Ktreedist.pl -rt [reference tree FILE] -ct [comparison tree/s FILE]

For example, you can execute from the '/examples/small/nexus/' directory:

Ktreedist.pl -rt A_reference.tree -ct B_comparison.tree

In UNIX systems, if you have not made the file Ktreedist.pl executable, you must write "perl" before the program name. In any case, you should get something like this:

Ktreedist version 1.0 - June 2007
Calculation of the minimum branch length distance (K tree score) between trees

Checkings:
Line break... OK (UNIX & UNIX)
File format... OK (nexus with taxa block & nexus with taxa block)
Number of reference trees... OK
Checking branch lengths... OK
Tips in trees... OK
All trees rooted or all unrooted... OK

K tree score calculation:
'A_reference.tree' vs 'B_comparison.tree' (4 trees)
Reference tree (set A) is rooted and has 7 tips and 12 partitions.
Number of partitions on comparison tree/s are listed below.
---------------------------------------------
  Comparison_tree    K-score   Scale_factor  
---------------------------------------------
  set B              0.12076      1.02781   
  set E              0.23592      0.46370   
  set H              0.09170      0.87013   
  set X              0.26704      0.31108   
---------------------------------------------

In this example, the comparison tree/s file contained four trees, and the K score is calculated from the reference tree to each comparison tree. The first column of the output contains the names of the comparison trees (in case of using the nexus format) or the order number of these trees (in case of using the newick format). The second column contains the minimum branch length distance or K tree score. The third column contains the factor 'K' by which the corresponding tree was scaled.

How the method works

In order to calculate the K tree score, the program works in two steps. First, using a novel approach, the program calculates the scale factor that approximates as much as possible the global divergence of a comparison tree to that of a reference tree; then it scales the comparison tree by this factor K. Second, the program calculates the branch length distance (BLD) between the scaled comparison tree and the reference tree; thus the K tree score is the minimum branch length distance you can get from one tree to another after scaling one of them. The BLD was introduced by Kuhner and Felsenstein in 1994 (Mol Biol Evol 11:459-468; see also the book of Felsenstein, 2004), and it is implemented in the Phylip package, specifically in the treedist program (to which we added the 'K' to name our program). However, the BLD depends on the absolute size of the trees being compared and thus it cannot be directly used with trees of different substitution rates (Kuhner and Felsenstein, 1994). The K tree score tries to overcome this problem by introducing the K scale factor. On the other hand, it should be taken into account that the K tree score is not symmetric, that is, the result from A to B may not be the same than from B to A, and, in consequence, the K score is not properly a distance. Be aware that the results depend on the order of the input files.

Available options

If you type Ktreedist.pl -h or simply Ktreedist.pl in the command line, help will be displayed:

Usage:
 Ktreedist.pl -rt <reference tree file> -ct <comparison tree/s file> [<options>]
    Options:
        -t [<filename>]  Output file for table of results
        -p [<filename>]  Output file for table of partitions
        -s [<filename>]  Output file for comparison tree/s after scaling
        -r  Output symmetric difference (Robinson-Foulds)
        -n  Output number of partitions in the comparison tree/s
        -a  Equivalent to all options

Options can be entered in any order, before, between or after specifying the reference or comparison files.

-rt <reference tree file>

'-rt' stands for reference tree. This is the file that contains the tree to which you want to compare the comparison tree/s. This file can be in newick or nexus format.

-ct <comparison tree/s file>

'-ct' stands for comparison tree/s. This is the file that contains the tree or the set of trees you want to compare to the reference tree. They will be scaled to match as much as possible the reference tree. This file can be in newick or nexus format.

Options:

-t [<filename>]  Output file for table of results

A file containing a table of results is generated. If no filename is given the program will create a file with the name of the comparison tree/s and the extension .tab. This file can be easily imported to any spreadsheet. It has the following structure:

Trees
K-score
Scale_factor
Symm_difference
N_partitions_ref_tree
N_partitions_comp_tree
set A (ref) - set B
0.12076
1.02781
2
12
12
set A (ref) - set E
0.23592
0.46370
2
12
12
set A (ref) - set H
0.09170
0.87013
0
12
12
set A (ref) - set X
0.26704
0.31108
5
12
11

The 'Symm_difference' column is generated only with the option -r. The 'N_partitions' columns are generated only with the option -n (see below).

-p [<filename>]  Output file for table of partitions

A file containing a table of partitions for each comparison tree is generated. If no filename is given the program will create a file with the name of the comparison tree/s and the extension .part. The file has the following structure (for one of the comparison trees):

set X
Brl_ref_tree
Brl_comp_tree
Brl_comp_tree_K
Partition
0.02122
NONE
NONE
(sp1, sp2, sp3, sp4, sp5, sp6) / (sp7, root)
0.03865
NONE
NONE
(sp1, sp2, sp3, sp4) / (sp5, sp6, sp7, root)
0.03156
0.03865
0.01202
(sp2, sp3, sp4) / (sp1, sp5, sp6, sp7, root)
0.03840
NONE
NONE
(sp2, sp3) / (sp1, sp4, sp5, sp6, sp7, root)
0.14233
0.14233
0.04427
(sp5, sp6) / (sp1, sp2, sp3, sp4, sp7, root)
0.10262
0.10262
0.03192
(sp1) / (sp2, sp3, sp4, sp5, sp6, sp7, root)
0.02572
0.02572
0.00800
(sp2) / (sp1, sp3, sp4, sp5, sp6, sp7, root)
0.03914
0.03914
0.01218
(sp3) / (sp1, sp2, sp4, sp5, sp6, sp7, root)
0.10829
0.10829
0.03369
(sp4) / (sp1, sp2, sp3, sp5, sp6, sp7, root)
0.04519
0.04519
0.01406
(sp5) / (sp1, sp2, sp3, sp4, sp6, sp7, root)
0.05906
0.05906
0.01837
(sp6) / (sp1, sp2, sp3, sp4, sp5, sp7, root)
0.22260
0.24382
0.07585
(sp7) / (sp1, sp2, sp3, sp4, sp5, sp6, root)
NONE
0.35872
0.11159
(sp2, sp3, sp4, sp5, sp6) / (sp7, sp1, root)
NONE
0.31628
0.09839
(sp7, sp1) / (sp2, sp3, sp4, sp5, sp6, root)

Meaning of each column:
Brl_ref_tree: Branch length of this partition on the reference tree.
Brl_comp_tree: Branch length of this partition on the original comparison tree.
Brl_comp_tree_K: Branch length of this partition on the comparison tree after scaling.
Partition: Tip nodes that constitute this partition.

For rooted trees, clades that encompass the root include this node together with the tip nodes.

-s [<filename>]  Output file for comparison tree/s after scaling

A file containing the scaled comparison tree/s is generated. If no filename is given the program will create a file with the name of the comparison tree/s and the extension .scaled. The format of this file (newick or nexus) will be the same than that of the comparison file.

-r Output symmetric difference (Robinson-Foulds)

The symmetric difference is displayed in a column called 'Symm_difference' and, if the option -t is invoked, also in the table of results. The symmetric difference is defined as the number of partitions that are not shared between two trees, that is, the number of partitions of the first tree that are not present in the second tree plus the number of partitions of the second tree that are not present in the first tree. This distance may not be an even number if there are polytomies. Partitions with zero branch length (soft polytomies) count for the symmetric difference calculation by default, but there is a hidden option (see program code) to collapse such partitions for the symmetric difference calculation. The number of partitions (option -n) are counted accordingly. The K tree score is the same with or without collapsing soft polytomies.

-n Output number of partitions in the comparison tree/s

The number of partitions in the comparison tree/s are displayed in a column called 'N_partitions' and, if the option -t is invoked, also in the table of results (in this case, together with the number of partitions in the reference tree). The knowledge of the number of partitions may be useful to detect trees with polytomies.

-a Equivalent to all options

All previous options are enabled.

File formats

Input files can be in newick or nexus format. The nexus format can have a taxa block or not.

Trees may have bootstrap values. They will be read and ignored for the calculations, but written back to the scaled trees file.

The type of line break is also considered. The output files will be returned with the same type of line break present in the comparison file. Be careful with this, specially if you are working with input files with a line break different of that of your OS (ie. UNIX line break in a Windows system): you may have problems when postprocessing output files from this program.

The program is supposed to run with one reference tree and one or several comparison trees. If the reference file contains more than one tree, only the first one will be used.

Trees must be all rooted or all unrooted.

Citation