Phylogeny Tutorial¶

Course: Applied Bioinformatics (BBT045)

Teacher: Vi Varga

Last Updated: 24.02.2025

Introduction¶

In this exercise you will practice your newly acquired skills in phylogenetic inference and "tree thinking", by analyzing the evolutionary history of a gene family.

You will (roughly) follow the phylogenetic analysis workflow discussed during the lecture, starting with collecting the data necessary for running the analysis (in the form of homologous protein sequences from different species), through the interpretation of results (i.e., comparing species and gene trees).

For this tutorial, you will be working with a protein from Trichomonas vaginalis. T. vaginalis is a parasitic protist that causes the sexually transmitted infection trichomoniasis. While infection is generally asymptomatic, complications can be as severe as infertility or sterility. This particular protist belongs to the eukaryotic supergroup Metamonada, which comprises 4 primary phyla (Anaeramoebidae, Parabasalia, Preaxostyla and Fornicata) and contains a wide variety of parasites, commensals, and free-living organisms.

Please download this tutorial as a Jupyter Notebook from here, using, for example, wget.

Setting up the environment¶

(BONUS) Creating the `conda` environment & container¶

In this section, I will describe how the container and conda environment were created. *YOU DO NOT NEED TO REPLICATE THIS SECTION - IT IS FOR YOUR INFORMATION ONLY!!!*

The container that you will be using to run this Jupyter Notebook already exists, and the next section will contain an explanation for how to use it.

This section is for informational purposes only!

Before you start any new project, it's a good idea to set up a new conda environment in which you can install software for use on the command line. In general, it is a good idea to get into the habit of using conda environments or containers, and activating the relevant environment directly after opening the terminal. This way, you won't accidentally try to run software that isn't installed in your base conda environment.

It's also good practice to double-check the environment you have activated prior to installing any new software with conda. Un-installing programs that you've accidentally installed takes much, much, *much* longer than you'd expect! (Think, "need to leave the computer running overnight" kinds of situations. (。_。) )

When should you create a new conda environment? Generally, it's a good idea to have a dedicated conda environment for any project you're working on. That way, at the end of the project, you can quickly gather the version numbers of all the programs you used. However, you may at times need to create environments for specific programs - this is particularly common for older programs that may require older-than-standard versions of some dependency packages, particularly programming language versions (ex.: programs written with Python 2.X will not be compatible with a conda environment running Python 3.X, and vice versa).

*Reminder: YOU DO NOT NEED TO REPLICATE THE FOLLOWING STEPS!!!*

You can create a conda environment like so:

# create the environment on the command line
conda create -n phylo-tutorial-env-2025
# feel free to name your environments whatever you like
# just try to make sure your name is descriptive, so you can remember what it was for
# You'll see a lot of scrolling text, and then need to confirm creation of the environment with "y"
# once setup is done, activate the environment: 
conda activate phylo-tutorial-env-2025
# you can deactivate the environment later with: 
conda deactivate
# remember not to use that above command in your base conda environment!

With the conda environment set up, you can install the relevant software. Below, I demonstrate the software installations I used while setting up this tutorial. Note that in some places I use mamba instead of conda. The mamba program is basically an ultra-fast version of conda (it needs to be separately installed). It's much quicker and more effective at resolving dependencies when installing programs.

# install programs
# start with Python- and Jupyter-related programs
mamba install python matplotlib numpy scipy pandas seaborn jupyterlab jupyter biopython
# install bioinformatics tools used in the pipeline
mamba install -c bioconda mafft iqtree trimal cialign alen

I then export the conda environment to a file, and build a container from it:

# make sure you are in your base conda environment for this
# creating a YAML file of the conda environment
conda env export -n phylo-tutorial-env-2025 > phylo-tutorial-env-2025.yml
# using Apptainer to build the conda environment in a container
apptainer build --build-arg ENV_FILE=phylo-tutorial-env-2025.yml phylo-tutorial-env-2025.sif conda_environment_args_ubuntu.def
# run the Jupyter Notebook from the container from the command line like so:
apptainer exec phylo-tutorial-env-2025.sif jupyter lab

(Mandatory) Running the container¶

From this point on, the information in this tutorial actually should be followed. Here, you'll find instructions on how to use the container created for this tutorial and exercise, by using the phylogeny-jupyter.sh runtime script located at /cephyr/NOBACKUP/groups/bbt045_2025/Resources/Phylogeny/.

Similarly to what you did for the Sequencing Technologies tutorial, you need to copy the phylogeny-jupyter.sh runtime script to your ~/portal/jupyter/ directory. You should then run this tutorial (and the associated exercise) in this runtime environment, using Jupyter via the Vera OnDemand platform. Note that it might be worth requesting multiple cores (2-4), in particular to allow parallelization of the IQ-TREE program execution.

Setting up your directory system¶

Please make a directory on Vera in which you will store your files for this tutorial, as well as the exercise to follow. You would be surprised how swiftly the number of files you're using gets out of hand, so try to develop good habits from the beginning! For example, it's good practice to have a bin/ directory in your home directory, where you store executable files and the like for programs that you cannot simply install via conda.

# in the directory where you have your files for Applied Bioinformatics 
# for ex.: a directory named AppliedBioinfo/
mkdir PhyloWorkflow/
cd PhyloWorkflow/

Above, I used my personal preferred naming convention, but feel free to use whatever file names you wish, as long as they're descriptive.

Obtaining & Exploring Data¶

Data Location¶

All data files that you will need to run this tutorial can be found in the /cephyr/NOBACKUP/groups/bbt045_2025/Resources/Phylogeny/ directory. Materials for the tutorial can be found in the Tutorial/ subdirectory, while materials for the homework can be found in the Homework/ subdirectory.

Please copy the files necessary to run the tutorial and exercise to your working directory now, like so:

cp -r /cephyr/NOBACKUP/groups/bbt045_2025/Resources/Phylogeny/* PATH/TO/YOUR/PhyloWorkflow/

Preliminary data exploration¶

For this tutorial, we will be using the XP_001322682.1__Tvag.fasta file. Take a look!

What kind of FASTA file is it?

What is the protein ID?

This is a RefSeq protein, which means it is considered good quality. To quote the NCBI, the RefSeq database is a "comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein." (Source: https://www.ncbi.nlm.nih.gov/refseq/)
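
If you would rather inspect the file programmatically than by eye, the following is a minimal sketch in plain Python. The function names (`read_fasta`, `sequence_id`, `looks_like_protein`) are made up for this illustration, and the nucleotide/protein check is only a rough heuristic, not a definitive test.

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sequence_id(header):
    """By convention, the ID is the header text up to the first whitespace."""
    return header.split()[0]


def looks_like_protein(sequence):
    """Heuristic (assumption): any letter outside A, C, G, T, U, N suggests protein."""
    nucleotide_letters = set("ACGTUN")
    return any(residue not in nucleotide_letters for residue in sequence.upper())
```

Calling `read_fasta("XP_001322682.1__Tvag.fasta")` and passing each header to `sequence_id()` and each sequence to `looks_like_protein()` should be enough to answer the two questions above.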

In T. vaginalis, XP_001322682.1 is predicted to function in the cytosol, though paralogs of the protein in T. vaginalis are known to function in the mitochondrion (Smutná et al. 2022)[^1].

[^1]: Technically speaking, T. vaginalis doesn't actually have a mitochondrion, per se. T. vaginalis and all other organisms within supergroup Metamonada have "Mitochondrion-Related Organelles" (MROs), which are extremely functionally reduced mitochondria. Monocercomonoides exilis, a member of supergroup Metamonada, is actually the only known eukaryote to completely lack a mitochondrion!

Finding homologous sequences¶

In order to find homologs of this gene to use in a gene tree, we're going to use NCBI BLAST. While command-line BLAST technically exists, overall it is much simpler and faster to use the web browser version. This is particularly the case as the conda installation of NCBI BLAST doesn't work very well.

What type of BLAST do we need to run? Explain your reasoning.

From the NCBI BLAST homepage (https://blast.ncbi.nlm.nih.gov/Blast.cgi) select the appropriate BLAST algorithm.

You can BLAST the sequence in one of two ways: either you can copy the sequence into the search box, or you can use the protein ID (XP_001322682.1). Since this is the name of the protein in the NCBI database, it is possible to search for BLAST hits using only the protein ID. If this were a protein from an organism you sequenced yourself, with no official name in the NCBI database, you would only have the option of performing a sequence-based BLAST.

Go ahead and BLAST the gene, and maybe take a short coffee break. BLAST can sometimes take a minute.

ヾ(＠⌒ー⌒＠)ノ

The first hit should be our protein - go ahead and check! Do you notice anything about the quality of the BLAST hits?

In the light green bar above the search results labeled "Sequences producing significant alignments" you'll find a "Download" drop-down menu. Select "FASTA (complete sequence)" and you'll download a file named seqdump.txt. Rename this file to something meaningful (ex.: XP_001322682.1__MSAprep.fasta) and move it to your Tutorial/ working directory. You can do this using copy/paste into nano, an SSH file transfer program like FileZilla, or scp (scp XP_001322682.1__MSAprep.fasta CID@vera1.c3se.chalmers.se:/FULL/PATH/TO/WORKING/DIRECTORY where you can fill in your username and the path to your working directory).

How many sequences are in the file?
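
One way to answer this is to count the header lines, since every FASTA record starts with exactly one line beginning with `>`. A minimal sketch (the function name `count_fasta_records` is made up for this example):

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, `count_fasta_records("XP_001322682.1__MSAprep.fasta")` returns the number of sequences you downloaded. On the command line, `grep -c "^>"` followed by the file name does the same job.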

Cleaning the data¶

Generally, when you compile protein sequences to use in an analysis, you want to clean the data in some way. A few common data transformations include: capitalizing all letters in sequence lines, editing header lines, removing non-standard characters from sequence lines, and conversion between multi-line and single-line FASTA format.

Why might it be important to clean the data in this way?
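
To make these transformations concrete, here is a minimal sketch of what such a cleaning step could look like. This is NOT the provided clean_MSA_seqs.py script: the function names, the set of allowed residues, and the header-cleaning rule are all assumptions chosen for illustration.

```python
import re

# Assumption: keep the 20 standard amino acids plus X (unknown residue)
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYX")


def clean_header(header):
    """Keep only the sequence ID and replace characters that may confuse other tools."""
    parts = header.split()
    seq_id = parts[0] if parts else "unnamed"
    return re.sub(r"[^A-Za-z0-9_.-]", "_", seq_id)


def clean_sequence(sequence):
    """Capitalize the sequence and drop any non-standard characters."""
    return "".join(res for res in sequence.upper() if res in ALLOWED_RESIDUES)


def format_record(header, chunks):
    """Write one record in single-line FASTA format."""
    return ">" + clean_header(header) + "\n" + clean_sequence("".join(chunks)) + "\n"


def clean_fasta_text(text):
    """Convert (multi-line) FASTA text to cleaned, single-line FASTA text."""
    cleaned = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                cleaned.append(format_record(header, chunks))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        cleaned.append(format_record(header, chunks))
    return "".join(cleaned)
```

Note that dropping characters changes sequence lengths, so a step like this belongs before the alignment, never after it.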

For the sake of time, I've prepared a cleaning script using Python (clean_MSA_seqs.py). You can use it from the command line like so:

# model usage: 
python clean_MSA_seqs.py input_fasta

In [ ]:

# Adapt the command above to your file names!
# remember that this cell will assume you are running Python commands - you need to tell it you're running something on the command line

I tend to comment my code pretty thoroughly, so it should be quite readable, but let me know if you have any questions. If you're up for a challenge, see if you can write something like this yourself! (But finish the tutorial first - come back to this later if you still have the time.)

Compare the original and cleaned files. Do they contain the same number of sequences? The same number of characters? What changed, if anything? Why do you think this is?

Since our sequences are now cleaned, we can move on to generating the alignment.

Multiple Sequence Alignment¶

Generating the MSA¶

In order to generate the MSA, we're going to be using the MAFFT software. You can find more information on this program at the links below:

Homepage: https://mafft.cbrc.jp/alignment/software/
Manual page: https://mafft.cbrc.jp/alignment/software/manual/manual.html

Take a look at the MAFFT manual page. Which algorithm do you think best suits our purposes? Why?

Now, go ahead and create the MSA:

In [ ]:

! mafft --localpair --maxiterate 1000 --amino XP_001322682.1__MSAprep_CLEAN.fasta > XP_001322682.1__MSA.fasta
# Adapt the command above to your file names!
# `--localpair --maxiterate 1000` tells MAFFT to use the L-INS-i algorithm
# `--amino` tells MAFFT that the input is a protein FASTA
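
A quick sanity check after any alignment run is to confirm that every row of the MSA has the same length, since the aligner pads sequences with gap characters. The sketch below assumes a FASTA-formatted alignment; the function names are made up for this example.

```python
def read_fasta_dict(path):
    """Read a FASTA file into a dict mapping sequence ID to sequence."""
    sequences = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                sequences[name] = ""
            elif line and name is not None:
                sequences[name] += line
    return sequences


def check_alignment(sequences):
    """Return (number of rows, alignment length); raise if rows differ in length."""
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) != 1:
        raise ValueError(f"Expected one row length, found: {sorted(lengths)}")
    return len(sequences), lengths.pop()
```

Running `check_alignment(read_fasta_dict("XP_001322682.1__MSA.fasta"))` should return the number of sequences and the number of alignment columns. An error here usually means the file is not an alignment at all, for instance because the unaligned input was passed by mistake.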

Viewing the MSA¶

There are many different tools that you can use to view an MSA. I've provided a few examples below:

Web-based tools:
- As with most aspects of bioinformatics, there are tools available on the web in order to view MSAs. As is often the case with web-based programs, though, their scope is rather limited (especially for the tree programs).
- EMBL-EBI MView: EMBL-EBI provides a web-based tool where you can upload an MSA, and see the results. Access it from here: https://www.ebi.ac.uk/jdispatcher/msa/mview
- NCBI Multiple Sequence Alignment Viewer: The NCBI provides a web-based MSA viewer, which you can access from here: https://www.ncbi.nlm.nih.gov/projects/msaviewer/
Stand-alone software:
- Software designed for phylogenomics analysis provides far more flexibility than web-based tools, though this of course comes with the trade-off of requiring installation, and taking up space on your hard drive.
- AliView: (My personal favorite) This program from Uppsala University provides smooth viewing and editing of MSAs. Find more information on it here: https://ormbunkar.se/aliview/
- MEGA-11: The MEGA software suite allows a huge range of phylogenomics analysis tools. You can create MSAs, edit alignments, visualize phylogenetic trees, perform bootstrap testing... All from within a GUI window! Find it here: https://www.megasoftware.net/
- CIAlign: This software suite works from the command line, and is installed in the phylogeny container. It can be used to view (portions of) and edit MSAs.
  - See the GitHub page here: https://github.com/KatyBrown/CIAlign
  - See the documentation here: https://cialign.readthedocs.io/en/latest/
- alen: A simple command-line MSA viewer, installed in the container. See the GitHub page here: https://github.com/jakobnissen/alen

For now, please use alen on Vera! The command is interactive, so you'll need to run the following in a new terminal you open in Jupyter Lab, rather than from a cell in your Jupyter Notebook:

alen XP_001322682.1__MSA.fasta

Do you notice any patterns?
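
One pattern worth looking for is sequences that consist mostly of gaps, which often indicates fragmentary or poorly matching hits. As a rough sketch, the fraction of gap characters per sequence can be computed as below. The input is assumed to be a dict mapping sequence names to aligned sequences, and both the function names and the 0.5 threshold are arbitrary choices for illustration.

```python
def gap_fraction_per_sequence(alignment):
    """Map each sequence name to the fraction of its alignment columns that are gaps."""
    return {
        name: seq.count("-") / len(seq)
        for name, seq in alignment.items()
        if seq  # skip empty rows to avoid dividing by zero
    }


def mostly_gap_sequences(alignment, threshold=0.5):
    """Names of sequences whose gap fraction exceeds the (assumed) threshold."""
    fractions = gap_fraction_per_sequence(alignment)
    return sorted(name for name, frac in fractions.items() if frac > threshold)


# Toy alignment, invented for this example
toy = {"seqA": "MK--", "seqB": "MKVL", "seqC": "---L"}
print(gap_fraction_per_sequence(toy))
print(mostly_gap_sequences(toy))
```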

Creating the tree¶

We will be using the IQ-TREE software to generate the gene tree. You can find more information about this program at the links below:

Homepage: http://www.iqtree.org/
Manual: http://www.iqtree.org/doc/iqtree-doc.pdf

Go ahead and run the command you see below - there will be a lot of text printed to the screen, but don't worry about redirecting it to a file to look at later, because all of it will also be printed to the log file generated automatically by IQ-TREE.

Note the -ntmax 8 flag and its meaning: as written, the command below assumes you have access to 8 cores. If you started the Vera OnDemand session with the default of 1 core, I recommend you remove that flag and change -T AUTO to -T 1.

This will take a few minutes. So feel free to grab a coffee, take a short break!

♪(^∇^*)

Once you're ready, feel free to read the little chunk of text below the code block here - it'll provide a little more information on this type of analysis.

In [ ]:

! iqtree -s XP_001322682.1__MSA.fasta --prefix XP_001322682.1__MSA_IQ -m LG+I+R5 -seed 12345 -wbtl -T AUTO -ntmax 8
# Adapt the command above to your file names!
# -s: the alignment file; IQ-TREE always requires this option.
# --prefix: the prefix used for the names of all output files.
# -m: the substitution model to use during the analysis.
# Here, the model to use has been pre-selected: LG+I+R5
# (The special MFP keyword stands for ModelFinder Plus; -m MFP would instead tell IQ-TREE
# to perform ModelFinder and run the remaining analysis using the selected model.)
# -seed: a random number generator seed, needed to make the run reproducible.
# -wbtl: like -wbt, but bootstrap trees are written with branch lengths. DEFAULT: OFF
# -T AUTO: allows IQ-TREE to auto-select the ideal number of threads
# -ntmax: sets the maximum number of threads that IQ-TREE can use

Note that a typical tree-finding process takes quite a bit longer than what you did here. IQ-TREE has a specific argument, -m MFP, that calls a process called ModelFinder Plus, which tests many, many different substitution models and finds the one that best fits the data. (Don't worry about what these models are - that's beyond the scope of this class. Suffice it to say, it's complicated statistics.) I ran this analysis with -m MFP while preparing this exercise, and even for such a small dataset (only 100 sequences), the process took roughly 2 hours! Clearly, not something we could all do together in class. Tree finding is a complex, computationally demanding process, but it is a crucial part of phylogenetic reconstruction, and not the step where you should try to spare CPU hours.
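
If you are unsure how many cores your session has, the threading advice given before the IQ-TREE command can be sketched in Python as follows. The function names are made up for this example, and the sketch assumes that the cores visible to the Python process match the cores allocated to your Vera OnDemand session, which may not hold on every system.

```python
import os


def available_cores():
    """Cores usable by this process (falls back to the machine total outside Linux)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def iqtree_thread_args(cores):
    """Threading flags for IQ-TREE, following the advice given for the command above."""
    if cores <= 1:
        # Single core: no point in letting IQ-TREE search for a thread count
        return ["-T", "1"]
    return ["-T", "AUTO", "-ntmax", str(cores)]


print(" ".join(iqtree_thread_args(available_cores())))
```

The printed flags can be pasted in place of `-T AUTO -ntmax 8` in the command above.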

Visualizing Trees¶

There are many different tools that you can use to visualize a phylogenetic tree. I've provided a few examples below:

Web-based tools:
- As with most aspects of bioinformatics, there are tools available on the web in order to visualize phylogenetic trees. As is often the case with web-based programs, though, their scope is rather limited (especially for the tree programs).
- ETE Toolkit: The ETE Toolkit is available as a Python package, but they also have a web server where you can visualize your trees, here: http://etetoolkit.org/treeview/
- NCBI Tree Viewer: The NCBI provides a web-based phylogenetic tree viewer, which you can access from here: https://www.ncbi.nlm.nih.gov/tools/treeviewer/
Stand-alone software:
- Software designed for phylogenomics analysis provides far more flexibility than web-based tools, though this of course comes with the trade-off of requiring installation, and taking up space on your hard drive.
- FigTree: (My personal favorite) This program allows you to open trees and edit components of their visualization, before exporting them in a variety of file types (PNG, JPEG, SVG, etc.). It's a Java-based application, so if you have Java installed on your computer, no further installation processes will be necessary to open FigTree. Find it here: http://tree.bio.ed.ac.uk/software/figtree/
- MEGA-11: The MEGA software suite allows a huge range of phylogenomics analysis tools. You can create MSAs, edit alignments, visualize phylogenetic trees, perform bootstrap testing... All from within a GUI window! Find it here: https://www.megasoftware.net/
Packages built for bioinformaticians:
- There are plenty of packages/libraries available for the visualization of phylogenetic trees, built to work with the programming languages most used by bioinformaticians: Python and R. These editing tools have a higher learning curve, since you need to code to change aspects of the tree, but they also allow far more flexibility than either web-based tools or stand-alone software.
- Python:
  - The ETE Toolkit (mentioned above) is actually primarily a Python package. Find it here: http://etetoolkit.org/
  - Biopython is a whole suite of Python packages for bioinformatics analysis, so of course, they have their own package for working with phylogenetic trees, Phylo. Find it here: https://biopython.org/wiki/Phylo
- R (this course doesn't use R, but it's a fantastic language for visualization):
  - The ape library in R can be used to visualize and edit phylogenetic trees. It can be installed the usual way (install.packages("ape")). The creators of the package have provided a great manual, which you can find here: https://cran.r-project.org/web/packages/ape/ape.pdf
  - The ggtree library is distributed through Bioconductor, which provides a suite of R tools for bioinformatics analysis. The package is built to work like ggplot2, except for trees. You can find more information (including installation instructions, which are a bit different for Bioconductor packages) here: https://bioconductor.org/packages/release/bioc/html/ggtree.html

Feel free to explore these programs and packages at your leisure, and find what works best for you. For now, for the sake of time, I have written a Python script using the Biopython Phylo module that you can use to visualize your results, named visualize_PhyloTree_base.py. Fill in the file name and path to your files, and you should be good to go! (Hint: If these instructions aren't clear, take a look at the documentation within the script!)

For the input file for this script, use the FILENAME.treefile file output by IQ-TREE. This file contains the phylogenetic tree generated from the MSA in NEWICK format. A Newick tree is a 1-line simple text representation of a phylogenetic tree, that should be recognized by any phylogenetic tree visualization software.
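
To get a feel for the Newick format, it can help to pull the tip labels out of a small tree by hand. The sketch below uses only a regular expression, with a toy tree invented for this example. It is not a full Newick parser: it assumes unquoted labels without spaces, and the function name is made up.

```python
import re


def newick_leaf_names(newick):
    """Return leaf (tip) labels from a Newick string, in order of appearance.

    A leaf label directly follows '(' or ','. Labels that follow ')'
    belong to internal nodes and are skipped.
    """
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick)


toy_tree = "((A:0.1,B:0.2)95:0.3,C:0.4);"
print(newick_leaf_names(toy_tree))
```

In the toy tree, nested parentheses encode the branching order, the numbers after each colon are branch lengths, and the `95` after the closing parenthesis is a label on the internal node joining A and B. For real work, a dedicated parser such as the Biopython Phylo module is the safer choice.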

In [ ]:

Take a look at the tree that you have generated. What do you notice? Are there any interesting patterns?

Comparing trees¶

At this stage of the tutorial, two paths are available: familiarizing yourself with some web tools, or performing alignment editing via the command line. Either way, the goal of this final section is to compare trees generated in different ways. Please read through the descriptions of these options, and select the one that is right for you.

Please perform at least two (2) tree filtrations.

Web Tools:
- Pick this option if you have <15 minutes left until we look over the answers together, and/or if you are struggling with this material a bit.
- This option is less technically demanding, which means you can spend more of your time working with the concepts.
- You will have the opportunity to test some web-based tools for phylogenetic analysis.
Alignment editing:
- Pick this option if you feel confident about the conceptual parts of tree evaluation, and you have >15 minutes left until we go through the exercise.
- You will try your hand at editing alignments, in order to improve the gene tree.

Once you have completed your selected option and generated a new tree, come back to this point in the Jupyter Notebook and answer the question below.

How does the cleaned tree compare to the original version?

Web tools for phylogeny¶

Every passing year, bioinformatics becomes a larger and more significant part of biology. This of course presents some problems for those biologists that were trained as biologists, and not bioinformaticians. Dry and wet lab skills are not the same, and not everyone has the time or means to learn a vastly different set of skills.

Fortunately, as is often the case with bioinformatics, tools are freely available on the web that allow you to work with phylogenomic data. These tools are generally more limited in scope than their command-line counterparts - they have more limitations regarding dataset size and the degree to which you can fine-tune your search/request. However, they are still a great tool even for bioinformaticians, if the query is straightforward and involves a small dataset. Sometimes it's simply easier to press 3 buttons than to write an entire script.

Sometimes these tools are available as stand-alone programs from the same organization that made the software. MAFFT, for example, allows you to create an MSA on their website, here: https://mafft.cbrc.jp/alignment/server/

For our purposes, however, we will be using the collection of MSA software made available by EMBL-EBI, here: https://www.ebi.ac.uk/jdispatcher/msa

As you can see, a variety of aligners are available on this website, including (but certainly not limited to): Clustal Omega, MAFFT and MUSCLE, three of the most commonly used MSA programs. Note that, quite usefully, these programs all generate not just an alignment, but a visualized tree for you, too. Feel free to download a Newick tree file and visualize it if you wish, but it's also fine to just look at the ones the website generated for you.

Try out at least two aligners on the website (whichever ones strike your fancy), and compare the results to the tree we built from the MAFFT alignment on Vera.

Editing the MSA¶

As we discussed during the lecture, cleaning up an MSA is an important part of a phylogenetic analysis workflow. Test out some of the strategies we discussed on the MSA you made, and see if anything changes!

In order to edit an MSA, you have two options:

1. Install AliView on your local computer (not the server). This program will allow you to examine and edit alignments manually. Find it here: https://ormbunkar.se/aliview/
2. Use a command-line MSA editing program. The two options installed in the Phylogeny container you are running are: TrimAl (http://trimal.cgenomics.org/introduction) and CIAlign (https://cialign.readthedocs.io/en/latest/pages/introduction.html). If you choose this option, please make sure to visualize the MSAs you create in a web browser, so that you can see for yourself the differences in the alignment.

Note that while Option 1 does require you to install software, it is more interactive, and gives you far more control over your workflow.

AliView¶

Install AliView by following the instructions for your operating system, at: https://ormbunkar.se/aliview/

Then do the following:

Open the program
Navigate: File → Open File → Navigate to and select your MSA to open it in the program
Turn on Edit Mode: Edit → Edit mode (should have a check mark if edit mode is turned on)
Select portions of the alignment to remove: Select & drag your cursor along the position numbers at the top → Edit → Delete selected
- You can also try a variety of different editing options within the Edit menu (ex.: Delete gap-only columns)
Save the new MSA to a new file with: File → Save as Fasta
Visualize the gene tree again with the new MSA, and compare it to the species tree and other gene tree(s). What has changed (if anything)?

Editing using command-line programs¶

Use the options found in the program manuals to play around with editing the alignments.

TrimAl command line usage manual: https://trimal.readthedocs.io/en/latest/usage.html
CIAlign command line usage manual: https://cialign.readthedocs.io/en/latest/pages/usage.html
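
To illustrate the general idea behind gap-based trimming, here is a minimal sketch that removes alignment columns in which too many sequences have a gap. It is NOT how TrimAl or CIAlign are implemented, and it does not replace them: the function name, the dict-based input, and the default threshold of 0.8 are assumptions made for this example.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.8):
    """Drop columns whose fraction of gap characters exceeds max_gap_fraction.

    alignment: dict mapping sequence name to aligned sequence (all the same length).
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if not rows:
        return {}
    n_rows = len(rows)
    # Indices of the columns that are kept
    keep = [
        col
        for col in range(len(rows[0]))
        if sum(row[col] == "-" for row in rows) / n_rows <= max_gap_fraction
    ]
    return {name: "".join(row[col] for col in keep) for name, row in zip(names, rows)}


# Toy alignment, invented for this example
toy = {"a": "M-K-", "b": "M-KL", "c": "MV--"}
print(trim_gappy_columns(toy, max_gap_fraction=0.5))
```

The dedicated programs offer many more strategies than this, so use their manuals (linked above) to pick the options for your actual trimming runs.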

In [ ]:

%%bash
# write your trimming code here

In [ ]:

%%bash
# rerun IQ-TREE for trimmed MSA(s)

In [ ]:

%%bash
# visualize the new trees

View your files here, by editing the file names/paths:

Original Tree: Phylogenetic tree from original MSA

MSA edited with AliView: Phylogenetic tree from MSA cleaned by PROGRAM NAME

Bootstrapping¶

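The -B 1000 argument tells IQ-TREE to run the tree generation with bootstrapping (1000 ultrafast bootstrap replicates). Look carefully at the options of each IQ-TREE command you have run: only the trees generated with this argument will carry bootstrap support values.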

What is bootstrapping/bootstrap support? Why does it matter?

It's possible to display bootstrap values on phylogenetic trees with the Bio.Phylo.draw() command. If you have the time and interest, take a look at the visualize_PhyloTree_base.py script, and modify it to add this feature. (Remember to also modify the output file name, so you don't overwrite your previous files!) Save the modified version of the file as visualize_PhyloTree_bootstrap.py, and run it on the treefiles that have bootstraps. If you don't have the time to make this modification now, you can use the visualize_PhyloTree_bootstrap.py script provided to you. In this case, I would recommend you take a look at the prepared script, and see if you can understand the changes made!
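
Before plotting, it can be useful to see where the support values actually live in the treefile: in Newick format they appear as labels directly after the closing parenthesis of each internal node. The sketch below extracts them with a regular expression from a toy tree invented for this example. The function names are made up, and the cutoff of 95 is an assumption, based on the value commonly cited for ultrafast bootstrap support.

```python
import re


def support_values(newick):
    """Return the numeric internal-node labels (support values) of a Newick string."""
    return [float(value) for value in re.findall(r"\)(\d+(?:\.\d+)?)", newick)]


def fraction_well_supported(newick, threshold=95.0):
    """Fraction of labeled internal nodes with support >= threshold (None if unlabeled)."""
    values = support_values(newick)
    if not values:
        return None
    return sum(value >= threshold for value in values) / len(values)


toy_tree = "((A:0.1,B:0.2)98:0.3,(C:0.1,D:0.1)60:0.2,E:0.4);"
print(support_values(toy_tree))
print(fraction_well_supported(toy_tree))
```

A treefile produced without bootstrapping has no such labels, in which case `fraction_well_supported()` returns `None`.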

In [ ]:

%%bash
# visualizing the bootstrapped trees

What do you observe, looking at the bootstrapping? Are there any notable patterns?

Summing up¶

We will discuss these questions together at the end. This cell is included in this document for you to consider them, and to take notes on the class discussion.

How well does your gene tree match the species tree of supergroup Metamonada?

Is this a good gene to use to reconstruct the phylogeny of these species? Why or why not?

Citations¶
