Phylogeny Tutorial¶
Course: Applied Bioinformatics (BBT045)
Teacher: Vi Varga
Date: 20.02.2023
Introduction¶
In this exercise you will practice your newly acquired skills in phylogenetic inference and "tree thinking", by analyzing the evolutionary history of a gene family.
You will (roughly) follow the phylogenetic analysis workflow discussed during the lecture, starting with collecting the data necessary for running the analysis (in the form of homologous protein sequences from different species), through the interpretation of results (i.e., comparing species and gene trees).
For this tutorial, you will be working with a protein from Trichomonas vaginalis. T. vaginalis is a parasitic protist that causes the sexually transmitted infection Trichomoniasis. While infection is generally asymptomatic, complications can include up to infertility or sterility. This particular protist belongs to the eukaryotic supergorup Metamonada, which is comprised of 4 primary phyla (Anaeramoebidae, Parabasalia, Preaxostyla and Fornicata) and contains a wide variety of parasites, commensals, and free-living organisms.
Please download this tutorial as a Jupyter Notebook from here, using, for example, wget
.
Setting up the environment¶
(BONUS) Creating the conda
environment & container¶
In this section, I will describe how the container and conda
environment were created.
YOU DO NOT NEED TO REPLICATE THESE STEPS!!! The container that you will be using to run this Jupyter Notebook already exists, and the next section will contain an explanation for how to use it.
*This section is for **informational purposes only!***
Before you start any new project, it's a good idea to set up a new conda
environment in which you can install software for use on the command line. If you open multiple terminal windows while working, please make sure to activate the conda
environment in each one. In general, it is a good idea to get into the habit of using conda
environments or containers, and activating the relevant environment directly after opening the terminal (or, directly after logging on to the server). This way, you won't accidentally try to run software that isn't installed in your base conda
environment.
It's also good practice to double-check the environment you have activated prior to installing any new software with conda
. Un-installing programs that you've accidentally installed takes much, much, much longer than you'd expect! (Think, "need to leave the computer running overnight" kinds of situations. (。_。) )
When should you create a new conda
environment? Generally, it's a good idea to have a dedicated conda
environment for any project you're working on. That way, at the end of the project, you can synthesize all the information related to version numbers of the programs you used quite quickly. However, you may at times need to create environments for specific programs - this is particularly common for older programs that may require older-than-standard versions of some dependency packages, particularly programming language versions (ex.: programs written with Python 2.X will not be compatible with a conda
environment running Python 3.X, and visa versa).
You can create a conda
environment like so:
# working on the server
conda create -n phylo-tutorial-env #python=3.9
# feel free to name the environment whatever you like
# just try to make sure your name is descriptive, so you can remember what it was for
# the `python=3.9` is necessary to ensure a current version of Python is installed, instead of an outdated one
# You'll see a lot of scrolling text, and then need to confirm creation of the environment with "y"
# once setup is done, activate the environment:
conda activate phylo-tutorial-env
# you can deactivate the environment later with:
conda deactivate
# remember not to use that above command in your base conda environment!
With the conda
environment set up, you can install the relevant software. Below, I demonstrate the software installations I used while setting up this tutorial. Note that in some plaecs I use mamba
instead of conda
. The mamba
program is basically an ultra-fast version of conda
(it needs to be seperately installed). It's much quicker and more effective at resolving dependencies when installing programs.
# install programs
# start with Python- and Jupyter-related programs
mamba install python matplotlib numpy scipy pandas seaborn jupyterlab=4.0.12 jupyter
# jupyterlab=4.0.12 is necessary because of a but in the most recent jupyterlab=4.1.0 version
# ref: https://github.com/jupyterlab/jupyterlab/issues/15461
# install bioinformatics tools used in the pipeline
mamba install -c conda-forge biopython
mamba install -c bioconda mafft iqtree trimal cialier
I then export the conda
environment to a file, and build a container from it:
# make sure you are in your base conda environment for this
# creating a YAML file of the conda environment
conda env export -n phylo-tutorial-env > phylo-tutorial-env.yml
# using Apptainer to build the conda environment in a container
apptainer build --build-arg ENV_FILE=phylo-tutorial-env.yml phylo-tutorial-env.sif conda_environment_args_v2.def
# run the Jupyter Notebook from the container from the command line like so:
apptainer exec phylo-tutorial-env.sif jupyter lab
(Mandatory) Running the container¶
From this point on, the information in this tutorial actually should be followed. Here, you'll find instructions on how to use the container created for this tutorial and exercise, by modifying your existing run_jupyter.sh
script.
Similarly to what you did for the Sequencing Technologies tutorial, you need to modify the portion of your run_jupyter.sh
script that specifies the location of the container you are using.
In this case, please change the line specifying the path to the container to the following:
container=/cephyr/NOBACKUP/groups/bbt045_2024/Phylogeny/phylo-tutorial-env.sif
In addition, I recommend that you change the number of cores you request in the run_jupyter.sh
script to 2, at least for the portion of this Jupyter Notebook where you run IQ-TREE. If you choose to do this, change the command like so:
# this is what the command should look like before modification
srun -A ${PROJ_ID} -n 1 -t ${TIME} --pty apptainer exec $container jupyter lab --ip=${IP} --port ${STUDENT_ID}
# to change ii to use 2 cores instead of 1, change the argument given to the -n flag, like so:
srun -A ${PROJ_ID} -n 2 -t ${TIME} --pty apptainer exec $container jupyter lab --ip=${IP} --port ${STUDENT_ID}
Setting up your directory system¶
Please also make a directory on the server in which you will store your files for this tutorial, as well as the exercise to follow. You would be surprised how swiftly the number of files you're using gets out of hand, so try to develop good habits from the beginning! For example, it's good practice to have a bin/
directory in your home directory, where you store executable files and the like for programs that you cannot simply install via conda
.
# in the directory where you have your files for Applied Bioinformatics
# for ex.: a directory named AppliedBioinfo/
mkdir PhyloWorkflow/
cd PhyloWorkflow/
mkdir Exercise/ Tutorial/
# note that `mkdir` can take multiple arguments
# so we've just created both directories with one line of code
Above, I used my personal preferred naming convention, but feel free to use whatever file names you wish, as long as they're descriptive.
Obtaining & Exploring Data¶
Data Location¶
All data files that you will need to run this tutorial can be found in the /cephyr/NOBACKUP/groups/bbt045_2024/Phylogeny/
directory. Materials for the tutorial can be found in the Tutorial/
subdirectory, while materials for the homework can be found in the Homework/
subdirectory.
Please copy the files necessary to run the tutorial to your working directory now, like so:
cp -r /cephyr/NOBACKUP/groups/bbt045_2024/Phylogeny/Tutorial/* PATH/TO/YOUR/PhyloWorkflow/Tutorial/
cp /cephyr/NOBACKUP/groups/bbt045_2024/Phylogeny/*.py PATH/TO/YOUR/PhyloWorkflow/Tutorial/
Preliminary data exploration¶
For this tutorial, we will be using the XP_001322682.1__Tvag.fasta file. Take a look!
- What kind of FASTA file is it?
Fill in your answer!
- What is the protein ID?
Fill in your answer!
This is a RefSeq protein, which means it is considered good quality. To quote the NCBI, the RefSeq database is a "comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein." (Source: https://www.ncbi.nlm.nih.gov/refseq/)
In T. vaginalis, XP_001322682.1 is predicted to function in the cytosol, though paralogs of the protein in T. vaginalis are known to function in the mitochondrion (Smutná et al. 2022)[^1].
[^1]: Technically speaking, T. vaginalis doesn't actually have a mitochondrion, per se. T. vaginalis and all other organisms within supergroup Metamonada have "Mitochondrion-Related Organelles" (MROs), which are extremely functionally reduced mitochondria. Monocercomonoides exilis, a member of supergroup Metamonada, is actually the only known eukaryote to completely lack a mitochondrion!
Finding homologous sequences¶
In order to find homologs of this gene to use in a gene tree, we're going to use NCBI BLAST. While command-line BLAST technically exists, overall it is much simpler and faster to use the web browser version. This is particularly the case as the conda
installation of NCBI BLAST doesn't work very well.
- What type of BLAST do we need to run? Explain your reasoning.
Fill in your answer!
From the NCBI BLAST homepage (https://blast.ncbi.nlm.nih.gov/Blast.cgi) select the appropriate BLAST algorithm.
You can BLAST the sequence in one of two ways: either you can copy the sequence into the search box, or you can use the gene name (XP_001322682.1). Since this is the name of the protein in the NCBI database, it is possible to search for BLAST hits using only the protein ID - if this was a protein from an organism you sequenced, with no official name in the NCBI database, you would only have the option of performing a sequence-based BLAST.
Go ahead and BLAST the gene, and maybe take a short coffee break. BLAST can sometimes take a minute.
ヾ(@⌒ー⌒@)ノ
The first hit should be our protein - go ahead and check! Do you notice anything about the quality of the BLAST hits?
Fill in your answer!
In the light green bar above the search results labeled "Sequences producing significant alignments" you'll find a "Download" drop-down menu. Select "FASTA (complete sequence)" and you'll download a file named seqdump.txt
. Rename this file to something meaningful (ex.: XP_001322682.1__MSAprep.fasta) and move it to your Tutorial/ working directory. You can do this using copy/paste into nano
, an SSH file transfer program like FileZilla, or scp
(scp XP_001322682.1__MSAprep.fasta CID@vera1.c3se.chalmers.se:/FULL/PATH/TO/WORKING/DIRECTORY
where you can fill in your username and the path to your working directory).
- How many sequences are in the file?
Fill in your answer!
Cleaning the data¶
Generally, when you compile protein sequences to use in an analysis, you want to clean the data in some way. A few common data transformations include: capitalizing all letters in sequence lines, editing header lines, removing non-standard characters from sequence lines, and conversion between multi-line and single-line FASTA format.
- Why might it be important to clean the data in this way?
Fill in your answer!
For the sake of time, I've prepared a cleaning script using Python (clean_MSA_seqs.py
). You can use it from the command line like so:
# model usage:
python clean_MSA_seqs.py input_fasta
# example application:
python clean_MSA_seqs.py XP_001322682.1__MSAprep.fasta
# Adapt the command above to your file names!
# remember that this cell will assume you are running Python commands - you need to tell it you're running something on the command line
I tend to comment my code pretty thoroughly, so it should be quite readable, but let me know if you have any questions. If you're up for a challenge, see if you can write something like this yourself! (But finish the tutorial first - come back to this later if you still have the time.)
- Compare the original and cleaned files. Do they contain the same number of sequences? The same number of characters? What changed, if anything? Why do you think this is?
Fill in your answer!
Since our sequences are now cleaned, we can move on to generating the alignment.
Multiple Sequence Alignment¶
Generating the MSA¶
In order to generate the MSA, we're going to be using the MAFFT software. You can find more information on this program at the links below:
- Homepage: https://mafft.cbrc.jp/alignment/software/
- Manual page: https://mafft.cbrc.jp/alignment/software/manual/manual.html
- Take a look at the MAFFT manual page. Which algorithm do you think best suits our purposes? Why?
Fill in your answer!
Now, go ahead and create the MSA:
! mafft --localpair --maxiterate 1000 --amino XP_001322682.1__MSAprep_CLEAN.fasta > XP_001322682.1__MSA.fasta
# Adapt the command above to your file names!
# `--localpair --maxiterate 1000` tells MAFFT to use the L-INS-i algorithm
# `--amino` tells MAFFT that the input is a protein FASTA
Viewing the MSA¶
There are many different tools that you can use to view an MSA. I've provided a few examples below:
- Web-based tools:
- As with most aspects of bioinformatics, there are tools available on the web in order to view MSAs. As is often the case with web-based programs, though, their scope is rather limited (especially for the tree programs).
- EMBL-EBI MView: EMBL-EBI provides a web-based tool where you can upload an MSA, and see the results. Access it from here: https://www.ebi.ac.uk/jdispatcher/msa/mview
- NCBI Multiple Sequence Alignment Viewer: The NCBI provides a web-based MSA viewer, which you can access from here: https://www.ncbi.nlm.nih.gov/projects/msaviewer/
- Stand-alone software:
- Software designed for phylogenomics analysis provides far more flexibility than web-based tools, though this of course comes with the trade-off of requiring installation, and taking up space on your hard drive.
- AliView: (My personal favorite) This program from Uppsala Univeristy provides smooth viewing and editing of MSAs. Find more information on it here: https://ormbunkar.se/aliview/
- MEGA-11: The MEGA software suite allows a huge range of phylogenomics analysis tools. You can create MSAs, edit alignments, visualize phylogenetic trees, perform bootstrap testing... All from within a GUI window! Find it here: https://www.megasoftware.net/
- CIAlign: This software suite works from the command line, and is installed in the phylogeny container. It can be used to view (portions of) and edit MSAs.
- See the GitHub page here: https://github.com/KatyBrown/CIAlign
- See the documentation here: https://cialign.readthedocs.io/en/latest/
alen
: A simple command-line MSA viewer, installed in the container. See the GitHub page here: https://github.com/jakobnissen/alen
For now, we will use the web-based tool provided by EMBL-EBI (https://www.ebi.ac.uk/jdispatcher/msa/mview). Upload your MSA file, give it a minute to process, and then take a look at the results!
If you would prefer to stay on the HPC, please use alen
, instead! The command is interactive, so you'll need to run the following in a new terminal you open in Jupyter Lab, rather than from a cell in your Jupyter Notebook:
/opt/alen/bin/alen XP_001322682.1__MSA.fasta
# A peek behind the curtain:
# I struggled to add alen to the PATH, so it's best to use it by calling the full PATH to the executable within the container
- Do you notice any patterns?
Fill in your answer!
Creating the tree¶
We will be using the IQ-TREE software to generate the gene tree. You can find more information about this program at the links below:
- Homepage: http://www.iqtree.org/
- Manual: http://www.iqtree.org/doc/iqtree-doc.pdf
Go ahead and run the command you see below - there will be a lot of text printed to the screen, but don't worry about redirecting it to a file to look at later, because all of it will also be printed to the log file generated automatically by IQ-TREE. This will take a few minutes. So feel free to grab a coffee, take a short break!
♪(^∇^*)
Once you're ready, feel free to read the little chunk of text below the code block here - it'll provide a little more information this type of analysis.
! iqtree -s XP_001322682.1__MSA.fasta --prefix XP_001322682.1__MSA_IQ -m LG+I+R5 -seed 12345 -wbtl -B 1000 -T AUTO -ntmax 2
# Adapt the command above to your file names!
# -s is the option to specify the name of the alignment file that is always required by IQ-TREE to work.
# -m is the option to specify the model name to use during the analysis.
# The special MFP key word stands for ModelFinder Plus, which tells IQ-TREE to perform ModelFinder
# and the remaining analysis using the selected model.
# Here, the model to use has been pre-selected: LG+R5
# To make this reproducible, need to use -seed option to provide a random number generator seed.
# -wbtl Like -wbt but bootstrap trees written with branch lengths. DEFAULT: OFF
# -T AUTO: allows IQ-TREE to auto-select the ideal number of threads
# -ntmax: set the maximum number of threads that IQ-TREE can use
Note that a typical tree-finding process is quite a bit longer than what you did here. IQ-TREE has a specific argument -m MFP
that calls a process called Model Finder Plus which tests many, many different tree models, and finds the one that best fits the data. (Don't worry about what these models are - that's beyond the scope of this class. Suffice to say, it's complicated statistics.) I ran this analysis with -m MFP
while preparing this exercise, and even for such a small dataset (only 100 sequences), the process took roughly 2 hours! Clearly, not something we could all do together in class. Tree finding is a complex, computationally demanding process, but is a crucial part of phylogenetic reconstruction, and not the step where you should try to spare CPU hours.
Take a look at the logs for your IQ-TREE run in order to answer the following:
- What are parsimony-informative sites? How many did IQ-TREE detect for you?
Fill in your answer!
- Did IQ-TREE correctly infer the file format and sequence type you had? Why might this information be useful to include in the logs?
Fill in your answer!
Note that another important element of tree building is bootstrapping (there's more information on this topic later in the tutorial). Unfortunately, bootstrapping is an extremely resource-intensive process, so for the same of time, I have prepared 4 bootstrapped trees for you, in the /cephyr/NOBACKUP/groups/bbt045_2024/Phylogeny/Tutorial/BootstrapEx/
directory, which you have likely already copied over.
If you would like to run the bootstrapping yourself, please do so! Simply remove the first line of the following cell to make it runnable. You can also copy/paste the following code into a new terminal in Jupyter Lab, to leave it running in the background.
%%script false --no-raise-error
! iqtree -s XP_001322682.1__MSA.fasta --prefix XP_001322682.1__MSA_IQ -m LG+I+R5 -seed 12345 -wbtl -B 1000 -T AUTO -ntmax 2
# -B specifies the number of bootstrap replicates where 1000 is the minimum number recommended
Visualizing Trees¶
There are many different tools that you can use to visualize a phylogenetic tree. I've provided a few examples below:
- Web-based tools:
- As with most aspects of bioinformatics, there are tools available on the web in order to visualize phylogenetic trees. As is often the case with web-based programs, though, their scope is rather limited (especially for the tree programs).
- ETE Toolkit: The ETE Toolkit is available as a Python package, but they also have a web server where you can visualize your trees, here: http://etetoolkit.org/treeview/
- NCBI Tree Viewer: The NCBI provides a web-based phylogenetic tree viewer, which you can access from here: https://www.ncbi.nlm.nih.gov/tools/treeviewer/
- Stand-alone software:
- Software designed for phylogenomics analysis provides far more flexibility than web-based tools, though this of course comes with the trade-off of requiring installation, and taking up space on your hard drive.
- FigTree: (My personal favorite) This program allows you to open trees and edit components of its visualization, before exporting in a variety of different file types (PNG, JPEG, SVG, etc.). It's a JAVA-based application, so if you have Java installed on your computer, no further installation processes will be necessary to open FigTree. Find it here: http://tree.bio.ed.ac.uk/software/figtree/
- MEGA-11: The MEGA software suite allows a huge range of phylogenomics analysis tools. You can create MSAs, edit alignments, visualize phylogenetic trees, perform bootstrap testing... All from within a GUI window! Find it here: https://www.megasoftware.net/
- Packages built for bioinformaticians:
- There are plenty of packages/libraries available for the visualization of phylogenetic trees, built to work with the programming languages most used by bioinformaticians: Python and R. These editing tools have a higher learning curve, since you need to code to change aspects of the tree, but they also allow far more flexibility than either web-based tools or stand-alone software.
- Python:
- The ETE Toolkit (mentioned above) is actually primarily a Python package. Find it here: http://etetoolkit.org/
- Biopython is a whole suite of Python packages for bioinformattics analysis, so of course, they have their own package for working with phylogenetic trees,
Phylo
. Find it here: https://biopython.org/wiki/Phylo
- R (this course doesn't use R, but it's a fantastic language for visualization):
- The
ape
library in R can be used to visualize and edit phylogenetic trees. It can be installed the usual way (install.packages(ape)
). The creators of the package have provided a great tutorial, which you can find here: http://ape-package.ird.fr/misc/DrawingPhylogenies.pdf - The
ggtree
library was created by Bioconductor, which provides a suite of R tools for bioinformatics analysis. The program is built to work likeggplo2
, except for trees. You can find more information (including installation instructions, which are a bit different for Bioconductor packages) here: https://bioconductor.org/packages/release/bioc/html/ggtree.html
- The
Feel free to explore these programs and packages at your leisure, and find what works best for you. For now, for the sake of time, I have written a Python script using the Biopython Phylo module that you can use to visualize your results, named visualize_PhyloTree_base.py
. Fill in the file name and path to your files, and you should be good to go! (Hint: If these instructions aren't clear, take a look at the documentation within the script!)
For the input file for this script, use the FILENAME.treefile file output by IQ-TREE. This file contains the phylogenetic tree generated from the MSA in NEWICK format. A Newick tree is a 1-line simple text representation of a phylogenetic tree, that should be recognized by any phylogenetic tree visualization software.
# run the python script to visualize your tree in this cell!
! python visualize_PhyloTree_base.py XP_001322682.1__MSA_IQ.treefile
- Take a look at the tree that you have generated. What do you notice? Are there any interesting patterns?
Fill in your answer!
Editing the MSA¶
As we discussed during the lecture, cleaning up an MSA is an important part of a phylogenetic analysis workflow. Test out some of the strategies we discussed on the MSA you made, and see how if anything changes!
In order to edit an MSA, you have two options:
- Install AliView on your local computer (not the server). This program will allow you to examine and edit alignments manually. Find it here: https://ormbunkar.se/aliview/
- Create a
conda
environment (as shown below), and play around with the settings of a MSA editing software, for ex.: TrimAl (http://trimal.cgenomics.org/introduction) or CIAlign (https://cialign.readthedocs.io/en/latest/pages/introduction.html). If you choose this option, please make sure to visualize the MSAs you create in a web browser, so that you can see for yourself the differences in the alignment.
Please perform at least two (2) tree filtrations.
Note that while Option 1 does require you to install software, it is likely the simpler option, especially for those with less experience coding. It is also more interactive.
AliView¶
Install AliView by following the instructions for your operating system, at: https://ormbunkar.se/aliview/
Then do the following:
- Open the program
- Navigate: File → Open File → Nagivate to and select your MSA to open it in the program
- Turn on Edit Mode: Edit → Edit mode (should have a check mark if edit mode is turned on)
- Select portions of the alignment to remove: Select & drag your cursor along the position numbers at the top → Edit → Delete selected
- You can also try a variety of different editing options within the Edit menu (ex.: Delete gap-only columns)
- Save the new MSA to a new file with: File → Save as Fasta
- Visualize the gene tree again with the new MSA, and compare it to the species tree and other gene tree(s). What has changed (if anything)?
Editing using conda
¶
Install alignment editing software within that environment:
- TrimAl installation with
conda
: https://anaconda.org/bioconda/trimal - CIAlign installation instructions: https://cialign.readthedocs.io/en/latest/pages/installation.html
- Note that these two programs are already installed within the container environment.
And use the options found in the program manuals to play around with editing the alignments.
- TrimAl command line usage manual: http://trimal.cgenomics.org/use_of_the_command_line_trimal_v1.2
- CIAlign command line usage manual: https://cialign.readthedocs.io/en/latest/pages/usage.html
#TrimAl example usage:
trimal -in XP_001322682.1__MSA.fasta -out XP_001322682.1__MSA_trim1.fasta -gappyout
# CIAlign example usage:
CIAlign --infile XP_001322682.1__MSA.fasta --outfile_stem XP_001322682.1__MSA_trim2 --clean
%%bash
# write your trimming code here
Comparing trees¶
Once you have completed your alignment trimming, generate and visualize new versions of the phylogenetic tree. THen answer the following question:
- How does the cleaned tree compare to the original version?
Fill in your answer!
%%bash
# you may want to generate new trees and visualize them here
Bootstapping¶
If you look carefully at the IQ-TREE command options, you'll notice that one of the arguments (-B 1000
) tells IQ-TREE to run the tree generation with bootstrapping.
- What is bootstrapping/bootstrap support? Why does it matter?
Fill in your answer!
It's possible to display bootstrap values on phylogenetic trees with the Bio.Phylo.draw()
command. Take a look at the visualize_PhyloTree_base.py
script, and modify it to add this feature. (Remember to also modify the output file name, so you don't overwrite your previous files!) Save the modified version of the file as visualize_PhyloTree_bootstrap.py
, and run it on the treefiles that have bootstraps.
- What do you observe, looking at the bootstrapping? Are there any notable patterns?
Fill in your answer!
%%bash
# visualize the bootstrapped trees
Applications of phylogenomics¶
Above, you've implemented a phylogenomics pipeline. But in an actual research project, creating and pruning a phylogenetic tree is just the beginning!
In this last portion of the tutorial, you will dive more deeply into a few proteins, in order to compare them; and experience a bit more accurately what research is generally like. The tutorial will provide you some pointers, but they won't be step-by-step instructions.
To begin, select a few proteins (ideally 3-5), to look at in greater detail and compare. The proteins below are mostly suggestions (assuming they turned up in your BLAST search, too!), except for the 1st T. vaginalis protein that you started this tutorial with, which you should use [^2]:
- XP_001322682.1 (Trichomonas vaginalis)
- BAF82035.1 (Pseudotrichonympha grassii)
- XP_001580286.1 (Trichomonas vaginalis)
- XP_012899174.1 (Blastocystis hominis)
[^2]: Note that not every single protein from the NCBI has been predicted by AlphaFold - you want the well-curated proteins, which usually start with "XP*", though not always, as evidenced by the P. grassi protein I'll be using in my example solutions we'll discuss during the tutorial.
Structural Bioinformatics¶
In 2021, protein structure prediction was revolutionized by two new pieces of software: AlphaFold and RoseTTAFold. Both of these are AI-powered protein 3D structure prediction algorithms. AlphaFold eventually released the AlphaFold Protein Structure Prediction Database, in which they included predictions for a large number of the proteins in the UniProt database.
Take a look at the structure of the proteins you have selected on the AlphaFold structure prediction database. Note that this won't be as simple as pasting the protein ID into the search bar - you'll need to find the protein's name from the NCBI.
- What do you notice about the proteins? Do the 3D predictions look similar to your eyes?
Fill in your answer!
- Do any of these proteins seem to have strange/unexpected structures? Explain.
Fill in your answer!
Functional profiling of proteins¶
You can also learn about what proteins do through functional profiling. This, together with other information can be used to learn about the evolution of gene families.
As an example to illustrate this, you're going to prepare a consensus sequence from the MSA, and then use functional prediction algorithms to infer the protein functions of XP_001322682.1 (the T. vaginalis protein you started with), KAJ4462103.1 (a sequence that should have been returned in the BLASTp search, belonging to Paratrimastix pyriformis, a free-living protist in supergroup Metamonada) and that consensus sequence.
- Why might we want to compare the inferred functions of the consensus sequence and our original query sequence?
Fill in your answer!
To begin, run the code below to create a FASTA file with the MSA consensus sequence, generated by Biopython. Note that you may need to modify the file name!
# ref: https://stackoverflow.com/questions/73702044/how-to-get-a-consensus-of-multiple-sequence-alignments-using-biopython
# ref: https://biopython.org/docs/1.75/api/Bio.Align.AlignInfo.html
# Part 1: Import necessary modules, determine input & output files
# import relevant modules
import os # allows access to the operating system
# modules necessary for the function to work
from __future__ import annotations
from pathlib import Path
from itertools import chain
# necessary Biopython modules
import Bio
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from Bio.Align.AlignInfo import SummaryInfo
# load input & output files
input_msa = "XP_001322682.1__MSA.fasta"
# extract the file basename
base = os.path.basename(input_msa)
outfile_base = os.path.splitext(base)[0]
# designate the output file nam
output_file = outfile_base + '__ConSeq.fasta'
# create a SeqRecord object with Biopython
SeqRecord = Bio.SeqRecord.SeqRecord
# Part 2: Create the consensus sequence
# define a function to return the consensus sequence
# ref: https://stackoverflow.com/a/73705206/18382033
def get_consensus_seq(filename: Path | str) -> SeqRecord:
# create an alignment object from the file
common_alignment = MultipleSeqAlignment(chain(*AlignIO.parse(filename, "fasta")))
# create a summary object for the MSA
summary = SummaryInfo(common_alignment)
# create the consensus sequence
consensus = summary.dumb_consensus(threshold=0.35)
# note that dumb_consensus() raises a deprecation error, but you can ignore this
# return the consensus sequence
return consensus
# use the function to retrieve the consensus sequence
consensus_seq = get_consensus_seq(input_msa)
# take a look at the sequence
print(consensus_seq)
# Part 3: Write out result in FASTA format
# write out the consensus sequence to a file
with open(output_file, "w") as outfile:
# open the output file for writing
# write the FASTA header line first - you can change this if you'd like
outfile.write(">XP_001322682.1__MSA__Consensus" + "\n")
# then write out the consensus sequence to the file
outfile.write(str(consensus_seq))
Next, you'll need to create a concatenated FASTA file containing the three sequences you'll be using. I've written the code to retrieve the sequences, but you may need to modify the file names, and you'll definitely need to write the code to write the sequences out to a single FASTA file.
# Part 1: Import necessary modules, determine input & output files
# import relevant modules
import os # allows access to the operating system
# set up input & output files
input_seq_file = "XP_001322682.1__MSAprep_CLEAN.fasta"
consensus_file = "XP_001322682.1__MSA__ConSeq.fasta"
# extract the file basename
base = os.path.basename(consensus_file)
outfile_base = os.path.splitext(base)[0]
# get only the original query protein name
outfile_split_base = outfile_base.split("__")[0]
# designate the output file name
output_file = outfile_split_base + '__functQuery.fasta'
# Part 2: Import the query data
# create a list of query proteins
query_list = ["XP_001322682.1", "KAJ4462103.1"]
# create an empty dictionary for the query sequences
fasta_dict = {}
# parse the large fasta file to retrieve the query sequences
with open(input_seq_file, "r") as infile:
# open the large sequence file for reading
for line in infile:
# iterate over the file line by line
if line.startswith(">"):
# identify the header lines
# and save them to a new variable
header = line
# check if any of the list elements are in the header line
# ref: https://stackoverflow.com/questions/6531482/how-to-check-if-a-string-contains-an-element-from-a-list-in-python
if any(gene_id in header for gene_id in query_list):
# use any() & list comprehension to find the headers for the query protein IDs
# save the sequence which is in the next line of the file
fasta_seq = next(infile)
# and add both the header & sequence to the dictionary
fasta_dict[header] = fasta_seq
# add the consensus sequence to the fasta dictionary
with open(consensus_file, "r") as infile:
# open the consensus FASTA sequence file for reading
for line in infile:
# iterate over the file line by line
if line.startswith(">"):
# identify the sequence header line
# and save it to a variable
header = line
else:
# for the sequence line
# save the sequence to a variable
fasta_seq = line
# add the consensus sequence & header to the dictionary
fasta_dict[header] = fasta_seq
# Part 3: Write out resulting FASTA to file
with open(output_file, "w") as outfile:
# open the output file for writing
for key in fasta_dict.keys():
# iterate over the dictionary via its keys
# and write the key:value pair to the outfile
outfile.write(key + fasta_dict[key])
# note that since I didn't strip the endline characters, they don't need to be added here
Now that you have your 3 sequences to test, it's time to run functional analysis on them.
For this tutorial, you'll be using EggNOG, a commonly used tool for the functional profiling of genes. EggNOG has its own internal curated orthology database based on gene function, that you can compare genes to. They also incorporate information from a variety of databases (like KEGG and PFam as hosted by Interpro, another tool similar to EggNOG)
Unfortunately, the EggNOG command-line program takes up quite a bit of disk space, so for this exercise, you'll be using the online eggNOG-mapper. Transfer the file with the three sequences to your computer (ideally via scp
or a file transfer program like FileZilla; but you can also print the sequences to your terminal, and then copy-paste them into a new file on your local machine), and run the eggNOG-mapper on them. Provide your email and use the interactive job management system.
(This can take a few minutes, so perhaps go grab yourself a coffee?)
(o゜▽゜)o☆
Once your job is done, retrieve your files & transfer them to the server. You can either transfer with scp
, a file transfer program like FileZilla, or download the files directly to Vera using wget
.
%%bash
# create results directory for EggNOG files
mkdir EggNOG_Results
# dowload results files
wget -P EggNOG_Results FILE/PATH/HERE
Finally, use the InterPro web search to try a different way of doing functional annotation. Use the same FASTA file you used as input for EggNOG. Once the job has completed (this should take less time than EggNOG), once again download the results files to the server.
%%bash
# create a directory
mkdir IPRScan_Results
# download the files
wget -P EggNOG_Results FILE/PATH/HERE
Now take a look at the results files. The ANNOTATIONS file is the main output file from EggNOG, and the GFF file would probably be the easiest to look at of the InterPro results files.
- What do you notice? Do the functional profiles of these proteins look similar to you? Are there any differences between the results of EggNOG and InterPro?
Fill in your answer!
- What do these results suggest about the evolution of this gene?
Fill in your answer!
Summing up¶
We will discuss these questions together at the end. This cell is included in this document for you to consider them, and to take notes on the class discussion.
- How well does your gene tree match the species tree of supergroup Metamonada?
Fill in your answer!
- Is this a good gene to use to reconstruct the phylogeny of these species? Why or why not?
Fill in your answer!
Citations¶
Smutná, T., Dohnálková, A., Sutak, R., Narayanasamy, R. K., Tachezy, J., & Hrdý, I. (2022). A cytosolic ferredoxin-independent hydrogenase possibly mediates hydrogen uptake in Trichomonas vaginalis. Current Biology, 32(1), 124-135.e5. https://doi.org/10.1016/j.cub.2021.10.050
Stairs, C. W., Táborský, P., Kolisko, M., Pánek, T., Eme, L., Hradilová, M., Vlček, Č., Jerlström-Hultqvist, Jon Roger, A. J., & Čepička, I. (2021). Anaeramoebae are a deeply divergent lineage of eukaryotes that clarify the transition from anaerobic mitochondria to hydrogenosomes. Current Biology, 31, 1–8.
Töpel, M. (2019, November). Webbased Phylogenomic analysis · The-Bioinformatics-Group/Teaching Wiki. GitHub. https://github.com/The-Bioinformatics-Group/Teaching/wiki/Webbased-Phylogenomic-analysis
Trichomoniasis - STD information from CDC. (n.d.). Retrieved October 11, 2021, from https://www.cdc.gov/std/trichomonas/default.htm