Automatic annotation of MYBs
Last update: Jul 14, 2024
MYB_annotator is available on our webserver
Please get in touch if you need help running the MYB annotator on your own dataset: Boas Pucker (email)
This tool allows an automatic identification and analysis of MYBs in a CDS or peptide sequence collection. Since the quality of structural annotation has substantially improved during the last years, the focus can now shift towards functional annotation. RNA-seq and long read sequencing of transcripts were fundamental for this development. There are numerous publications about R2R3-MYBs in various plant species. This tools allows the automatic investigation of the MYB gene family in any new species. The provided MYB sequences should be sufficient as baits in most plant species thus no manual preparation of input files is required. Results are a FASTA file containing the bona fide MYB sequences and a summary table with additional details. MYB candidates are screened for the presence of the conserved repeats and also for many other motifs. The function of all MYB candidates is predicted based on orthology to a previously characterized sequence (often in Arabidopsis thaliana).
While some dependencies are required, this tool does not require an installation. Downloading and executing the script on a Linux system is sufficient. There is currently no support for other operating systems. Most required modules are included in the initial Python installation, but dendropy might not be available on all systems.
git clone https://github.com/bpucker/MYB_annotator
cd MYB_annotator
how to install conda(click to expand)
wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
bash Anaconda3-2020.02-Linux-x86_64.sh
rm Anaconda3-2020.02-Linux-x86_64.sh
echo 'export PATH="~/anaconda/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
conda env create -f environment.yml
conda activate mybannotator
Python3 (sudo apt-get install python3.8
). It is also possible to use other Python3 versions.
dendropy (sudo apt install python3-pip && python3 -m pip install git+https://github.com/jeetsukumaran/DendroPy.git
)
MAFFT (sudo apt-get install -y mafft
)
BLAST (sudo apt-get install ncbi-blast+
) or HMMER(conda install -c bioconda hmmer
)
FastTree (sudo apt-get install -y fasttree
) and/or RAxML-NG (precompiled binaries recommended)
Usage
python3 MYB_annotator.py --baits <FILE> --info <FILE> --out <DIR> [--subject <FILE> | --subjectdir <DIR>]
Mandatory:
--baits STR MYB bait sequence file
--info STR MYB info file
--out STR Output folder
--subject STR Subject file (OR: --subjectdir STR Subject dir)
Optional:
--search STR Tool for initial search (blast|hmmer)[blast]
--mode STR Tool for tree construction (fasttree|raxml)[fasttree]
--hmm STR HMM file of MYB baits
--refmybs STR Reference MYB file
--cpu INT Number of threads [4]
--cdsinput NONE Changes expected input to CDS
--keepnames NONE Prevents splitting of sequence names at first space
--ath STR FASTA file with A. thaliana MYBs
--name STR Prefix of output file names
--collapse NONE Reduces paralogs to one representative
--motifs STR File with motifs to check in candidate sequences
--cpub INT Number of threads for BLASTp[cpu]
--cpur INT Number of threads for RAxML[cpu]
--mafft STR Path to MAFFT [mafft]
--blastp STR Path to blastp [blastp]
--hmmsearch STR Path to hmmsearch [hmmsearch]
--makeblastdb STR Path to makeblastdb [makeblastdb]
--fasttree STR Path to FastTree [fasttree]
--raxml STR Path to RAxML [raxml-ng]
--simcutp FLOAT BLASTp similarity cutoff [60]
--poscutp INT Max number of BLASTp hits per bait [100]
--lencutp INT Min BLASTp alignment length [75]
--numneighbours INT Neighbours to consider for classification [10]
--neighbourdist FLOAT Cutoff in neighbour identification [3]
--minneighbours INT Minimal number of bait neighbours for classification [0]
--paralogdist FLOAT Distance cutoff in paralog maksing [10]
--baits
specifies a FASTA file that contains the MYB bait sequences. Some of these sequences are bona fide MYBs and other are MYB-like sequences that are used as outgroup.
--info
specifies a text file that contains a table with all MYB IDs (matching the baits FASTA file) in the first column. Columns are TAB-separted. The second column contains an assignment to the ingroup ('in') or outgroup ('out'). Ingroup and outgroup baits are used to classify MYB candidates as bona fide MYBs or MYB-likes. For example, MYB-like/CDC5 sequences of A. thaliana are marked as outgroups in the baits_MYBs_info.txt. They are considered the closest known relatives of MYB sequences.
--out
specifies the output folder. This folder will be created if it does not exist already. All files generated during the analysis will be placed in this folder. There is a subfolder called 'RESULTS' that contains the important final files.
--subject
specifies the FASTA file that should be analysed. The default expectation is that this file contains peptide sequences. However, it is also possible to supply a collection of coding sequences (CDS) if the --cdsinput
flag is set. Providing a file with only primary transcript/peptide sequences can result in different results compared to collection of all isoforms. This is especially important with respect to the MYB domain which might be incomplete in some isoforms. If possible, avoid any special characters in the headers of your supplied sequences. However, the MYB_annotator contains a filtering step that will remove some characters. For example, tabs will be replaced by three spaces.
--subjectdir
specifies a folder with FASTA files. All files need to contain sequences of the same type (PEP or CDS). Supported file extensions are '.fasta', '.fas', '.fa', '.FASTA', '.FAS', '.FA', '.fna', '.FNA', '.cds', '.CDS', '.pep', and '.PEP'.
--search
specifies the tool for the initial identification of candidates. BLAST or HMMER can be used for this purpose. Default is BLAST.
--mode
specifies the tool for the construction of phylogenetic trees. RAxML ('raxml') or FastTree2 ('fasttree') can be specified. Default is 'fasttree'. Running the tree construction with RAxML is expected to result in slightly more accurate results, but would increase the run time from minutes to days in many cases.
--hmm
specifies a HMM file that represents the bait MYBs. This file is required when using hmmsearch for the identification of initial candidates.
--refmybs
specifies a text file with one MYB ID per line. All IDs listed in this file need to be present in the baits FASTA file and the info file. These IDs can be used for the functional annotation of the newly identified MYBs. This assignment is performed in two ways leading to two different output tables. The functional annotation is lifted from previously characterized sequences to newly identified orthologs.
--cpu
specifies threads to use for the BLASTp search and also for the RAxML tree construction. Default: 4.
--cpub
specifies specifically the number of threads to use for the BLASTp search. This option can be used if the number of cores to use for BLASTp and RAxML is different. Default is the value of --cpu
.
--cpur
specifies specifically the number of threads to use for the RAxML analysis. This option is only relevant if RAxML is used for the tree construction. This option can be used if the number of cores to use for BLASTp and RAxML is different. Default is the value of --cpu
.
--cdsinput
changes the input expectation from PEP to CDS. The file needs to be supplied through the --subject
. This flag can be included at any place in the command and does not need a value. The coding sequences will be translated pior to the analysis.
--keepnames
this flag deactivates the trimming of the names of input sequences at the first space. This flag can be included at any place in the command and does not need a value.
--ath
specifies a FASTA file with Arabidopsis thaliana MYB sequences that are used to construct a final tree with the candidate sequences and these reference MYB sequences.
--name
specifies a string that will be attached as prefix to all file names in the result folder. This option is useful if the result files of different analyses will be combined in one folder in a later step of the project. This name should NOT use any special characters. Just using the numbers 0-9 and the normal characters a-z or A-Z should be save.
--collapse
this flag triggers and additional analysis step. Group of paralogs will be represented by a single (the longest) sequence. This is helpful if a transcriptome assembly is analyzed which might have many isoforms.
--motifs
specifies a file with motif sequences and names. All candidate sequences will be screened for these motifs. Many subgroups are characterized by conserved motifs outside the R2R3 domain. This option allows an automatic analysis of all candidates and summarizes the results in a table.
--mafft
specifies the full path to MAFFT including the binary.
--blastp
specifies the full path to BLASTp including the binary.
--hmmsearch
specifies the full path to hmmsearch including the binary.
--fasttree
specifies the full path to FastTree2. This is required if fasttree is not included in the PATH variable. If in doubt, please specify the full path.
--raxml
specifies the full path to RAxML-NG.
--simcutp
specifies the minimal similarity of BLASTp hits to be considered in the initial identification of MYB candidates. Default: 60.0 (60% similarity).
--poscutp
specifies the maximal number of hits (possibilities) per bait sequence in the BLASTp analysis. Default: 100.
--lencutp
specifies the minimal alignment length of BLASTp hits to be considered in the initial identification of MYB candidates. Default: 75.
--numneighbours
this specifies the number of neighbouring sequences in a phylogenetic tree that will be considered for the classification as MYB or MYB-like, respectively. Default: 10.
--neighbourdist
specifies the factor by which the average branch length is multiplied when deciding about the consideration of neighbouring sequences, because sequences sitting on very long branches are ignored in this step. Increasing this value will enhance the sensitivity of sequence detection at the cost of specificity. Larger values should be tested when looking for completely new MYB clades in phylogenetic lineages that are not well covered by the collection of bait sequences. Default: 3. When searching for novel MYB clades that might not contain a single reference MYB sequence, increasing this value to 10 is recommended.
--minneighbours
specifies the minimal number of neighbours that is required for the classification of sequences. If there are no close neighbours, a sequence might be sitting at an extremely long branch. Therefore, it can be appropriate to exclude sequences without any neighbouring sequences from the set of clean MYBs. Default: 0.
--paralogdist
specifies the maximal distance of paralogs in one clade during the masking/collapsing step. A clade of paralogs can be reduced to one representative sequence. This is particularly helpful when working with transcriptome assemblies that contain an inflated number of alternative transcripts. Default: 10.
WARNING: If errors occur or if parameters are changed, it is necessary to delete all output files to get clean files based on the new parameters. Existing files are usually not overwritten.
This example should help to understand how the script is used. The required MYB baits FASTA file, the MYB info, and the optional reference MYB file (Arabidopsis thaliana MYBs) are included in this repository.
python3 /my_folder/MYB_annotator.py \
--baits /my_folder/data/bait_MYBs.fasta \
--info /my_folder/data/bait_MYBs.txt \
--out /my_folder/test_analysis/ \
--subject /my_folder/data/Croton_tiglium_peptides.fasta \
--mode raxml \
--refmybs /my_folder/data/AthRefMYBs.txt \
--raxml /my_folder/bin/RAxML9/raxml-ng \
While running the script, several temporary files are generated, but there are not important. Please only look at files located in the 'RESULTS' folder. Here is a list of the possible result files. The generation of some files depends on setting specific options. Therefore, it is possible that you do not see all these files in your result folder. If you specified a prefix (--name), all file names with start with that prefix and thus differ in this respect from the file names listed here.
-
00_documentation.txt this file stores the command used to run the analysis and also the versions of the tools used for the individual steps. Information contained in this file should be included in the method section when writing a publication about the results of this analysis.
-
01_initial_candidates.fasta this FASTA file contains all sequences in your subject set that have some similarity to the bait MYB sequences. This selection depends on the setting that you used for the BLASTp search. Usually, this is a very large collection of sequences that have only small similarity to MYBs. Many of these sequences will be discarded in the next step, but this collection is important to achieve high sensitivity. Truncated MYBs could be of high interest and should not be discarded due to short alignments.
-
02a_clean_MYBs.fasta this FASTA file contains all MYB candidates that fall into the clade of the bait MYB sequences i.e. all MYB-like sequences should be removed at this stage. The accuracy might be low at the basis of the MYB clade, but this approach should allow the inclusion of truncated MYBs that are sitting deep in one of the MYB subgroups.
-
02b_in_out_MYB_analysis_results.txt this file provides detailed information about the evidence for inclusion or exclusion, respectively, of each initial MYB candidate. The decision is based on phylogenetic distances between the sequences in your subject set and the bait MYBs supplied. If a candidate is close to many bait sequences that are considered bona fide MYBs, the sequence will be included. If a candidate is located in a clade with MYB-like bait sequences, it will NOT be considered a MYB.
- Score = Ratio of ingroup matches to outgroup matches.
- IngroupMatches = Number of nearest neighbours (only baits considered) in the tree that are ingroup members.
- OutgroupMatches = Number of nearest neighbours (only baits considered) in the tree that are outgroup members.
-
03a_group_around_ref_MYBs.txt this file is a table which assigns all MYB candidates in the subject species to the provides reference sequences. This file is only generated if such reference sequences are provided (--refmybs).
-
03b_new_2_ref_myb_mapping_file.txt this file is table which assigns the best fitting reference MYB (hopefully the ortholog) to each of the candidate MYBs in your subject data set. This file is only generated if such reference sequences are provided (--refmybs).
- EdgeDistance = Number of edges between two leaves (sequences) on the tree.
- PatristicDistance = Cumulative length of all nodes between two leaves (sequences) on the tree.
-
04a_MYB_domain_check.txt this table contains the results of a search for the MYB domains (1R, R23R, 3R, and others) in all of the clean MYB candidates. Regular expressions are used to identify 3R, R2R3, or 1R motifs. If none of them is detected, the candidate is classified as other/pseudo-MYB. Here are the regular expressions used for the three domains:
-
R1 = '\w{3,4}W\w{17,21}W\w{17,21}W\w{5,8}'
-
R2 = '\w{5}[WF]{1}\w{18,21}W\w{15,27}[WY]{1}\w{4}' #F at pos1 and Y at pos3 are very rare
-
R3 = "\w{5}[WLIMF]{1}\w{14,21}W\w{17,21}[WYF]{1}\w{4}" #diversity at pos1 is high, but po2 and pos3 are conserved
These MYB domain patterns are based on Feng et al., 2017, Du et al., 2015, and Pucker et al., 2020. The category 'others' comprises all sequences that lack one of the afore mentioned domains. Genes can be included in this category if the analyzed transcript/peptide isoforms lacks a part of the domain. This can happen if only the 'primary' transcript/peptide sequence is provided.
-
-
04b_motif_check.txt this table contains the results of a search for a set of provided motifs in all of the clean MYB candidates. This file is only generated if a file with motifs was supplied (--motif).
-
04c_MYB_domain_check.fasta this file contains all MYB domains identified in the clean MYB candidates.
-
04d_MYB_domain_check.doc.txt this file quanifies the number of clean MYB candidates containing a certain domain. Statistics collected in this file can be combined into a summary table if more than one subject file is analyzed.
-
05 this tree is generated based on all MYB bait sequences and the identified clean MYB candidates of your subject file. The tool used for the tree construction depends on your settings.
-
06 this tree is generated based on all Arabidopsis thaliana and the identified clean MYB candidates of your subject file. The tool used for the tree construction depends on your settings.
-
07a_repr_MYBs.txt this file is generated if you choose to represent groups of paralogs by only one representative sequence (--collapse).
-
07b_repr_MYBs.fasta this file is generated if you choose to represent groups of paralogs by only one representative sequence (--collapse). The sequences contained in this FASTA file represent one group of paralogs. All singletons are included as well.
-
08a_repr_ath_MYBs.fasta this file contains all representative sequences (see #7) and also the Arabidopsis thaliana MYB sequences. Therefore, this file is only generated if groups of paralogs are collapsed to one representative sequence (--collapse) and the necessary Arabidopsis thaliana MYB sequences are provided (--ath).
-
08b this tree is generated based on all Arabidopsis thaliana and the representative MYB candidate sequences. The tool used for the tree construction depends on your settings. This tree file is only generated if the file '08a' was generated.
-
08d_MYB_domain_check.txt please seee the full description at #04a for details. This file is the result of the same analysis, but restricted to the representative sequences described above.
-
08e_MYB_domain_check.fasta please seee the full description at #04c for details. This file is the result of the same analysis, but restricted to the representative sequences described above.
-
08f_MYB_domain_check.doc.txt please seee the full description at #04d for details. This file is the result of the same analysis, but restricted to the representative sequences described above.
-
4_domain_detection_summary.txt this file is a summary of the MYB domain check results described in #04b across all analyzed species. This file is only generated if more than one species are investigated.
-
8_domain_detection_summary.txt this file is a summary of the MYB domain check results described in #08f across all analyzed species. It is similar to the one described above, but restricted to the representative sequences of paralog groups. This file is only generated if more than one species are investigated.
The MYB bait sequences are intended to cover a broad phylogenetic range. The bait sequence set comprises the MYB doamins of:
Arabidopsis thaliana (Stracke et al., 2001)
Beta vulgaris (Stracke et al., 2014)
Musa acuminata (Pucker et al., 2020)
and a collection by Du et al., 2015: Medicago truncatula, Populus trichocarpa, Citrus sinensis, Vitis vinifera, Solanum lycopersicum, Solanum tuberosum, Aquilegia coerulea, Oryza sativa, Zea mays, Amborella trichopoda, Picea abies, Selaginella moellendorfii, Physcomitrella patens, Chlamydomonas reinhardtii, Volvox carteri, Micromonas pusilla, Ostreococcus lucimarinus, and Cyanidioschyzon merolae
Pucker B (2022). Automatic identification and annotation of MYB gene family members in plants. BMC Genomics 23, 220 (2022). https://doi.org/10.1186/s12864-022-08452-5.
Pucker, B., Reiher, F. and Schilbert, H. M. Automatic identification of players in the flavonoid biosynthesis with application on the biomedicinal plant Croton tiglium. Plants 2020, 9, 1103. doi:10.3390/plants9091103.
Pucker B., Pandey A., Weisshaar B. and Stracke R. (2020). The R2R3-MYB gene family in banana (Musa acuminata): genome-wide identification, classification and expression patterns. PLOS ONE 15(10): e0239275. doi:10.1371/journal.pone.0239275.
Stracke R., Werber M., Weisshaar B. (2001). The R2R3-MYB gene family in Arabidopsis thaliana. Current Opinion in Plant Biology. 2001;4(5):447-456.
Stracke, R., Holtgräwe, D., Schneider, J., Pucker, B., Sörensen, T.R., and Weisshaar, B. (2014). Genome-wide identification and characterisation of R2R3-MYB genes in sugar beet (Beta vulgaris). BMC Plant Biol. 14: 249. doi:10.1186/s12870-014-0249-8.
Sukumaran J, Holder MT. DendroPy: a Python library for phylogenetic computing. Bioinformatics. 2010 Jun 15;26(12):1569-71. doi: 10.1093/bioinformatics/btq228. Epub 2010 Apr 25. PMID: 20421198.
Du, H., Liang, Z., Zhao, S. et al. The Evolutionary History of R2R3-MYB Proteins Across 50 Eukaryotes: New Insights Into Subfamily Classification and Expansion. Sci Rep 5, 11037 (2015). doi:10.1038/srep11037