Enzyme annotation for orphan and novel reactions using knowledge of substrate reactive sites
Recent advances in synthetic biochemistry have resulted in a wealth of novel hypothetical enzymatic reactions that are not matched to protein-encoding genes, deeming them “orphan.” A large number of known metabolic enzymes are also orphan, leaving important gaps in metabolic network maps. Proposing genes for the catalysis of orphan reactions is critical for applications ranging from biotechnology to medicine. In this work, the computational method BridgIT identified potential enzymes of orphan reactions and nearly all theoretically possible biochemical transformations, providing candidate genes to catalyze these reactions to the research community. The BridgIT online tool will allow researchers to fill the knowledge gaps in metabolic networks and will act as a starting point for designing novel enzymes to catalyze nonnatural transformations.
Thousands of biochemical reactions with characterized activities are “orphan,” meaning they cannot be assigned to a specific enzyme, leaving gaps in metabolic pathways. Novel reactions predicted by pathway-generation tools also lack associated sequences, limiting protein engineering applications. Associating orphan and novel reactions with known biochemistry and suggesting enzymes to catalyze them is a daunting problem. We propose the method BridgIT to identify candidate genes and catalyzing proteins for these reactions. This method introduces information about the enzyme binding pocket into reaction-similarity comparisons. BridgIT assesses the similarity of two reactions, one orphan and one well-characterized nonorphan reaction, using their substrate reactive sites, their surrounding structures, and the structures of the generated products to suggest enzymes that catalyze the most-similar nonorphan reactions as candidates for also catalyzing the orphan ones. We performed two large-scale validation studies to test BridgIT predictions against experimental biochemical evidence. For the 234 orphan reactions from the Kyoto Encyclopedia of Genes and Genomes (KEGG) 2011 (a comprehensive enzymatic-reaction database) that became nonorphan in KEGG 2018, BridgIT predicted the exact or a highly related enzyme for 211 of them. Moreover, for 334 of 379 novel reactions in 2014 that were later cataloged in KEGG 2018, BridgIT predicted the exact or highly similar enzymes. BridgIT requires knowledge about only four connecting bonds around the atoms of the reactive sites to correctly annotate proteins for 93% of analyzed enzymatic reactions. Increasing to seven connecting bonds allowed for the accurate identification of a sequence for nearly all known enzymatic reactions.
- reaction similarity
- reactive site recognition
- orphan reactions
- novel (de novo) reactions
- sequence similarity
Genome-scale reconstructions of metabolic networks can be used to correlate the genome with the observed physiology, though this hinges on the completeness and accuracy of the sequenced genome annotations. “Orphan” reactions, which are enzymatic reactions without protein sequences or genes associated with their functionality, are common and can be found in the genome-scale reconstructions of even well-characterized organisms, such as Escherichia coli (1). Recent publications reported that 40 to 50% of the enzymatic reactions cataloged in the Kyoto Encyclopedia of Genes and Genomes (KEGG) (2) lack an associated protein sequence (3, 4).
Problems with orphanlike reactions can also arise in areas such as bioremediation, synthetic biology, and drug discovery, where exploring the potential of biological organisms beyond their natural capabilities has prompted the development of tools that can generate de novo hypothetical enzymatic reactions and pathways (5–15). These de novo reactions are behind many success stories in biotechnology and can be used in the gap filling of metabolic networks (6, 12, 13, 15–18). While these enzymatic reactions have well-explained biochemistry that can conceivably occur in metabolism, they are essentially orphan reactions because they have no assigned enzyme or corresponding gene sequence. The lack of protein-encoding genes associated with the functionality of these de novo reactions limits their applicability for metabolic engineering, synthetic biology applications, and the gap filling of genome-scale models (19). A method for associating de novo reactions to similarly occurring natural enzymatic reactions would allow for the direct experimental implementation of the discovered novel reactions or assist in designing new proteins capable of catalyzing the proposed biotransformation.
Computational methods for identifying candidate genes of orphan reactions have mostly been developed on the basis of protein sequence similarity (3, 20–22). The two predominant classes of these sequence-based methods revolve around gene/genome analysis (22–25) and metabolic information (26, 27). Several bioinformatics methods combine different aspects of these two classes, such as gene clustering, gene coexpression, phylogenetic profiles, protein interaction data, and gene proximity, for assigning genes and protein sequences to orphan reactions (28–31). All these methods use the concept of sequence similarity. Within this concept, homology between two sequences, one orphan and one well characterized, is inferred when the two share more similarity than would be expected by chance (32). Next, the biochemical function is assigned to the orphan protein sequence, assuming that homologous sequences have similar functions. This can be problematic because many known enzymatic activities are still missing an associated gene due to annotation errors, the incompleteness of gene sequences (33), and the fact that homology-based methods cannot annotate orphan protein sequences with no or little sequence similarity to known enzymes (3, 34). Moreover, sequence-similarity methods can provide inaccurate results because small changes in key residues could greatly alter enzyme functionality (35); in addition, it is a common observation that vastly different protein sequences can exhibit the same fold and, therefore, have similar catalytic activity, even though they look very different (36, 37).
These shortcomings motivated the development of alternative computational methods based on the structural similarity of reactants and products for identifying candidate protein sequences for orphan enzymatic reactions (31, 35, 38–42). The idea behind these approaches was to assess the similarity of two enzymatic reactions via the similarity of their reaction fingerprints; that is, the mathematical descriptors of the structural and topological properties of the participating metabolites (43), which could eliminate the problems associated with nonmatching or unassigned protein sequences. In such methods, the reaction fingerprint of an orphan reaction is compared with a set of nonorphan reference reaction fingerprints, and the genes of the most-similar reference reactions are then assigned as promising candidate genes for the orphan reaction. Reaction fingerprints can be generated based on different similarity metrics, such as the bond change, reaction center, or structural similarity (42).
One class of reaction-fingerprint computational methods compares all of the compounds participating in reactions (42), which includes both reactants and cofactors. The application of this group of methods is restricted to specific enzymatic reactions that do not involve large cofactors (31, 35, 38–42). This is because the structural information of the large cofactors overwhelmingly contributes to the corresponding reconstructed reaction fingerprint, and consequently, reactions with similar cofactors will inaccurately be classified as similar (35–38).
Another class of reaction-fingerprint methods uses the chemical structures of reactant pairs for comparison (40). While these methods can be applied to all classes of enzymatic reactions, they neglect the crucial role of cofactors in the reaction mechanism. Moreover, neither of these two classes of methods has been employed for assigning protein sequences to de novo reactions (40).
In this study, we introduce a computational method, BridgIT, that links orphan reactions and de novo reactions predicted by pathway design tools such as BNICE.ch (16), Retropath2 (15), DESHARKY (10), and SimPheny (12) with well-characterized enzymatic reactions and their associated genes. BridgIT uses reaction fingerprints to compare enzymatic reactions and is inspired by the lock-and-key principle that is used in protein docking methods (44), wherein the enzyme binding pocket is the “lock” and the ligand is a “key.” If a molecule has the same reactive sites and a similar surrounding structure as the native substrate of a given enzyme, it is then rational to expect that the enzyme will catalyze the same biotransformation on this molecule. Following this reasoning, BridgIT uses the structural similarity of the reactive sites of participating substrates together with their surrounding structure as a metric for assessing the similarity of enzymatic reactions. It is substrate-reactive-site centric, and its reaction fingerprints reflect the specificities of biochemical reaction mechanisms that arise from the type of enzymes catalyzing those reactions. BridgIT introduces an additional level of specificity into reaction fingerprints by capturing critical information about the enzyme binding pocket. More precisely, BridgIT allows us to capture approximately the 2D structure of the enzyme binding pocket by incorporating the information about sequences of atoms and bonds around the substrate reactive site.
Through several studies, we demonstrated the effectiveness of utilizing the BridgIT fingerprints for mapping novel and orphan reactions to the known biochemistry. These reactions are mapped according to the enzyme commission (EC) (45) number, which is an existing numerical classification scheme for enzyme-based reactions. The EC number can classify enzymes at up to four levels, with a one-level classification being the most general and a four-level classification being the most specific, and these enzyme-based reactions are then represented by four numbers, one for each level, separated by periods (e.g., 220.127.116.11). We show that BridgIT is capable of correctly predicting enzymes with an identical third-level EC number, indicating a nearly identical type of enzymatic reaction, for 90% of orphan reactions from KEGG 2011 that became nonorphan in KEGG 2018. This result validates the consistency of the sequences predicted by BridgIT with the experimental observations, and it further suggests that BridgIT can provide enzyme sequences for catalyzing nearly all orphan reactions. For the remaining 10% of the orphan reactions, an in-depth sequence and structure analysis will be required to guide the sequence search and protein engineering because it is known from the enzyme analysis and classification that although reactions with common EC classification up to the third level have a nearly identical catalytic mechanism, they do not necessarily share the same sequences.
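The notion of two enzymes agreeing "up to the third level" of the EC classification can be made concrete with a small helper. The following is a minimal sketch, not part of BridgIT itself; the function name `shared_ec_levels` and the EC numbers in the usage lines are illustrative assumptions and are not taken from the validation sets.

```python
def shared_ec_levels(ec_a: str, ec_b: str) -> int:
    """Count how many leading EC levels (0 to 4) two EC numbers share.

    Hypothetical helper for illustration only. A "-" placeholder
    (an unspecified level) is treated as a non-match.
    """
    shared = 0
    for part_a, part_b in zip(ec_a.split("."), ec_b.split(".")):
        if part_a != part_b or part_a == "-":
            break
        shared += 1
    return shared


# Illustrative EC numbers: same third-level class, different fourth level
same_third_level = shared_ec_levels("1.1.1.1", "1.1.1.2")
# Different first-level class: no shared levels at all
no_match = shared_ec_levels("1.1.1.1", "4.1.1.1")
```

Under this convention, a prediction counted as correct at the third level corresponds to `shared_ec_levels(predicted, reported) >= 3`.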
We also studied how the size of the BridgIT fingerprint impacts the BridgIT predictions. We show that BridgIT correctly identifies protein sequences using fingerprints that describe the neighborhood up to seven bonds away from the atoms of the reactive site. Strikingly, we also find that it is sufficient to use the information of only four bonds around the atoms of the reactive sites of substrates to accurately identify protein sequences for 93% of the analyzed reactions.
Lastly, to indicate the utility of this computational technique, we applied BridgIT to the study of all of the 137,000 novel reactions from the ATLAS of Biochemistry, a database of all of the known and hypothetically possible biochemical reactions that connect two or more KEGG compounds (in version KEGG 2015) (46). Using our technology, we provide candidate enzymes that can potentially catalyze the biotransformation of these reactions to the research community, which should provide a basis for the engineering and development of novel enzyme-catalyzed biotransformations.
Results and Discussion
The BridgIT workflow, together with an example of its application on an orphan reaction, is demonstrated in Fig. 1. BridgIT is organized into four main steps (see Methods for more details): reactive site identification; reaction fingerprint construction; reaction similarity evaluation; and scoring, ranking, and gene assignment. The inputs of the workflow are (i) an orphan or a novel reaction and (ii) the collection of BNICE.ch generalized enzyme reaction rules. These reaction rules assemble biochemical knowledge distilled from the biochemical reaction databases and are used to discover de novo enzymatic reactions as well as predict all possible pathways from known compounds to target molecules (16, 46, 47). Here, we used the generalized enzyme reaction rules to extract information about the reactive sites of substrates participating in an orphan or a novel reaction, and then integrated this information into the BridgIT reaction fingerprints (Fig. 1, steps 1 and 2). We next compared the obtained BridgIT reaction fingerprints to the ones from the reference reaction database on the basis of the Tanimoto similarity scores (Fig. 1, step 3). A Tanimoto score near 0 designates reactions with no or low similarity, whereas a score near 1 designates reactions with high similarity. We used these scores to rank the assigned reactions from the reference reaction database and then identified the enzymes associated with the highest-ranked reference reactions as candidates for catalyzing the analyzed orphan or novel reaction (Fig. 1, step 4). In the following sections, we discuss the reconstructions and testing of the various components of BridgIT as well as the results of our main analyses. A web tool of BridgIT can be consulted at lcsb-databases.epfl.ch/pathways/Bridgit.
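Steps 3 and 4 of the workflow (similarity evaluation, then scoring and ranking) can be sketched in a few lines. This is a simplified illustration under stated assumptions: fingerprints are modeled as plain sets of fragment strings, whereas BridgIT fingerprints are layered and reactive-site centric; the reaction identifiers, fragment strings, and enzyme names below are hypothetical.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Set-based Tanimoto score: shared fragments over all fragments."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


def rank_candidates(query_fp, reference, top_n=3):
    """Rank reference reactions by similarity to the query fingerprint.

    `reference` maps a reaction id to (fingerprint, enzymes). Ties are
    broken by reaction id so that the ranking is deterministic.
    """
    scored = [
        (tanimoto(query_fp, fp), rxn_id, enzymes)
        for rxn_id, (fp, enzymes) in reference.items()
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_n]


# Hypothetical reference database and orphan-reaction fingerprint
reference = {
    "ref_rxn_1": ({"C-O", "C=O", "O-H"}, ["enzyme_A"]),
    "ref_rxn_2": ({"C-N", "N-H"}, ["enzyme_B"]),
    "ref_rxn_3": ({"C-O", "C-C", "C-N"}, ["enzyme_C"]),
}
query = {"C-O", "C=O", "O-H", "C-C"}
ranking = rank_candidates(query, reference)
```

The enzymes attached to the top-ranked entries of `ranking` play the role of the candidate enzymes proposed for the orphan or novel reaction.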
Reference Reaction Database.
The BridgIT reference reaction database is an essential component of the BridgIT workflow (Fig. 1). It consists of well-characterized reactions with associated genes and protein sequences and was built based on the KEGG 2016 reaction database (see Methods). The KEGG database is the most comprehensive database of enzymatic reactions and provides information about biochemical reactions together with their corresponding enzymes and genes. However, half of KEGG reactions lack associated genes and protein sequences and are thus considered to be orphan reactions. The BridgIT reference database was built using the KEGG reactions that (i) can be reconstructed by the existing BNICE.ch generalized reaction rules and are elementally balanced (5,270 reactions) and (ii) are nonorphan (5,049 reactions). This restriction removes reactions that lack characterized substrate reactive sites, meaning that they cannot be used in our comparisons. As a result, the reference reaction database contains information for 5,049 of 9,556 KEGG reactions (Dataset S1, Table S1).
Sensitivity Analysis of the BridgIT Fingerprint Size.
The defining characteristic of the BridgIT reaction fingerprint is that it is centered around the reactive site of the reaction substrate(s). The number of description layers in the BridgIT fingerprint—the fingerprint size—defines how large a chemical structure around the reactive site we consider when evaluating the similarity (see Methods). To investigate to what extent the fingerprint size affects the similarity results, we performed a sensitivity analysis in which we varied the fingerprint size between 0 and 10.
For this analysis, we considered the 5,049 nonorphan KEGG reactions that existed in the BridgIT reference reaction database. We started by forming reaction fingerprints that contained only the description layer 0 (fingerprint size 0) and evaluated how many of 5,049 nonorphan reactions BridgIT could correctly identify. That is, we evaluated whether the BridgIT algorithm with these reaction fingerprints could map each of these reactions to itself. We next formed the reaction fingerprints using only the description layers 0 and 1 (fingerprint size 1), and we performed the evaluation again. We repeated this procedure until the final step, in which we formed the reaction fingerprints with description layers 0 through 10 (fingerprint size 10).
As expected, the increase in the fingerprint size (i.e., specificity) led to a decrease in the average number of similar reactions assigned to the studied reactions. Moreover, the more description layers that were incorporated into the BridgIT fingerprint, the more accurately BridgIT matched the analyzed reactions (Table 1). For the fingerprint size 7, BridgIT correctly mapped 100% of the analyzed reactions; that is, each of the 5,049 nonorphan reactions was matched to itself in the reference reaction database. This indicated that the information about chains of eight atoms along with their connecting bonds around the reactive sites was sufficient for BridgIT to correctly match all nonorphan KEGG reactions, and we chose the fingerprint size 7 for our further studies.
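The sensitivity analysis can be illustrated with a toy version of the self-mapping test. The sketch below assumes that a reaction counts as correctly matched when no other reference reaction has an identical fingerprint at the chosen size; this criterion, the function names, and the three toy reactions are assumptions made for illustration and do not reproduce the actual BridgIT scoring.

```python
from collections import Counter


def truncate(layers: dict, size: int) -> frozenset:
    """Keep description layers 0..size as (layer, fragment) pairs."""
    return frozenset(
        (k, frag) for k in range(size + 1) for frag in layers.get(k, ())
    )


def self_match_rate(reactions: dict, size: int) -> float:
    """Fraction of reactions whose truncated fingerprint is unique."""
    truncated = {rid: truncate(layers, size) for rid, layers in reactions.items()}
    counts = Counter(truncated.values())
    unique = sum(1 for fp in truncated.values() if counts[fp] == 1)
    return unique / len(reactions)


# Hypothetical reactions: rxn_a and rxn_b differ only in layer 2
reactions = {
    "rxn_a": {0: {"C", "O"}, 1: {"C-O"}, 2: {"C-C-O"}},
    "rxn_b": {0: {"C", "O"}, 1: {"C-O"}, 2: {"N-C-O"}},
    "rxn_c": {0: {"C", "N"}, 1: {"C-N"}, 2: {"C-C-N"}},
}
rates = {size: self_match_rate(reactions, size) for size in range(3)}
```

In this toy set, only one of three reactions is distinguishable at sizes 0 and 1, and all three become distinguishable at size 2, mirroring the trend that larger fingerprints are more specific.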
BridgIT Reaction Fingerprints Offer Improved Predictions.
To evaluate BridgIT's performance against existing approaches in this field (40, 42, 48), we performed two comparative studies. In the first study, we repeated the analysis from the previous section using the standard reaction difference fingerprint (Methods and SI Appendix, Fig. S4), which is used and discussed in detail in structure similarity methods such as RxnSim (38) and RxnFinder (39), to assess the benefits of introducing the information about the reactive site of substrates into the reaction fingerprints. A comparison of the two sets of predictions on 5,049 nonorphan reactions showed that the predictions obtained with BridgIT-modified fingerprints were significantly better than those obtained with the standard ones. BridgIT identified 100% of nonorphan reactions correctly versus the 71% success rate for the standard fingerprint method (Dataset S1, Table S4). Furthermore, BridgIT correctly matched 93% of the analyzed enzymatic reactions using the information about only four connecting bonds around the atoms of the reactive sites (fingerprint size 4) (Table 1), which exceeds the 71% of matched reactions when using the standard reaction fingerprints (fingerprint size 7).
The inferior performance of the standard reaction fingerprint method arose from three main sources. First, fragments from the substrate and product sets were cancelled out upon algebraic summation inside the fingerprint description layers (see Methods), in which description layers 0 and 1 define the single atoms and the connected pairs of atoms of the reactive site, and layers 2 to 7 include information about the chemical structure around the reactive site that contains up to eight atoms and seven bonds (Fig. 1). This cancellation occurred in all description layers (fingerprint size 7) for 246 nonorphan reactions—that is, their standard fingerprints were empty (Dataset S1, Table S3). As an example, Fig. 2A shows the standard reaction fingerprint of KEGG reaction R00722 that was empty for the standard fingerprint method. The information about reactive sites introduced in the BridgIT reaction fingerprints prevents such cancellations, since BridgIT does not include the atoms of the reactive site(s) in the process of the algebraic summation of the substrate and product set fragments (see Methods). As a result, BridgIT mapped R00722 to itself and identified R00330 as the most similar reaction to R00722 (Fig. 2A). Indeed, according to the KEGG database, the enzyme 18.104.22.168 catalyzes both reactions.
Second, the performance of the standard reaction fingerprint suffered because the first description layer of the standard fingerprint was empty for an additional 1,129 reactions, which indicated that these fingerprints did not represent the bond changes during the reaction (Dataset S1, Table S4).
Third, the remaining 89 mismatched nonorphan reactions had partial cancellations in the fingerprint description layers. For example, the standard fingerprint method incorrectly identified R03132 as the most similar to R00691, whereas BridgIT identified R00691 and R01373 as the most similar to R00691 (Fig. 2B), which matches the KEGG reports indicating that both R00691 and R01373 can be catalyzed by either EC 22.214.171.124 or EC 126.96.36.199.
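The complete cancellation described as the first source of error can be reproduced with fragment counts. The following sketch assumes a difference fingerprint formed as product counts minus substrate counts; the fragment strings are hypothetical, and `site_aware_fingerprint` is only a simplified stand-in for the idea of excluding reactive-site atoms from the summation, not BridgIT's actual encoding.

```python
from collections import Counter


def difference_fingerprint(substrate_frags, product_frags):
    """Algebraic sum of fragment counts: products minus substrates."""
    diff = Counter(product_frags)
    diff.subtract(Counter(substrate_frags))
    # Fragments whose counts cancel to zero vanish from the fingerprint
    return {frag: n for frag, n in diff.items() if n != 0}


def site_aware_fingerprint(substrate_frags, product_frags, site_frags):
    """Keep reactive-site fragments out of the summation (simplified)."""
    site = set(site_frags)
    protected = Counter(f for f in substrate_frags if f in site)
    surroundings = difference_fingerprint(
        [f for f in substrate_frags if f not in site],
        [f for f in product_frags if f not in site],
    )
    return {"site": dict(protected), "surroundings": surroundings}


# Hypothetical group-transfer reaction: both sides contain the same
# fragments, so the plain difference fingerprint is empty.
substrates = ["P-O", "P-O", "C-O", "C-N"]
products = ["P-O", "P-O", "C-O", "C-N"]
standard = difference_fingerprint(substrates, products)
site_aware = site_aware_fingerprint(substrates, products, ["P-O"])
```

Here `standard` is empty and therefore carries no information for a similarity search, whereas `site_aware` still records the reactive-site fragments.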
In the second study, we compared the performance of the BridgIT method against three state-of-the-art methods, namely EC-BLAST (42), Selenzyme (48), and E-zyme2 (40), on three benchmark problems. The first two benchmark problems consisted of identifying the most-similar reactions to two example reactions, each representing a class of reactions that appear ubiquitously in biochemical networks. We chose R00722 (Fig. 2) to exemplify the first class of reactions characterized by a very similar structure of substrates and products, and chose R07500 to represent the class of multisubstrate multiproduct reactions (SI Appendix, Tables S1 and S2). The third benchmark problem represented the intramolecular transferases (EC 5.4.4) that catalyze the transfer of a hydroxyl group to another part of a molecule. Similar to other isomerases, in this class of reactions, the substrate and the product have the same chemical formula but different bond connectivity. We chose R09708 to exemplify this class of reactions (SI Appendix, Table S3).
For the three benchmark reactions, we ranked the similar reactions proposed by each of the methods according to the corresponding similarity scores, and the top 100 similar reactions proposed by each method were used for comparisons.
The most similar reaction proposed by BridgIT correctly matched the fourth-level EC number (188.8.131.52) of the first benchmark reaction R00722 (SI Appendix, Table S1). Three of four EC-BLAST variants (42) proposed a set of the reactions with the maximal similarity score (SI Appendix, Table S1). This set contained not only reactions that correctly matched the fourth-level EC number of R00722, but also reactions with EC numbers not even matching the first-level EC number of the benchmark reaction (SI Appendix, Table S1). The three variants of Selenzyme (48) proposed reactions that could match only the third-level EC number of R00722, whereas E-zyme2 (40) was unable to find a matching reaction due to very similar structures in the substrate–product pairs (SI Appendix, Table S1).
In the second benchmark, none of the investigated methods could propose reactions that match the EC number of R07500 (184.108.40.206) up to the fourth level, although all methods could match the third-level EC number for this reaction (SI Appendix, Table S2). BridgIT proposed 39 similar reactions matching the third-level EC numbers of R07500, whereas the EC-BLAST variant with structural similarity proposed 45 similar reactions, Selenzyme proposed 10, E-zyme2 proposed 9, and the three other EC-BLAST variants proposed 5 to 7 (SI Appendix, Table S2). In addition, we performed receiver operating characteristic (ROC) analysis on the sets of proposed similar reactions, and of all the compared methods, BridgIT had the highest area under the ROC curve (AUC) index of 0.95, meaning that it had the best performance among the compared methods for this class of reactions (SI Appendix, Table S2).
In the third benchmark, BridgIT was the only method that could match the third-level EC number of R09708 (220.127.116.11). It proposed linalool isomerase (18.104.22.168) to catalyze this reaction, and remarkably, it was reported in the literature that this enzyme could catalyze stereospecific isomerization of (3S)-linalool to geraniol (49). Other methods proposed catalyzing enzymes from EC class 4 (lyases) (SI Appendix, Table S3). Moreover, only BridgIT and the EC-BLAST variants with structural similarity and bond change could capture the structural changes in R09708, whereas E-zyme2, Selenzyme, and the two remaining variants of EC-BLAST could not map this reaction to itself (SI Appendix, Table S3).
Together, the results of these comparisons demonstrate the potential of BridgIT to outperform the currently available methods for enzyme annotation.
From Reaction Chemistry to Detailed Enzyme Mechanisms.
Approximately 15% of KEGG reactions (1,532 reactions) are assigned to more than one enzyme and EC number; that is, multiple enzymes can catalyze a specific biotransformation through different enzymatic mechanisms. For example, KEGG reaction R00217 is assigned to three different EC numbers, 22.214.171.124 (oxaloacetate carboxy-lyase) and 126.96.36.199 and 188.8.131.52 (both malate dehydrogenases), and the corresponding reactions involve different mechanisms (Fig. 3). The reaction mechanism of the 184.108.40.206 enzyme is well understood, as it belongs to the carboxy-lyases in which a carbon–carbon bond is broken and a molecule of CO2 is released. This enzyme can decarboxylate three different compounds: glutaconyl-CoA, methylmalonyl-CoA, and oxaloacetate (from this example). The overlapping reactive site of these three compounds is captured in the 4.1.1B rule of BNICE.ch (Fig. 3C). In contrast, the 220.127.116.11 enzyme found in bacteria and insects and the 18.104.22.168 enzyme found in fungi, animals, and plants are rather specific enzymes that decarboxylate oxaloacetate and malate with two different mechanisms. The decarboxylation is performed with (in the case of malate) or without (in the case of oxaloacetate) the incorporation of NAD+ as a cofactor. The only difference in the structure of these two molecules is in having either a ketone or an alcohol group on the second carbon. Consequently, the structure of the reactive site that these enzymes recognize has to reflect the difference between malate and oxaloacetate, and this is well captured in the 1.1.1A rule of BNICE.ch.
The 4.1.1B rule requires a less specific reactive site compared with the 1.1.1A rule, and these two rules have two different reaction fingerprints for catalyzing the same reaction R00217 because they describe different mechanisms for the same reaction.
Moreover, for 42% of the KEGG reactions that have a single enzyme assigned to them, BNICE.ch identified multiple alternative reactive sites and created multiple reaction fingerprints that describe the biotransformation of these reactions. Therefore, a single reaction from KEGG was translated into more than one fingerprint in the BridgIT reference database. This way, by preserving the information about enzyme binding pockets, the reconstructed BridgIT reference reaction database expands from 5,049 reactions to 17,657 reaction fingerprints corresponding to 17,657 detailed reaction mechanisms.
Currently, BridgIT is the only method that can distinguish different reaction mechanisms for the reactions catalyzed by different enzymes. As a consequence, BridgIT can propose distinct sets of protein sequences corresponding to distinct mechanisms and rank them according to the BridgIT score. The protein sequences can then be prioritized based on the BridgIT ranking, enzyme specificity, and the host organism.
Comparison of BridgIT and BLAST Predictions.
As a means to relate reaction structural similarity obtained using BridgIT with reaction sequence similarity obtained using BLAST (50), we applied these two techniques in parallel on a subset of reactions and their corresponding protein sequences from the reference reaction database. We compared the similarity results of BridgIT with those of BLAST and statistically assessed BridgIT performance using ROC curve analysis (SI Appendix, Figs. S1 and S2).
We chose E. coli BW29521 (EBW) as our benchmark organism for this analysis. There were 531 nonorphan reactions in EBW associated with 413 protein sequences. In total, there were 731 reaction–gene associations (Dataset S1, Table S2), as there were reactions with more than one associated gene and genes associated with more than one reaction. We removed all the nonorphan reactions of EBW from the BridgIT reference database and removed their associated protein sequences from the KEGG protein sequence database (Dataset S1, Table S2). We then used BridgIT to assess the structural similarity of the 531 EBW reactions to the BridgIT reference reactions using the Tanimoto score, and we applied BLAST to quantify the similarity of the 413 EBW protein sequences to the protein sequences of reactions from the BridgIT reference database using e-values. The concept of the validation procedure is illustrated in SI Appendix, Fig. S1. We provide a list of BridgIT reaction–reaction comparisons together with BLAST sequence–sequence comparisons (Dataset S1, Table S2).
Comparing Reaction (BridgIT) and Sequence (BLAST) Similarity Scores.
We considered two sequences to be similar if BLAST reported an e-value of less than 10⁻¹⁰ for their alignment. For a chosen discrimination threshold (DT) of the global Tanimoto score (TG), we considered the BridgIT prediction of similarity between an EBW reaction and a BridgIT reference reaction with a TG score as (i) true positive (TP) if TG > DT and their associated sequence(s) were similar (e-value < 10⁻¹⁰); (ii) true negative (TN) if not similar for both BridgIT (TG < DT) and BLAST+ (e-value > 10⁻¹⁰); (iii) false positive (FP) if similar for BridgIT (TG > DT) but not similar for BLAST+ (e-value > 10⁻¹⁰); and (iv) false negative (FN) if not similar for BridgIT (TG < DT) but similar for BLAST+ (e-value < 10⁻¹⁰).
We then counted the number of TPs, TNs, FPs, and FNs for all 531 reactions and summed these quantities to obtain the total number of TPs, TNs, FPs, and FNs per chosen DT. We repeated this procedure for a set of DT values varying across the interval between 0 and 1. Lastly, we used the total number of TPs, TNs, FPs, and FNs to compute the TP and FP rates for the ROC curve analysis (SI Appendix, Fig. S2A). The ROC curve indicated that the reaction comparison based on reaction structural similarity (BridgIT) was comparable to the one based on reaction sequence similarity (BLAST). Indeed, the obtained AUC score for the BridgIT classifier was 0.91, indicating that the similarities between the two methods were very high (SI Appendix, Fig. S2A). We next studied whether the type of compared reactions affected the accuracy of BridgIT predictions by categorizing reactions according to their first-level EC class, which indicates the broadest category of enzyme functionality, and then performing the ROC analysis for each class separately (SI Appendix, Fig. S2A). The analysis revealed that BridgIT performed well with all major enzyme classes, as represented by the high AUC scores ranging from 0.88 (EC 1) to 0.96 (EC 5).
We next analyzed the accuracy of BridgIT classification as a function of the DT of the Tanimoto score (SI Appendix, Fig. S2B). The accuracy ranged from 43% for a DT value of 0.01 to 85% for a DT value of 0.30. For DT values >0.30, the accuracy monotonically decreased toward 62% for a DT value of 1. The classifier was overly conservative for DT values >0.30 and was rejecting TPs (SI Appendix, Fig. S2B). More specifically, for a DT value of 0.30, the TP percentage was 38%, whereas for a DT value of 1, it was reduced to 3%. In contrast, the TN percentage increased only slightly over the same range, from 46% for a DT value of 0.30 to 57% for a DT value of 1 (SI Appendix, Fig. S2B). Based on this analysis, we chose 0.30 as an optimal DT value for further studies.
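The TP/TN/FP/FN bookkeeping and the derived ROC and accuracy values can be sketched as follows. This is an illustrative reconstruction of the counting procedure, not the authors' code: the function names and the six (TG, e-value) pairs are hypothetical, and because the text does not say how exact ties (TG equal to DT, or an e-value equal to the cutoff) are handled, they are treated here as "not similar".

```python
from collections import Counter


def classify(tg, e_value, dt, e_cutoff=1e-10):
    """Label one reaction pair using BLAST similarity as the reference."""
    bridgit_similar = tg > dt            # assumption: ties count as not similar
    blast_similar = e_value < e_cutoff   # assumption: ties count as not similar
    if bridgit_similar and blast_similar:
        return "TP"
    if not bridgit_similar and not blast_similar:
        return "TN"
    return "FP" if bridgit_similar else "FN"


def roc_point(pairs, dt):
    """Return (TP rate, FP rate, accuracy) for one discrimination threshold."""
    counts = Counter(classify(tg, e, dt) for tg, e in pairs)
    tp, tn, fp, fn = (counts[k] for k in ("TP", "TN", "FP", "FN"))
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return tpr, fpr, accuracy


# Hypothetical (global Tanimoto score, BLAST e-value) pairs
pairs = [(0.9, 1e-40), (0.5, 1e-20), (0.2, 1e-15),
         (0.4, 1.0), (0.1, 0.5), (0.05, 2.0)]
tpr, fpr, accuracy = roc_point(pairs, dt=0.30)
# Sweeping DT over [0, 1] traces out the points of the ROC curve
curve = [roc_point(pairs, dt / 10)[:2] for dt in range(11)]
```

Evaluating `roc_point` over a grid of DT values, as in the last line, gives the points from which an ROC curve and an accuracy-versus-DT profile can be drawn.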
A sensitivity analysis of BridgIT results to the variations in the e-value threshold ranging from 10⁻¹⁰ to 10⁻⁵⁰ is provided in SI Appendix, Fig. S3.
BridgIT Analysis of Known Reactions with Common Enzymes.
The 5,049 reactions in the reference database were catalyzed by only 2,983 enzymes; that is, there were promiscuous enzymes that catalyzed more than one reaction. Of the 2,983 enzymes, 844 were promiscuous, catalyzing 2,432 of the reactions (Dataset S1, Table S5). Interestingly, BridgIT correctly assigned more than 80% of these 2,432 reactions to their corresponding promiscuous enzyme. An example of such a group is given in Table 2. This table shows the same enzymes listed across the top and down the left side, with the corresponding Tanimoto scores indicating the accuracy of BridgIT’s classifications. The overall high scores in this table indicate the accuracy of the enzyme assignments.
We investigated the remaining 20% of reactions in depth, and we observed that the Tanimoto scores of the first two description layers (see Methods) indicated a very low similarity between the reactions catalyzed by the same enzyme. This result suggested that such enzymes were either multifunctional (i.e., they had more than one reactive site) (Fig. 4) or were incorrectly classified in the EC classification system.
BridgIT Validation Against Biochemical Assays.
To assess BridgIT’s performance using biochemically confirmed reactions, we performed two validation studies on sets of orphan (study I) and novel (study II) reactions. Since the known reactions in KEGG are all experimentally confirmed using biochemical assays, we could use this pooled experimental data from hundreds of laboratories to demonstrate BridgIT’s ability to identify potential enzymes for catalyzing the biologically relevant orphan reactions on a large scale.
We compared the number of orphan reactions in two versions of the KEGG reaction database: KEGG 2011 and KEGG 2018. We found that 234 orphan reactions from KEGG 2011 were later associated with enzymes in KEGG 2018, meaning that they became nonorphan reactions (Dataset S1, Tables S6–S8). Since these newly classified reactions have been experimentally confirmed, we used these 234 reactions as a benchmark to evaluate BridgIT’s performance.
We formed the reference reaction database using the reactions from KEGG 2011 (see Methods), and we compared the BridgIT results with the KEGG 2018 enzyme assignments up to the third EC level. Remarkably, BridgIT- and KEGG 2018-assigned enzymes matched to the third EC level for 211 of 234 (90%) reactions (Dataset S1, Tables S6 and S7). This means that BridgIT accurately predicted the enzyme mechanism for enzymes that have been biochemically confirmed to catalyze a large majority of the orphan reactions in 2011. In addition, the set of protein sequences proposed by BridgIT comprised highly related protein sequences to the ones assigned to these enzymes in KEGG 2018.
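Comparing two enzyme assignments "up to the third EC level" means comparing only the first three dot-separated fields of their EC numbers. A minimal sketch of that check follows; the function name is hypothetical, and treating partially specified numbers (fields written as "-") as nonmatching is an assumption made here for simplicity.

```python
def ec_match(ec_a: str, ec_b: str, level: int = 3) -> bool:
    """True if two EC numbers agree on their first `level` fields."""
    a = ec_a.strip().split(".")
    b = ec_b.strip().split(".")
    if len(a) < level or len(b) < level:
        return False
    # Assumption: an unspecified field ("-") within the compared prefix never matches.
    if "-" in a[:level] or "-" in b[:level]:
        return False
    return a[:level] == b[:level]


same_mechanism = ec_match("1.1.1.1", "1.1.1.2")        # agree to the third level
same_enzyme = ec_match("1.1.1.1", "1.1.1.2", level=4)  # differ at the fourth level
```

A third-level match indicates the same class of mechanism, whereas a fourth-level match additionally requires the same substrate-specific entry.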
The 234 reactions are catalyzed by 168 enzymes with specified fourth-level EC numbers in KEGG 2018. However, only 29 of these 168 enzymes were cataloged in KEGG 2011, and the remaining 139 enzymes had new fourth-level EC classes assigned in KEGG 2018, meaning that BridgIT had access to only the 29 enzymes that were classified in KEGG 2011 from which the reference reaction database was built. The 29 enzymes catalyzed 35 of the 234 studied reactions. For 29 of these 35 (83%) orphan reactions, the BridgIT algorithm predicted the same sequences that KEGG 2018 assigned to these reactions (Dataset S1, Table S9). A higher matching score when comparing up to the third EC level rather than the fourth EC level is likely because BridgIT uses BNICE.ch generalized reaction rules, which describe the biotransformations of reactions with specificities up to the third EC level.
The ATLAS of Biochemistry (46) provides a comprehensive catalog of theoretically possible biotransformations between KEGG compounds and can be mined for novel biosynthetic routes for a wide range of applications in metabolic engineering, synthetic biology, drug target identification, and bioremediation (40). We studied the 379 reactions from the ATLAS of Biochemistry that were novel in KEGG 2014 and were later experimentally identified and cataloged in KEGG 2018.
We formed the reference reaction database using the reactions from KEGG 2014 and applied BridgIT to these 379 reactions. For 334 of these 379 reactions, BridgIT proposed similar known reactions with a Tanimoto score higher than 0.30, thus providing promising protein sequences for enzymes catalyzing these reactions (Dataset S1, Table S10). For 14 of these novel reactions, BridgIT assigned the same sequences that were assigned in KEGG 2018 (Dataset S1, Table S11). An example of such a reaction is rat132341, which was a novel reaction in 2014 and later cataloged as R10392 in KEGG 2018 (Fig. 5A). The BridgIT analysis of this reaction revealed that R03444 is the structurally closest reaction to this novel one, suggesting that protein sequences of the enzyme that catalyzes R03444 can also catalyze this novel reaction. This was later confirmed by experimental biochemical evidence, as R10392 is associated with the same EC number as R03444 in KEGG 2018. There are 243 available protein sequences for this enzyme, and one sequence already has a confirmed protein structure (Fig. 5C). Therefore, BridgIT results were validated using experimental biochemical evidence on a large scale.
BridgIT Predictions for KEGG 2018 Orphan Reactions.
We applied BridgIT to the 810 orphan KEGG 2018 reactions that could be reconstructed using the BNICE.ch generalized reaction rules. The remaining 1,646 orphan reactions could not be reconstructed because they are not balanced or they lack the structure for at least one of their substrates. Remarkably, BridgIT identified corresponding reference reactions with Tanimoto scores higher than the optimal threshold value of 0.30 for 97% of the orphan reactions. The remaining 3% of orphan reactions had a low similarity with the reference reactions. A large number of the orphan reactions originate from the pathways toward plant and microbial natural products that frequently involve complex and less-investigated classes of enzymes such as polyketide synthases (PKSs), nonribosomal peptide synthetases (NRPSs), terpene cyclases (TCs), and cytochromes P450 (CYPs). Interestingly, BridgIT mapped 112 of 810 orphan reactions back to these families: It predicted that 72 orphan reactions can be catalyzed by CYPs, 33 by PKSs, 6 by NRPSs, and 1 by TC (Dataset S1, Tables S12–S15).
This result and the fact that BridgIT correctly mapped 100% of nonorphan KEGG reactions suggest that as our knowledge of biochemistry expands, the annotation of novel and orphan reactions using tools such as BridgIT will also improve.
BridgIT Predictions for ATLAS Novel Reactions.
We further utilized BridgIT to identify candidate enzymes for all the 137,000 de novo, orphanlike ATLAS reactions. These candidate enzymes can be used directly in systems biology designs if the matched enzymes perform the desired catalysis or if their amino acid sequences can be optimized through protein engineering to achieve the desired results. We found that 7% of novel ATLAS reactions were matched to known KEGG reactions with a Tanimoto score of 1 (perfect match), while 88% were similar to KEGG reactions with a Tanimoto score higher than the optimal threshold value of 0.30. Therefore, BridgIT could identify promising enzyme sequences for catalyzing 95% of novel ATLAS reactions. The remaining 5% of these reactions were not similar to any of the well-characterized known enzymatic reactions.
Finding well-characterized reactions that are similar to novel ones is crucial for evolutionary protein engineering as well as computational protein design, and methods like BridgIT can be instrumental in moving from a concept to the experimental implementation of de novo reactions. Additionally, to facilitate the experimental implementation of novel ATLAS reactions in metabolic engineering, in systems and synthetic biology, and in bioremediation studies, we can use the BridgIT similarity scores as confidence measures for evaluating the feasibility.
The results of the BridgIT analysis of the KEGG 2018 orphan and novel ATLAS reactions are available on the website lcsb-databases.epfl.ch/atlas/.
In BridgIT, the Tanimoto score is used to quantify the similarity of reaction fingerprints. BridgIT allows us to do the following: (i) compare a given novel or orphan reaction to a set of reactions that have associated sequences, subsequently referred to as the reference reactions; (ii) rank the identified similar reactions based on the computed Tanimoto scores; and (iii) propose the sequences of the highest ranked reference reactions as possible candidates for encoding the enzyme of the given de novo or orphan reaction.
Reactive Site Identification.
An enzymatic reaction occurs when its substrate(s) fits into the binding site of an enzyme. Since the structure and geometry of the binding sites of enzymes are complex and most of the time not fully characterized, we proposed focusing on the similarity of the reactive sites of their substrates. Following this, we used the expert-curated, generalized reaction rules of BNICE.ch to identify the reactive sites of substrates. These reaction rules have third-level EC identifiers (e.g., EC 1.1.1) and encompass the following biochemical knowledge of enzymatic reactions: (i) information about atoms of the substrate’s reactive site; (ii) information about connectivity (atom-bond-atom); and (iii) exact information of bond breakage and formation during the reaction. As of July 2017, BNICE.ch contains 361 bidirectional generalized reaction rules that can reconstruct 6,528 KEGG reactions (46).
Given a novel or orphan reaction, the reactive sites of its substrate(s) are identified in three steps. In the first step, the BNICE.ch generalized reaction rules that can be applied to groups of atoms from the analyzed substrates are identified, and then the information about the identified rules and the corresponding groups of atoms is stored. Subsequently, these groups of atoms are referred to as the candidate substrate reactive sites. In the second step, among the identified rules, only the ones that can recognize the connectivity between the atoms of the candidate substrate reactive sites are kept. In the third step, whether the biotransformation of a substrate(s) to a product(s) can be explained by the rules retained after the second step is tested. The candidate reactive sites corresponding to the rules that have passed the three-step test are validated and used for the construction of reaction fingerprints.
We illustrate this procedure on an orphan reaction, R02763, which converts 3-carboxy-2-hydroxymuconate semialdehyde (substrate A) to 2-hydroxymuconate semialdehyde and carbon dioxide (Fig. 1). In the first step, 210 of the 361 rules could be applied to groups of atoms of substrate A (Fig. 1, step 1.a). Of the 210 rules, 168 matched the connectivity (Fig. 1, step 1.b). Lastly, the 168 reaction rules were applied to substrate A for bond breaking and formation comparisons, and one rule could explain the transformation of substrate A to the products (Fig. 1, step 1.c).
Reaction Fingerprint Construction.
Molecular fingerprints, which are the linear representations of the structures of molecules, have been used in many methods and for different applications, especially for structural comparison of compounds (51, 52). One of the most commonly used molecular fingerprints is the Daylight fingerprint (51), which decomposes a molecule into eight layers starting from layer zero that accounts only for atoms. Layer 1 expands one bond away from all the atoms and accounts for atom-bond-atom connections. This procedure is continued until layer 7, which includes seven connected bonds from each atom. There are two types of Daylight reaction fingerprints: (i) structural reaction fingerprints, which are simple combinations of reactant and product fingerprints, and (ii) reaction difference fingerprints, which are the algebraic summation of reactant and product fingerprints multiplied by their stoichiometry coefficients in the reaction. In this study, we propose a modified version of the reaction difference fingerprint. The procedure for formulating BridgIT reaction fingerprints is demonstrated through an example reaction (Fig. 1, step 2).
Starting from the atoms of the identified substrate reactive site, eight description layers of the molecule were formed, where different layers consisted of fragments with different lengths. Fragments were composed of atoms connected through unbranched sequences of bonds. Depending on the number of bonds included in the fragments, different description layers of a molecule were formed as follows:
Layer 0: Describes the type of each atom of the reactive site together with its count. For example, the substrate of the example reaction at layer 0 was described as three oxygen and five carbon atoms (Fig. 1, step 2.a).
Layer 1: Describes the type and count of each bond between pairs of atoms in the reactive site. In the example, the substrate at layer 1 was described with seven fragments of length 1: one C–O, three C–C, two C=O, and one C=C bond (Fig. 1, step 2.a). Fragments are shown by their simplified molecular-input line-entry system (SMILES) molecular representation (53). To convert SMILES to canonical SMILES, we used the Open Babel C++ library (52).
Layer 2: Describes the type and count of fragments with three connected atoms. While layers 0 and 1 described the atoms of reactive sites, starting from layer 2, atoms that were outside the reactive site were also described. In the illustrated example, there were six different fragments of this type (Fig. 1, step 2.a).
The same procedure was used to describe the molecules up to layer 7. Interestingly, and consistent with the previously reported result (43), we found that the seven-layer description was good enough to capture the structure of most of the metabolites in biochemical reactions, therefore providing a precise reaction fingerprint. Note that not all description layers are needed to describe less complex molecules. For example, product C (carbon dioxide) was fully described using only layer 0 and layer 1 (Fig. 1, step 2.a). For very large molecules, the description layers that contain fragments with more than eight connected atoms can be used.
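The layer descriptions above can be sketched as an enumeration of unbranched paths on a molecular graph. The code below is an illustration under several assumptions and is not the BridgIT implementation: the molecule is given as plain dictionaries of heavy atoms and bonds with integer atom ids, bond symbols are written explicitly as "-" and "=", a fragment and its reverse are treated as the same fragment by keeping the lexicographically smaller string (a stand-in for canonical SMILES), and layers 2 and above start at a reactive-site atom but may extend outside the site.

```python
from collections import Counter


def layer_fragments(atoms, bonds, site, k):
    """Count the unbranched fragments with k bonds that start in the reactive site.

    atoms: {atom_id: element}, bonds: {(i, j): "-" or "="}, site: set of atom ids.
    All three inputs are hypothetical data structures chosen for this sketch.
    """
    if k == 0:
        return Counter(atoms[a] for a in site)

    symbol = {frozenset(pair): s for pair, s in bonds.items()}
    neighbours = {a: [] for a in atoms}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    def label(path):
        text = atoms[path[0]]
        for u, v in zip(path, path[1:]):
            text += symbol[frozenset((u, v))] + atoms[v]
        return text

    paths = set()

    def walk(path):
        if len(path) == k + 1:
            # Layer 1 only describes bonds between two reactive-site atoms.
            if k == 1 and path[-1] not in site:
                return
            paths.add(min(path, path[::-1]))  # store each path once, not per direction
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:
                walk(path + (nxt,))

    for start in site:
        walk((start,))
    return Counter(min(label(p), label(p[::-1])) for p in paths)


# Toy molecule: heavy atoms of acetaldehyde, C-C=O, with the carbonyl as the site.
toy_atoms = {0: "C", 1: "C", 2: "O"}
toy_bonds = {(0, 1): "-", (1, 2): "="}
toy_site = {1, 2}
toy_layers = [layer_fragments(toy_atoms, toy_bonds, toy_site, k) for k in range(4)]
```

For this toy input, layer 2 reaches the methyl carbon outside the site, and layer 3 is empty because the molecule has no path with three bonds, which mirrors the remark that small molecules do not need all description layers.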
For each layer, the substrate set was formed by merging all the fragments and their type and count in the substrate molecules of the reaction, and the product set was formed by merging all the fragments (type and count) in the product molecules of the reaction. In both sets, the count of each fragment was multiplied by the stoichiometric coefficients of the corresponding compound in the reaction. Lastly, the reaction fingerprints were created by summing the fragments of the substrate and product sets for each layer (Fig. 1, step 2.b).
Introducing the specificity of reactive sites into the reaction fingerprint allows BridgIT to capitalize on the information about enzyme binding pockets (16). To keep this valuable information throughout the generation of reaction fingerprints, BridgIT does not consider the atoms of the reactive site(s) when performing the algebraic summation of the substrate and product set fragments. Consequently, the BridgIT algorithm enables retaining, tracking, and emphasizing the information of the reactive site(s) in all the layers of the reaction fingerprint, which distinguishes it from the existing methods.
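One possible reading of this construction, for a single layer, is sketched below. It is an interpretation rather than the published algorithm: fragments are weighted by stoichiometric coefficients, the non-reactive-site fragments of substrates and products are combined by algebraic summation (substrate counts minus product counts, with cancelled fragments dropped), and the reactive-site fragments are kept apart so that they never cancel. The input layout and the function names are hypothetical.

```python
from collections import Counter


def weighted_sum(compounds):
    """Merge (coefficient, fragment counts) pairs into one stoichiometry-weighted Counter."""
    total = Counter()
    for coeff, fragments in compounds:
        for frag, n in fragments.items():
            total[frag] += coeff * n
    return total


def layer_fingerprint(substrates, products):
    """Build one layer of a reaction fingerprint.

    substrates: list of (coeff, site_fragments, other_fragments)
    products:   list of (coeff, fragments)
    """
    site = weighted_sum((c, s) for c, s, _ in substrates)
    other = weighted_sum((c, o) for c, _, o in substrates)
    prod = weighted_sum(products)
    # Algebraic summation over non-site fragments; entries that cancel are dropped.
    diff = {f: other[f] - prod[f] for f in set(other) | set(prod) if other[f] != prod[f]}
    # Reactive-site fragments are retained as they are (assumption of this sketch).
    return {"site": dict(site), "diff": diff}


# Toy layer-1 data for a decarboxylation-like reaction; illustrative only.
toy_substrates = [(1, Counter({"C-C": 1}), Counter({"C=O": 2, "C-O": 1}))]
toy_products = [(1, Counter({"C-O": 1})), (1, Counter({"C=O": 2, "O=C=O": 1}))]
toy_fp = layer_fingerprint(toy_substrates, toy_products)
```

Keeping the site fragments in their own slot is a design choice made here to make the "retained, not cancelled" behavior explicit; a plain difference fingerprint would fold them into the same subtraction as every other fragment.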
Reaction Similarity Evaluation.
The similarity of two reactions was quantified using the similarity score between their fingerprints, subsequently referred to as reaction fingerprints A and B. In this study, the Tanimoto score, which is an extended version of the Jaccard coefficient and cosine similarity, was used (54). Values of the Tanimoto scores near 0 indicate reactions with no or negligible similarity, whereas values near 1 indicate reactions with high similarity.
The Tanimoto score for each descriptive layer, TLk, together with the global Tanimoto score, TG, was calculated. The Tanimoto score for the k-th descriptive layer was defined as
\[ T_{L_k} = \frac{c_k}{a_k + b_k - c_k}, \]
where ak was the count of the fragments in the k-th layer of reaction fingerprint A; bk was the count of the fragments in the k-th layer of reaction fingerprint B; and ck was the number of common k-th layer fragments of reaction fingerprints A and B. Two fragments were equal if their canonical SMILES and their stoichiometric coefficients were identical. The global Tanimoto similarity score, TG, was defined as follows:
\[ T_G = \frac{\sum_{k=0}^{7} c_k}{\sum_{k=0}^{7} a_k + \sum_{k=0}^{7} b_k - \sum_{k=0}^{7} c_k}. \]
For each reaction fingerprint, its Tanimoto similarity score was calculated against the reaction fingerprints from the BridgIT reference database, which contained reaction fingerprints of all known, well-characterized enzymatic reactions (Fig. 1, step 3).
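The scoring step can be sketched with each fingerprint layer held as a dictionary from fragment string to stoichiometric coefficient. Two assumptions are made here and are not confirmed by the text: the fragment "count" of a layer is taken as its number of distinct entries, and the global score pools the counts over all layers before forming the ratio. Returning 0.0 when both layers are empty is a convention chosen for this sketch.

```python
def layer_tanimoto(layer_a, layer_b):
    """Tanimoto score c / (a + b - c) for one descriptive layer."""
    a, b = len(layer_a), len(layer_b)
    # A fragment is common only if both its string and its coefficient agree.
    c = sum(1 for frag, coeff in layer_a.items() if layer_b.get(frag) == coeff)
    union = a + b - c
    return c / union if union else 0.0


def global_tanimoto(fp_a, fp_b):
    """Tanimoto score with a, b and c pooled over all layers (assumed form)."""
    a = sum(len(layer) for layer in fp_a)
    b = sum(len(layer) for layer in fp_b)
    c = sum(
        1
        for layer_a, layer_b in zip(fp_a, fp_b)
        for frag, coeff in layer_a.items()
        if layer_b.get(frag) == coeff
    )
    union = a + b - c
    return c / union if union else 0.0


# Toy two-layer fingerprints; illustrative only.
toy_a = [{"C": 5, "O": 3}, {"C-C": 3, "C=O": 2}]
toy_b = [{"C": 5, "O": 2}, {"C-C": 3, "C=O": 2}]
toy_layer0 = layer_tanimoto(toy_a[0], toy_b[0])
toy_global = global_tanimoto(toy_a, toy_b)
```

In the toy case the oxygen counts differ, so only one of the two layer-0 entries is common, while layer 1 is identical in both fingerprints.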
Sorting, Ranking, and Gene Assignment.
For a given input reaction, the reference reactions were ranked using the computed TG scores. The algorithm distinguished between the identified reference reactions with the same TG score based on the TL score of layers 0 and 1 and allowed the user to assign ranking weights to specified layers. The protein sequences associated with the highest ranked (i.e., the most similar) reference reactions were then assigned to the input reaction (Fig. 1, step 4).
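The ranking step can be sketched as a sort on the global score with the layer 0 and layer 1 scores as tie-breakers. The record type, the reaction identifiers, and the decision to filter at the 0.30 threshold before ranking are assumptions of this sketch; the user-defined layer weights mentioned above are not modeled.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    reaction_id: str         # hypothetical identifier of a reference reaction
    tg: float                # global Tanimoto score
    tl: Tuple[float, ...]    # per-layer Tanimoto scores, tl[0] and tl[1] used for ties


def rank_hits(hits: List[Hit], threshold: float = 0.30) -> List[Hit]:
    """Keep hits scoring above the threshold and rank them, best first."""
    kept = [h for h in hits if h.tg > threshold]
    return sorted(kept, key=lambda h: (h.tg, h.tl[0], h.tl[1]), reverse=True)


# Toy hits; illustrative only.
toy_hits = [
    Hit("ref_A", 0.80, (1.0, 0.5)),
    Hit("ref_B", 0.80, (1.0, 0.9)),
    Hit("ref_C", 0.20, (0.3, 0.1)),
    Hit("ref_D", 0.95, (1.0, 1.0)),
]
toy_ranking = [h.reaction_id for h in rank_hits(toy_hits)]
```

Breaking ties on the first two layers favors reference reactions whose reactive sites match most closely, since those layers describe only the reactive-site atoms and bonds.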
We developed the computational tool BridgIT to evaluate and quantify the structural similarity of biochemical reactions by exploiting the biochemical knowledge of BNICE.ch generalized reaction rules. Because the generalized reaction rules can identify reactive sites of substrates, BridgIT can translate the structural definition of biochemical reactions into a type of reaction fingerprint that explicitly describes the atoms of the substrates’ reactive sites and their surrounding structure. Through the analysis of 5,049 known and well-defined biochemical reactions, we found that knowledge of the neighborhood up to seven bonds away from the atoms of the reactive site can predict biochemistry and match catalytic protein sequences. The reaction fingerprints proposed in this work can be used to compare all novel and orphan reactions to well-characterized reference reactions and, consequently, to link them with genes, genomes, and organisms. We demonstrated through several examples the improvements that the BridgIT fingerprint brings to the field compared with the fingerprints currently existing in the literature.
A drawback of traditional sequence-similarity methods is that they cannot identify protein sequence candidates for de novo reactions, which we have shown BridgIT can do.
We tested BridgIT predictions against experimental biochemical evidence in two large-scale validation studies on sets of 234 orphan and 379 de novo reactions. The reactions from these two sets were unknown in the previous versions of the KEGG database but were later experimentally confirmed and cataloged in KEGG 2018. BridgIT predicted the exact or a highly related enzyme for 89% of these reactions.
We further applied BridgIT to the entire catalog of de novo reactions of the ATLAS of Biochemistry database and proposed several candidate enzymes for each of them. The candidate enzymes for these de novo reactions can either be immediately capable of catalyzing these reactions or serve as initial sequences for enzyme engineering. The obtained BridgIT similarity scores can also be used as a confidence score to assess the feasibility of the implementation of novel ATLAS reactions in metabolic engineering and systems biology studies.
The applications of BridgIT go beyond merely bridging gaps in metabolic reconstructions: This method can be used to identify the potential utility of existing enzymes for bioremediation as well as for various applications in synthetic biology and metabolic engineering. As the field of metabolic engineering grows and metabolic engineering applications increasingly turn toward the production of valuable industrial chemicals such as 1,4-butanediol (55, 56), we expect that methods for the design of de novo synthetic pathways, such as BNICE.ch (16), and methods for identifying candidate enzymes for de novo reactions, such as BridgIT, will grow in importance.
¹N.H. and H.M. contributed equally to this work.
²Present address: Institute for Environmental Sciences, Group of Environmental Physical Chemistry, Department F.-A. Forel for Environmental and Aquatic Sciences, University of Geneva, CH-1211 Geneva, Switzerland.
³To whom correspondence should be addressed.
Author contributions: N.H., H.M., L.M., M.S., and V.H. designed research; N.H., H.M., and M.S. performed research; N.H., H.M., L.M., M.S., and V.H. analyzed data; and N.H., H.M., L.M., M.S., and V.H. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Data deposition: The results of the BridgIT analysis of the KEGG 2018 orphan and novel ATLAS reactions are available at lcsb-databases.epfl.ch/atlas/.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1818877116/-/DCSupplemental.
Copyright © 2019 the Author(s). Published by PNAS.
This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).
- Orth JD, et al.
- Delépine B, Duigou T, Carbonell P, Faulon J-L
- Hadadi N, et al.
- Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO
- Chen Y, Mao F, Li G, Xu Y
- Pearson WR
- Galperin MY, Koonin EV
- Moriya Y, et al.
- Hu QN, et al.
- Delany J (2011) Daylight Theory Manual, Version 4.9 (DAYLIGHT Chemical Information Systems Inc., Mission Viejo, CA).
- Rogers DJ, Tanimoto TT
- International Union of Biochemistry and Molecular Biology
- Hadadi N, Hafner J, Soh KC, Hatzimanikatis V
- Carbonell P, et al.
- Marmulla R, Šafarić B, Markert S, Schweder T, Harder J
- Briem H, Lessel UF
- Burgard A, Burk MJ, Osterhout R, Van Dien S, Yim H
- Andreozzi S, et al.