Evolutionarily informed deep learning methods for predicting relative transcript abundance from DNA sequence
Machine learning methodologies can be applied readily to biological problems, but standard training and testing methods are not designed to control for evolutionary relatedness or other biological phenomena. In this article, we propose, implement, and test two methods to control for and utilize evolutionary relatedness within a predictive deep learning framework. The methods are tested and applied within the context of predicting mRNA expression levels from whole-genome DNA sequence data and are applicable across biological organisms. Potential use cases for the methods include plant and animal breeding, disease research, gene editing, and others.
Deep learning methodologies have revolutionized prediction in many fields and show potential to do the same in molecular biology and genetics. However, applying these methods in their current forms ignores evolutionary dependencies within biological systems and can result in false positives and spurious conclusions. We developed two approaches that account for evolutionary relatedness in machine learning models: (i) gene-family–guided splitting and (ii) ortholog contrasts. The first approach accounts for evolution by constraining model training and testing sets to include different gene families. The second approach uses evolutionarily informed comparisons between orthologous genes to both control for and leverage evolutionary divergence during the training process. The two approaches were explored and validated within the context of mRNA expression level prediction and achieved area under the ROC curve (auROC) values ranging from 0.75 to 0.94. Model weight inspections showed biologically interpretable patterns, resulting in the hypothesis that the 3′ UTR is more important for fine-tuning mRNA abundance levels while the 5′ UTR is more important for large-scale changes.
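The two approaches can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`family_guided_split`, `ortholog_contrasts`), the input dictionaries, and the `test_fraction` and `seed` parameters are hypothetical. The abstract does not specify how an ortholog contrast is encoded, so the sketch assumes a binary label indicating which gene of an orthologous pair is more abundant, with ties skipped.

```python
import random
from collections import defaultdict


def family_guided_split(gene_to_family, test_fraction=0.2, seed=0):
    """Split genes so that no gene family appears in both train and test.

    gene_to_family: dict mapping gene ID -> gene family ID (hypothetical input).
    Returns (train_genes, test_genes).
    """
    # Group genes by family so that each family moves as one unit.
    families = defaultdict(list)
    for gene, family in gene_to_family.items():
        families[family].append(gene)

    # Shuffle the families, not the genes, with a fixed seed.
    family_ids = sorted(families)
    random.Random(seed).shuffle(family_ids)

    n_total = len(gene_to_family)
    train, test = [], []
    for family in family_ids:
        # Fill the test set family by family until it reaches the target
        # size; every remaining family goes to the training set.
        target = test if len(test) < test_fraction * n_total else train
        target.extend(families[family])
    return train, test


def ortholog_contrasts(ortholog_pairs, expression):
    """Build pairwise training examples from orthologous genes.

    ortholog_pairs: iterable of (gene_a, gene_b) tuples (hypothetical input).
    expression: dict mapping gene ID -> measured mRNA abundance.
    Returns a list of (gene_a, gene_b, label), where label is 1 if gene_a
    is more abundant than gene_b and 0 otherwise. Ties are skipped.
    This binary encoding is an assumption, not taken from the abstract.
    """
    examples = []
    for gene_a, gene_b in ortholog_pairs:
        if expression[gene_a] == expression[gene_b]:
            continue  # a tie carries no contrast signal
        label = 1 if expression[gene_a] > expression[gene_b] else 0
        examples.append((gene_a, gene_b, label))
    return examples
```

Splitting at the family level rather than the gene level keeps closely related paralogs from landing on both sides of the split, which is the kind of leakage that a purely random split would permit.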
1 J.D.W. and H.W. contributed equally to this work.
2 To whom correspondence may be addressed.
Author contributions: J.D.W., M.K.M.-G., G.R., K.A.K., E.S.B., and H.W. designed research; J.D.W., K.A.K., E.S.B., and H.W. performed research; J.D.W. and H.W. contributed new analytic tools; J.D.W., R.V., and H.W. analyzed data; and J.D.W. and H.W. wrote the paper.
Reviewers: K.M.B., ETH Zürich; Z.B.L., Cold Spring Harbor Laboratory; and R.M., Corteva Agriscience.
The authors declare no conflict of interest.
Data deposition: The data reported in this paper have been deposited in the National Center for Biotechnology Information Sequence Read Archive database (accession no. PRJNA503076) and the Bitbucket repository (https://bitbucket.org/bucklerlab/p_strength_prediction/).
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1814551116/-/DCSupplemental.
Published under the PNAS license.