New machine learning method detects animal coronaviruses that could infect humans

In a recent study posted to the bioRxiv* preprint server, researchers used machine learning (ML) tools to identify animal coronaviruses (CoVs), both alpha and beta CoVs, that were not previously known to infect humans.

Study: Using machine learning to detect potentially infectious coronaviruses for humans.  Image Credit: MAVV/Shutterstock


It remains difficult to predict which animal CoVs might infect humans because their full host range is unknown. For example, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) originated in an animal host, most likely bats, and spread to humans after a host expansion event, an essential step in viral evolution. It is therefore critical to screen all alpha and beta CoVs that infect animals living close to humans (e.g., farm animals, such as pigs), because that proximity facilitates zoonotic transmission.

Both alignment-based and alignment-free approaches have shown promise in addressing the problem of viral host prediction. However, the former become inefficient as sequence length increases, while the latter do not take into account the relative position of amino acid (AA) residues in the sequence.

About the study

In the present study, researchers developed a novel machine learning model to predict binding between the spike (S) protein of alpha and beta CoVs and a human receptor, such as human dipeptidyl peptidase 4 (hDPP4) or angiotensin-converting enzyme 2 (ACE2).

To this end, they first downloaded 28,368 S protein sequences of all alpha and beta CoVs from the National Center for Biotechnology Information Virus database. They used a skip-gram model to convert these sequences into vectors encoding the association between adjacent subsequences of length k, called k-mers. Next, a classifier used these vectors to score each protein sequence on its potential to bind a human receptor, termed the human binding potential (h-BiP).
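The k-mer step can be illustrated with a minimal Python sketch. It only shows how a sequence is cut into overlapping k-mers and how (target, context) pairs for a skip-gram model are formed; it does not train an embedding. The values of k and the context window, the function names, and the toy sequence are assumptions for illustration, since the article does not report the settings the authors used.

```python
from typing import List, Tuple


def kmers(sequence: str, k: int = 3) -> List[str]:
    """Return all overlapping k-mers of an amino acid sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """Pair each k-mer (target) with its neighbours inside the window (context)."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


# Toy amino acid fragment, used only for illustration.
toy_sequence = "MFVFLVLL"
tokens = kmers(toy_sequence, k=3)          # k=3 is an assumed value
pairs = skipgram_pairs(tokens, window=1)   # window=1 is an assumed value

print(tokens)
print(pairs[:3])
```

A skip-gram model would then be trained on such pairs so that k-mers appearing in similar contexts receive similar vectors, which is how positional context enters the representation.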

The final alpha and beta CoV dataset, covering all their clades and variants, had 2,534 AA sequences, of which 1,705 and 829 were annotated positive and negative for human binding, respectively. The researchers split these 2,534 AA sequences into training (85%) and testing (15%) sets.
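As a rough illustration of the 85/15 split, the sketch below divides the reported class counts into training and test indices. It assumes a stratified split (each class is split separately) and a fixed random seed; the article does not say whether the authors stratified, so both choices, and the function name, are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], test_fraction: float = 0.15, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split indices into train/test so each class keeps roughly the same ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


# Class counts reported in the article: 1,705 positive and 829 negative.
labels = [1] * 1705 + [0] * 829
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
```

With these counts, the sketch yields 2,154 training and 380 test sequences; a non-stratified split would give similar totals but could shift the class balance between the two sets.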

Additionally, researchers used a subset of 424 sequences to generate a phylogenetic tree for the S protein of alpha and beta CoVs. The team used the initial receptor binding domain (RBD) structures of LYRa3 and LYRa11, generated using AlphaFold, for molecular dynamics (MD) simulations. The YASARA MD package helped simulate protein-protein interactions by substituting single AA residues and searching for minimum-energy conformations of the final modified candidate structures. The team also ran an energy minimization (EM) routine on all modified candidate structures until the free energy stabilized to within 50 J/mol.

Due to the high accuracy of the classifier, the h-BiP score correlated with percent sequence identity to human viruses. The team calculated the pairwise percent sequence identity between each S protein sequence in the study dataset and all seven human CoVs, then selected the maximum for each sequence. Notably, all viruses with ≥97% identity to previously known human CoVs had an h-BiP score >0.5.

The h-BiP score also detected binding in cases of low sequence identity and discriminated between the binding potential of viruses with nearly the same sequence identity.
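The maximum-identity comparison can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes the sequences are already aligned to equal length with "-" marking gaps, and it defines percent identity over columns where neither sequence has a gap, which is one of several definitions in use. The sequences and reference names are toy placeholders.

```python
from typing import Dict, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent of gap-free aligned columns in which both residues match."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


def max_identity_to_human(query: str, human_refs: Dict[str, str]) -> Tuple[str, float]:
    """Return the closest human reference and its percent identity to the query."""
    best = max(human_refs, key=lambda name: percent_identity(query, human_refs[name]))
    return best, percent_identity(query, human_refs[best])


# Toy aligned sequences; names and residues are hypothetical.
human_refs = {
    "human_ref_A": "ACDEFGHIKV",
    "human_ref_B": "ACDQFGHAKL",
}
best_match = max_identity_to_human("ACDEFGHIKL", human_refs)
print(best_match)
```

Plotting this maximum identity against the h-BiP score for each virus is what allows cases to stand out where the score is high even though identity to known human CoVs is low.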

Results and conclusions

The researchers identified LYRa3 and Bt133, two viruses whose human-binding properties are still unknown but which had high h-BiP scores. In support, phylogenetic analysis revealed that these two viruses were related to non-human CoVs previously known to bind to human receptors. The receptor binding motif (RBM) within the receptor binding domain (RBD) of the S protein comes into direct contact with the host receptor. Multiple sequence alignment of the RBMs of Bt133 and LYRa3 with those of related viruses showed that they conserve contact residues that interact with human receptors.

For example, Bt133 retained all eight contact residues used by Tylonycteris bat CoV HKU4 (Ty-HKU4) to bind hDPP4, despite having 13 RBD mutations. Similarly, LYRa3, phylogenetically related to SARS-CoV Tor2, conserved 12 of the 17 hACE2-binding contact residues. Furthermore, with the exception of residue 441, it had identical sequences to the RBD. MD simulations of the RBD further validated this binding and identified contact residues that bind human receptors.
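Counting conserved contact residues from an alignment, as in the "12 of 17" comparison, can be sketched in a few lines. The sequences and contact positions below are toy values, not the actual RBM residues or numbering from the study, and the function name is hypothetical.

```python
from typing import List


def conserved_contacts(
    reference: str, candidate: str, contact_positions: List[int]
) -> List[int]:
    """Return the contact positions (0-based alignment columns) where both residues match."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must be aligned to the same length")
    return [p for p in contact_positions if reference[p] == candidate[p]]


# Toy aligned motifs and hypothetical contact positions.
reference_rbm = "ACDEFGHIKL"
candidate_rbm = "ACNEFGHIRL"
contacts = [1, 2, 4, 8]

kept = conserved_contacts(reference_rbm, candidate_rbm, contacts)
print(f"{len(kept)} of {len(contacts)} contact residues conserved")
```

A candidate can carry many mutations elsewhere in the RBD and still keep every contact position, which is the pattern the article describes for Bt133.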

Finally, the researchers tested whether the model could anticipate host expansion events. They emulated conditions before the emergence of SARS-CoV-2 by removing all SARS-CoV-2 S protein sequences from the training set. The retrained ML model successfully predicted that the wild-type SARS-CoV-2 S protein binds a human receptor, with an h-BiP score of 0.96. Overall, the proposed ML-based method could prove to be a valuable tool for detecting, from a large pool of animal CoVs, which viruses might cross the species barrier to infect humans.

*Important Notice

bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.
