Evolutionary History of SARS-CoV-2 Virus
Dr. Sujan Mamidi
HudsonAlpha Institute for Biotechnology, USA
SARS-CoV-2 infection was first detected in December 2019. The virus is a single-stranded RNA virus (~30,000 nucleotides) that has infected about 52 million people worldwide and caused 1.28 million deaths. More than 48,000 complete viral genome sequences are available to the research community through an effort led by the GISAID consortium. These samples were collected in more than 100 countries. The data help scientists and epidemiologists understand the genome diversity, evolution, and spread of the virus, identify suitable targets for drug repositioning, and develop prevention strategies. A deep understanding of the evolutionary history helps us create remedies and put new public policy in place. Such studies are also needed to understand the differences in fatality rates across countries, which may be due in part to differences in the composition of the circulating viral populations.
Comparison of the genomes revealed a very low mutation rate of about 25 mutations per genome per year. This low mutation rate, together with the low quality of some sequences and the lack of an outgroup, makes it difficult to reconstruct the evolutionary history of the virus. Despite these difficulties, two groups of scientists have tried to work out the spatiotemporal distribution of the virus and its evolutionary history.
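To put the quoted rate in perspective, it can be converted to a per-site rate. The sketch below is illustrative arithmetic only: the genome length of 29,903 nucleotides is an assumption (the length of the Wuhan-Hu-1 reference genome, NC_045512.2), not a figure from the studies discussed here, and the function name is made up for this example.

```python
# Illustrative arithmetic only, not the method used in the cited studies.
# ASSUMPTION: genome length taken from the Wuhan-Hu-1 reference (NC_045512.2).
GENOME_LENGTH_NT = 29_903
MUTATIONS_PER_GENOME_PER_YEAR = 25  # figure quoted in the text


def per_site_rate(mutations_per_year: float, genome_length: int) -> float:
    """Convert a per-genome yearly mutation count into a per-site yearly rate."""
    return mutations_per_year / genome_length


rate = per_site_rate(MUTATIONS_PER_GENOME_PER_YEAR, GENOME_LENGTH_NT)
days_between_mutations = 365 / MUTATIONS_PER_GENOME_PER_YEAR

print(f"{rate:.2e} substitutions per site per year")
print(f"about one new mutation per lineage every {days_between_mutations:.1f} days")
```

Under these assumptions the rate is a little under one substitution per thousand sites per year, so two genomes sampled a few weeks apart may differ at only one or two positions. That is why so little phylogenetic signal is available.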
Despite many efforts, the scientific community has not been able to identify the first case, termed “patient zero”. Identifying this case would help explain how the virus may have jumped from its animal host to humans. Using novel phylogenetic approaches, Sudhir Kumar and his team [1] at Temple University reconstructed the progenitor genome (proCoV2) from about 30,000 genomes. They built a trail of mutations that traces back to the progenitor. The progenitor genome differs from the genome of a closely related coronavirus, RaTG13, which is generally found in the bat Rhinolophus affinis, by 170 non-synonymous substitutions (mutations that change an amino acid in a protein) and 958 synonymous substitutions (mutations that do not change an amino acid). This corresponds to about 96% sequence similarity between proCoV2 and RaTG13.
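The 96% figure can be roughly checked from the substitution counts. The following is a back-of-the-envelope sketch, not the calculation performed in the study: it assumes a genome length of 29,903 nucleotides (the Wuhan-Hu-1 reference, an assumption not stated in the text) and divides by the whole genome, even though synonymous and non-synonymous substitutions are defined only for protein-coding regions.

```python
# Rough consistency check of the similarity figure quoted in the text.
# ASSUMPTION: genome length of the Wuhan-Hu-1 reference (NC_045512.2);
# using the full genome as the denominator is a simplification.
GENOME_LENGTH_NT = 29_903
NONSYNONYMOUS = 170  # from the text
SYNONYMOUS = 958     # from the text


def percent_identity(differences: int, length: int) -> float:
    """Percentage of sites that are identical, given a count of differing sites."""
    return 100.0 * (1.0 - differences / length)


total_differences = NONSYNONYMOUS + SYNONYMOUS
identity = percent_identity(total_differences, GENOME_LENGTH_NT)

print(f"{total_differences} substitutions -> {identity:.1f}% identity")
```

With these inputs the result is close to 96%, in line with the similarity reported above.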
A subset of 49 mutations that occurred at a frequency of 1% or more was examined further to establish patterns of global spread. The progenitor genome most closely matches genomes sampled 12 days after the first sampled virus. Given the low mutation rate, multiple matches were found throughout the world, including genomes sequenced in Europe in April 2020. About 120 genomes that are 100% identical to proCoV2 at the protein level were analyzed in detail. Of these, 80 were sampled in China and other Asian countries. This suggests that proCoV2 already possessed the full repertoire of protein sequences necessary for infection and spread in the human population, i.e. it did not need any further mutation to become infectious. The initial descendants of proCoV2 were in China. The team also identified six mutations in samples isolated in December 2019, suggesting that the virus existed several weeks before the December 2019 cases.
Distribution of the virus:
Seven major evolutionary lineages arose after the pandemic began [2], some of them in Europe and North America. Even though the virus originated in China, sub-strains that arose in the Middle East and Europe are now re-infecting the Asian population to a much greater extent. This has led to higher sequence similarity between strains in Asia and Europe. North American coronaviruses have also been found to harbor genome signatures very different from those prevalent in Europe and Asia. The clade classification is based on a set of specific mutation sites in the viral genome. The G and GR clades are prevalent in Europe (United Kingdom, Portugal), while S and GH are observed in the Americas and some parts of Europe (Denmark, France). The “L” clade is mostly present in Asia, where the virus originated. Currently, the G clade and its derivatives, GH and GR, are the most common clades among the sequenced viral genomes, accounting for 74% of all sequences worldwide.
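The idea of classifying genomes by a small set of marker mutations can be sketched as follows. This is a simplified illustration, not the procedure used by GISAID or by the study cited above. The marker mutations (S:D614G, N:G204R, NS3:Q57H, NS8:L84S) are those commonly reported for these clades and should be treated as assumptions to check against the current GISAID definitions. Only the five clades named in the text are modelled, and the sample data are invented for the example.

```python
from collections import Counter
from typing import Iterable, Set

# ASSUMPTION: amino acid marker mutations commonly reported for GISAID clades.
# Verify against the current GISAID definitions before any real use.
G_MARKER = "S:D614G"
GR_MARKER = "N:G204R"
GH_MARKER = "NS3:Q57H"
S_MARKER = "NS8:L84S"


def assign_clade(mutations: Set[str]) -> str:
    """Return a clade label based on which marker mutations a genome carries."""
    if G_MARKER in mutations:
        if GR_MARKER in mutations:
            return "GR"
        if GH_MARKER in mutations:
            return "GH"
        return "G"
    if S_MARKER in mutations:
        return "S"
    # No marker present: treated here as the original "L" clade.
    return "L"


def g_family_share(genomes: Iterable[Set[str]]) -> float:
    """Fraction of genomes that fall in G or its derivatives GH and GR."""
    counts = Counter(assign_clade(m) for m in genomes)
    total = sum(counts.values())
    return (counts["G"] + counts["GH"] + counts["GR"]) / total


# Hypothetical toy data: each set lists the marker mutations found in one genome.
toy_genomes = [
    {"S:D614G"},
    {"S:D614G", "NS3:Q57H"},
    {"S:D614G", "N:G204R"},
    {"NS8:L84S"},
    set(),
]
print([assign_clade(m) for m in toy_genomes])
print(f"G-family share: {g_family_share(toy_genomes):.0%}")
```

Applied to real sequence metadata, a tally of this kind is how a summary figure such as the 74% share of G-derived clades would be obtained.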
The original L clade emerged in China in December 2019, followed by the appearance of the G clade in Europe in January 2020. G and the G-derived clades (GR and GH) then reached North America and Asia in March 2020 and are currently the fastest-growing viral strains worldwide.
The study of these mutations aids the development of new antiviral therapies and the adaptation of current therapies to the changing molecular structure of the virus. For example, protein-based and RNA-based vaccines that target the Spike protein should take into account all of its observed diversity. Constant monitoring of mutations is also important for tracking the movement of the virus between individuals and across countries.