Dr. Sujan Mamidi
HudsonAlpha Institute for Biotechnology
The DNA of an organism consists of many encoded instructions that are used by cellular proteins and RNA machinery to enable the diverse functions of living cells and tissues. In simple terms, genes are expressed only at certain time points and in certain tissues; not every gene is transcribed in every cell of a complex organism such as a human. This regulation is encoded by multiple functional elements in the genome. The draft human genome sequence was released in 2001. It covered more than 90 percent of the genome and gave the exact order of DNA’s four chemical bases (commonly abbreviated as A, T, C and G) along the human chromosomes; the genome has about 3.1 billion bases and about 20,000 coding genes. To fully understand the functional elements in the assembled human genome, a consortium of scientists initiated ENCODE (The Encyclopedia of DNA Elements) in 2003.
However, the coding genes added up to only about 30 megabases (about 1% of the genome), and the rest of the genome was then considered “junk DNA”. Yet many experiments suggested that variation outside genes is responsible for trait variation. The first phase of ENCODE (released in 2007) used several microarray-based studies to identify regions associated with transcription, histone modifications and open chromatin in a pre-specified 1% of the human genome. Phase II (released in 2012) interrogated the whole human genome and transcriptome using next-generation sequencing technologies such as RNA-seq (which profiles the transcriptome, the gene expression patterns captured in RNA), ChIP-seq (which combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins) and other sequence-based approaches.
In Phase II, assays from Phase I were extended to many more cell lines, and transcriptome analysis was performed in subcellular compartments. The third phase, released in July 2020, identified several functional elements associated with gene regulation (transcription factor binding sites, open chromatin and histone marks, to name a few) and transcript isoforms, and revealed landscapes of RNA binding and the 3D organization of chromatin. Phases II and III together used data from 9,239 experiments and more than 500 cell types and tissues.
The enhanced view of the human genome now includes 20,225 protein-coding and 37,595 non-coding genes, about 2 million open chromatin regions, 750,000 regions with modified histones, 1.2 million regions bound by transcription factors and chromatin-associated proteins, a total of 0.9 million cis-regulatory elements, and much more. Together these cover more than 80% of the genome and play roles in regulating gene expression, methylation and maintaining chromosome integrity. This sea of information will improve our understanding of the genetics of disease development, which will in turn play a major role in drug development and in genomic and personalized medicine.