Genomics and Big Data

July 19th, 2015

By Michael Agarwal

The life sciences span a broad range of overlapping disciplines, and every one of them generates massive amounts of data from research, diagnostics, manufacturing, marketing and more. Consider genomics, for example. Researchers want ready access to the full genomic sequences of many species, such as humans, mice, apes, dogs and cats, for comparative analysis. Comparing the DNA of closely related species is critical to understanding their genetic differences. Storing and processing the huge volumes of data that genomics generates (DNA sequencing, for example) is difficult with traditional relational database management systems. This is where big data meets genomics.

What is a human genome?

The functions of all cells in the human body are encoded in the human genome. The human genome is a sequence of approximately 3 billion units called nucleotides. These are organized into DNA molecules, which take the familiar double-helix form. Nucleotides are represented by four letters: A for adenine, C for cytosine, G for guanine, and T for thymine. The nucleotide sequence specifies the sequence of amino acids that the body uses to build proteins. Proteins carry out the work of the cells, from development throughout life. They are responsible for both the physical attributes of a human being and their susceptibility to disease. A gene is a segment of a DNA molecule that codes for one complete protein. The human genome is carried on 23 different DNA molecules, or chromosomes: 22 autosomal chromosomes and one X or Y sex chromosome. Genomes of other species may contain fewer or more chromosomes and nucleotides, but they follow the same basic organizational format as the human genome.

Amount of data stored in a human genome

Given our interest in using big data to assist genomic research, a natural question for big data practitioners is: how much digital information does a human genome contain? As mentioned in the previous section, the human genome is carried on 23 different DNA molecules, or chromosomes: 22 autosomal chromosomes and one X or Y sex chromosome. Based on X-ray crystallography data, James Watson and Francis Crick inferred that the DNA molecule consists of two strands forming a double helix. ["Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid", Nature, April 25, 1953, volume 171, pp. 737-738.]

DNA must be chemically stable and must be able to copy the information it contains, hence its two-stranded structure. The four bases of DNA, namely adenine (A), thymine (T), cytosine (C), and guanine (G), always pair in the same way: adenine connects to thymine, and cytosine connects to guanine. Therefore, the four possible base pairs are A-T, T-A, G-C, and C-G. The haploid human genome consists of approximately 3 billion of these base pairs. A human being inherits two copies of the genome, one from the mother and one from the father, for a total of 46 chromosomes. Together they form the diploid genome, which contains about 6×10^9 base pairs. Roughly 10% of the genome consists of highly repetitive DNA, 25-30% is moderately repetitive and the rest is unique sequence DNA.
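The pairing rule above can be illustrated with a few lines of Python. This is only a sketch: the helper names complement_strand and base_pairs are made up for illustration, and the code pairs bases position by position without modeling strand direction.

```python
# Pairing rule from the text: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(strand):
    """Return the partner strand, base by base (hypothetical helper)."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def base_pairs(strand):
    """List the base pair formed at each position of the strand."""
    return [f"{base}-{COMPLEMENT[base]}" for base in strand.upper()]


print(complement_strand("GATTACA"))  # CTAATGT
print(base_pairs("ATGC"))            # ['A-T', 'T-A', 'G-C', 'C-G']
```

Because one strand fully determines the other, recording the base on one strand is enough to identify the base pair at that position.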

How does this translate into binary digits, that is, 0s and 1s? Each position holds one of four possible base pairs, so 2 bits suffice: the codes 00, 01, 10, and 11 each denote one of the four base pairs. Hence, one byte (8 bits) represents 4 DNA base pairs. The entire diploid human genome therefore takes 6×10^9 base pairs x 1 byte/4 base pairs = 1.5×10^9 bytes. That is, 1.5 gigabytes (GB) of data is required to represent the entire diploid human genome. If we take 37.2 trillion cells (“There are 37.2 trillion cells in your body”, Smithsonian.com, October 24, 2013) as the estimated makeup of a human body, and assume that every cell carries a full diploid genome, then the amount of data stored in the human body will be 1.5 GB x 37.2 trillion cells = 55.8 trillion GB, or 55.8 zettabytes (ZB), using decimal units (1 ZB = 10^21 bytes). Please see the data measurement chart below.
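The 2-bit encoding and the storage arithmetic can be checked with a short Python sketch. The assignment of codes to bases (A=00, C=01, G=10, T=11) is an assumption, since the article does not fix one, and pack and unpack are hypothetical helper names. The sketch stores the base on one strand per position, which identifies the base pair.

```python
# Assumed 2-bit code per base; any one-to-one assignment would work.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # position in this string equals the 2-bit code


def pack(seq):
    """Pack a base sequence into bytes, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))  # zero-pad a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data, n_bases):
    """Recover the first n_bases bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])


packed = pack("GATTACA")
print(len(packed))        # 2 bytes hold 7 bases
print(unpack(packed, 7))  # GATTACA

# Storage arithmetic from the text, in decimal units.
DIPLOID_BASE_PAIRS = 6 * 10**9
CELLS = 372 * 10**11                    # 37.2 trillion cells
genome_bytes = DIPLOID_BASE_PAIRS // 4  # 4 base pairs per byte
body_bytes = genome_bytes * CELLS       # assumes a full genome in every cell
print(genome_bytes / 10**9, "GB")       # 1.5 GB
print(body_bytes / 10**21, "ZB")        # 55.8 ZB
```

Integer arithmetic is used for the totals so that the large products stay exact; the division into GB and ZB happens only when printing.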