DNA (genomic or genetic) data is vast and carries little meaning until it is processed and analysed. With roughly 3 billion DNA base pairs to read in each unique human, there is more information than anyone could feasibly analyse in a single lifetime without the support of computers. To convert this massive sequence of letters into something of value, the data passes through three stages of analysis: primary, secondary, and tertiary.
Once the sequencer machine has completed its run, it produces a data file that is analysed to determine the quality of each DNA letter read. Sequencing typically produces around 40 copies of every region of DNA, and these copies are used in this first stage of analysis. The computer compares all 40 copies of each DNA letter and determines which is the truest version of that DNA location. In simpler terms, a DNA letter can sometimes be misread, so we read it numerous times to be sure; it is like taking a 'double-take' when you are not sure of what you saw. In this way the computer automatically assigns confidence scores (Phred quality scores) to each letter and produces a final set of sequencing reads. For example, if the machine read a DNA location as 'A' 38 times, 'G' once, and 'T' once, the computer will record it as an 'A' with a high confidence score. The file coming from primary analysis is typically a FASTQ file.
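The consensus-and-confidence idea above can be sketched in a few lines of Python. This is a deliberately simplified illustration: real sequencers estimate the error probability from instrument signal quality, not from a simple vote among copies, and the base counts here are the made-up example from the text.

```python
from collections import Counter
import math

def consensus_base(observed):
    """Pick the most frequent letter among repeated reads of one DNA
    position and attach a Phred-style quality score, Q = -10 * log10(p),
    where p is the estimated probability the call is wrong.

    Toy estimate only: p is taken as the fraction of disagreeing reads,
    with a small floor so a unanimous vote doesn't give infinite quality."""
    counts = Counter(observed)
    base, votes = counts.most_common(1)[0]
    p_error = max(1 - votes / len(observed), 1e-4)
    quality = round(-10 * math.log10(p_error))
    return base, quality

# The example from the text: 38 reads of 'A', one 'G', one 'T'
print(consensus_base(["A"] * 38 + ["G", "T"]))  # → ('A', 13)
```

On the Phred scale, a score of 10 means a 1-in-10 chance of error, 20 means 1-in-100, and so on, which is why a single logarithmic number is a convenient way to store per-letter confidence in a FASTQ file.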
The result of primary analysis is a file containing a large number of unorganised 'reads' of DNA. Each read is a short portion of the full DNA sequence, and the reads still need to be assembled into one sequence. The computer aligns all of the reads produced by the sequencer against a reference genome. A 'reference genome' is a standard sequence, built from a number of individuals of a single species, that serves as a common point of comparison. Reference genomes greatly increase the speed at which the computer can work out how all of the reads fit together. It is like taking a big print-out of a finished puzzle and assembling the individual puzzle pieces against this image.
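The puzzle analogy can be made concrete with a toy aligner. The sketch below places each read at the position where it matches the reference exactly; real aligners (such as BWA or Bowtie) tolerate mismatches and handle billions of reads, and the sequences here are invented for illustration.

```python
def align_reads(reference, reads):
    """Place each short read at its matching position in the reference
    genome (toy exact-match version of what production aligners do).
    Returns a dict mapping each read to its 0-based start position,
    or -1 if the read is not found in the reference."""
    return {read: reference.find(read) for read in reads}

# A 12-letter 'reference genome' and three overlapping 5-letter reads
reference = "ACGTACGGTACT"
reads = ["ACGTA", "CGGTA", "GTACT"]
print(align_reads(reference, reads))  # → {'ACGTA': 0, 'CGGTA': 5, 'GTACT': 7}
```

Because each read lands at a known coordinate, overlapping reads can be stitched into one continuous sequence, which is exactly what assembling against the finished-puzzle image buys us.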
Once the sequence has been assembled, we can also see where all of the differences, or 'variants', between an individual's genome and the reference genome are. This gives us the ability to highlight and list all of these differences in one file in the Variant Call Format (VCF).
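To show what ends up in that file, here is a minimal sketch of a single VCF data line. The chromosome, position, and bases are hypothetical; a real VCF file also carries header lines and per-sample genotype columns.

```python
def vcf_record(chrom, pos, ref, alt):
    """Format one variant as a minimal VCF data line. VCF's eight
    mandatory tab-separated columns are CHROM, POS, ID, REF, ALT,
    QUAL, FILTER, and INFO; '.' marks a missing value."""
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", "PASS", "."])

# A hypothetical A->G substitution at position 12345 on chromosome 7
print(vcf_record("7", 12345, "A", "G"))
```

Each line therefore records one difference from the reference: where it is, what the reference says, and what this individual's genome says instead.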
Tertiary analysis is where most of the time is spent, because it combines manual review with a software-based approach. Secondary analysis produced a list of variants between the individual's genome and the reference genome; tertiary analysis is the process of finding the significance and relevant meaning of these differences.
All variants are assigned a clinical significance based on aggregated databases of published research and peer-reviewed literature from the medical field worldwide. Based on the strength of evidence for variant-disease associations, a pathogenicity score or tag is given to each variant: high, medium, low, or unknown clinical significance.
The number of variants in any given individual can be very high (4 to 5 million), so we apply filters to exclude findings that currently have little to no supporting evidence or that are not relevant to the individual's medical history.
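The tagging-and-filtering steps above can be sketched as follows. The lookup table here is a hypothetical stand-in: real pipelines query curated clinical databases such as ClinVar, and the variants and tags are invented for illustration.

```python
# Hypothetical significance lookup; real pipelines query curated
# variant-disease databases rather than a hard-coded table.
SIGNIFICANCE = {
    ("7", 12345, "A", "G"): "high",
    ("7", 67890, "C", "T"): "unknown",
}

def filter_variants(variants, keep=("high", "medium")):
    """Tag each variant (chrom, pos, ref, alt) with its clinical
    significance and keep only those with enough supporting evidence
    to appear in the final report."""
    tagged = [(v, SIGNIFICANCE.get(v, "unknown")) for v in variants]
    return [(v, tag) for v, tag in tagged if tag in keep]

variants = [("7", 12345, "A", "G"), ("7", 67890, "C", "T")]
print(filter_variants(variants))  # only the high-significance variant survives
```

In a real pipeline this filter is what reduces millions of raw variants down to the handful worth a clinician's attention.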
The result of tertiary analysis is the report that clients receive. This report shows the variants found that are relevant to the client's health categories of interest and can be shared with their healthcare provider on request. These reports can be used to clarify a cause of disease or to spark a discussion about preventative care with a healthcare provider.