The EDNA Prototype

EDNA has developed a software program to facilitate safe blockchain-based storage of DNA. Its essential steps are outlined here, along with some important background.

DNA

Human DNA is made up of 3 billion pairs of molecules. The molecules come in 4 types, and geneticists refer to them by the first letter of their scientific names (C, A, T and G). If you were to write out your entire genome in a word-processing document, it would fill a bit over 2 CDs with pairs of molecules: (A+T)(C+G)(T+A), repeated over and over 3 billion times.

GRCh38

We humans have nearly all of our DNA in common. As a matter of fact, any two of us will be found to differ from each other in only about 3 million places. That is a lot less data (only about 5 megabytes), just about as many ones and zeros as it would take to store 2 copies of the black and orange image shown near the bottom of this page. GRCh38 is the latest record of the "most common" human DNA. It is maintained by the Genome Reference Consortium and hosted by the US National Center for Biotechnology Information (NCBI). The first thing the EDNA prototype does with your digital DNA is compare it to GRCh38 and extract the variations from this "standard reference". This massively shrinks the amount of data EDNA has to work with and store.
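The comparison step described above can be sketched in a few lines of Python. This is a minimal illustration, not EDNA's actual code: the function name `extract_variants` and the toy sequences are made up for this example, and it assumes the owner's sequence has already been aligned to the reference so the two can be compared position by position. Real genome comparison also has to handle insertions and deletions, which this sketch ignores.

```python
from typing import Iterator, Tuple


def extract_variants(
    reference: str, sample: str, offset: int = 1
) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, reference_base, sample_base) wherever the two differ.

    Assumption: both sequences are pre-aligned and of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
    for i, (ref_base, sample_base) in enumerate(zip(reference, sample)):
        if ref_base != sample_base:
            yield (offset + i, ref_base, sample_base)


# Toy data, invented for illustration only.
reference = "CCCAGGGGCCAAGCC"
sample = "CCCAGGAGCCAAGCT"

variants = list(extract_variants(reference, sample))
print(variants)  # [(7, 'G', 'A'), (15, 'C', 'T')]
```

Only the two differing positions are kept. The 13 matching positions need not be stored at all, which is where the data savings come from.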

Tagging

As the EDNA software is "walking" along an owner's digital DNA, whenever it finds a pair of letters that differs from GRCh38, it creates a small file we call a tag. The file would look similar to the following:

[89394822](C-G)[EDNA_S7LVHJBJNWdZG8bXcNZQ-ZRn5Riufn6KpeVDSK337Tbj7vCwEWAJ]

The number in the first set of brackets is the molecular location of the variation we are recording, the letters in parentheses show the variation, and the long string beginning EDNA_ is a unique ID that lets us store and retrieve the information. In the case shown, the software found a variation at the beginning of the BRCA1 gene on Chromosome 17, which normally begins... CCCAGGGGCCAAGCCTGCCCCCAGCCC
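A tag in the format shown above could be built and read back with a small helper like the one below. This is a sketch under assumptions: `Tag`, `parse_tag` and `new_tag_id` are names invented for this example, and since the text does not say how EDNA generates its IDs, a random string stands in for the real scheme.

```python
import re
import secrets
from dataclasses import dataclass

# Matches "[position](X-Y)[EDNA_...]", allowing optional space before the ID.
TAG_PATTERN = re.compile(
    r"^\[(\d+)\]\(([ACGT])-([ACGT])\)\s*\[(EDNA_[A-Za-z0-9_-]+)\]$"
)


@dataclass(frozen=True)
class Tag:
    position: int  # location of the variation
    pair: str      # the base pair recorded there, e.g. "C-G"
    tag_id: str    # unique ID used to store and retrieve the tag

    def to_line(self) -> str:
        """Render the tag as a single line of text."""
        return f"[{self.position}]({self.pair})[{self.tag_id}]"


def new_tag_id() -> str:
    # Assumption: a random placeholder, not EDNA's real ID scheme.
    return "EDNA_" + secrets.token_urlsafe(38)


def parse_tag(line: str) -> Tag:
    """Read one tag line back into its three parts."""
    match = TAG_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid tag line: {line!r}")
    position, first, second, tag_id = match.groups()
    return Tag(int(position), f"{first}-{second}", tag_id)


example = "[89394822](C-G)[EDNA_S7LVHJBJNWdZG8bXcNZQ-ZRn5Riufn6KpeVDSK337Tbj7vCwEWAJ]"
tag = parse_tag(example)
print(tag.position, tag.pair)  # 89394822 C-G

# A freshly created tag survives a round trip through its text form.
fresh = Tag(95594734, "C-G", new_tag_id())
assert parse_tag(fresh.to_line()) == fresh
```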

Tumbling

Tumbling (also known as mixing) is a privacy process from the world of cryptocurrency, in the same spirit as privacy coins like Zcash and Monero. What it does is "mix" a user's tokens in with a large pile of other people's tokens, then send the proper amounts back out to the users at random. This serves to bury the identity and ownership trail of the users holding the coins.

Putting it all Together

After the EDNA software processes a user's DNA, the result is a large group of tiny 1-line files that would look similar to this:

[89394822](A-T)[EDNA_R9LVHJBJNWdZG8bXcNZQ-ZRn5Riufn6KpeVDSfS67Tbj7vCwEWAJ]
[95594734](C-G)[EDNA_7JvxU7jFjhc2aCaTVK7sSTXpNP9RoSTH6PpCSNFej8i3yvTEUz]
[116438768](A-T)[EDNA_S7LVHJBJNWdZG8bXcNZQ-ZRn5Riufn6KpeVDSK337Tbj7vCwEWAJ]

(And around 3 million more files like those shown above)

This process is run on servers that have no connection to the internet and no Wi-Fi or Bluetooth capability. The software also generates one larger file that is just a list of the ID strings that begin with EDNA_. This larger file is encrypted and placed on the InterPlanetary File System (IPFS), and the decryption key for the file is sent to the DNA owner's EOS wallet.
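The list of IDs described above could be gathered from the tag files roughly as follows. This is only a sketch: `tag_id` and `build_index` are hypothetical helper names, and the encryption and IPFS upload steps are left out because the text does not say which cipher or tooling EDNA uses.

```python
from typing import Iterable


def tag_id(line: str) -> str:
    """Return the EDNA_ ID held in the last pair of brackets of a tag line."""
    start = line.rindex("[EDNA_") + 1
    end = line.rindex("]")
    return line[start:end]


def build_index(tag_lines: Iterable[str]) -> str:
    """Join the IDs of one owner's tags into a single text, one ID per line."""
    return "\n".join(tag_id(line) for line in tag_lines if line.strip())


# The three sample tag lines shown earlier in this section.
owner_tags = [
    "[89394822](A-T)[EDNA_R9LVHJBJNWdZG8bXcNZQ-ZRn5Riufn6KpeVDSfS67Tbj7vCwEWAJ]",
    "[95594734](C-G)[EDNA_7JvxU7jFjhc2aCaTVK7sSTXpNP9RoSTH6PpCSNFej8i3yvTEUz]",
    "[116438768](A-T)[EDNA_S7LVHJBJNWdZG8bXcNZQ-ZRn5Riufn6KpeVDSK337Tbj7vCwEWAJ]",
]

index_text = build_index(owner_tags)
print(len(index_text.splitlines()))  # 3
```

The index holds IDs only, with no positions or base pairs, so on its own it says nothing about the owner's DNA.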

EDNA waits until we have saved up the data of 5,000 DNA owners, as represented in the tiny files. We then tumble all of these files together, mixing them before sending the small files at random to the IPFS storage system as well. The odds of someone being able to reconstruct anyone’s DNA without the private keys are 3 million multiplied by itself 5,000 times to one, making the EDNA storage strategy probably one of the most secure methods ever created.
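The pooling step, and the size of the odds figure quoted above, can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the function names, the two toy owners and their shortened IDs are invented, and the last helper only measures how large the quoted number is without deriving or verifying it.

```python
import math
import random
from typing import Dict, Iterable, List


def tag_id(line: str) -> str:
    """Return the EDNA_ ID held in the last pair of brackets of a tag line."""
    return line[line.rindex("[EDNA_") + 1 : line.rindex("]")]


def tumble(tags_by_owner: Dict[str, List[str]], rng: random.Random) -> List[str]:
    """Pool every owner's tag lines and return them in random order."""
    pool = [line for lines in tags_by_owner.values() for line in lines]
    rng.shuffle(pool)
    return pool


def recover(pool: Iterable[str], index: Iterable[str]) -> List[str]:
    """Pick out the tag lines whose ID appears in one owner's private index."""
    wanted = set(index)
    return [line for line in pool if tag_id(line) in wanted]


def digits_in_quoted_odds(variants: int = 3_000_000, owners: int = 5_000) -> int:
    """Number of decimal digits in variants ** owners, computed via logarithms."""
    return math.floor(owners * math.log10(variants)) + 1


# Toy data: two owners with two tags each (the text describes 5,000 owners).
tags_by_owner = {
    "owner_a": ["[101](A-T)[EDNA_aaa1]", "[205](C-G)[EDNA_aaa2]"],
    "owner_b": ["[101](G-C)[EDNA_bbb1]", "[317](T-A)[EDNA_bbb2]"],
}

# SystemRandom draws from the operating system's randomness source.
pool = tumble(tags_by_owner, random.SystemRandom())

# Only the holder of owner_a's index can pull owner_a's tags back out.
index_a = [tag_id(line) for line in tags_by_owner["owner_a"]]
mine = recover(pool, index_a)
assert sorted(mine) == sorted(tags_by_owner["owner_a"])

print(digits_in_quoted_odds())  # 32386
```

Nothing in a tag line itself names its owner. In this sketch the index is the only thing that ties one owner's lines together once they are in the pool.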

Almost as important as the security built into this design is the added feature that the raw data does not need to be encrypted to be kept safe. It can therefore be used by EDNA’s internal research program to add value to the DNA kept within the EDNA system.

Just How Safe is This?

The graphic below illustrates a part of the human genome called Short Tandem Repeats (STRs). Every human carries these, and it is from these sections of the DNA that forensic scientists determine someone’s identity. In the US, the CODIS database stores 20 of these sections of DNA so that investigators can discover who left DNA at a crime scene. The odds of matching 20 of 20 of these STRs and it not being that person’s DNA are astronomical, which is why DNA evidence in court is so convincing. When scientists match a large number of them (but not all 20), they know they are probably looking at a family member of the person they seek.

The graphic below illustrates STR #1 and #2. With 11 to 18 more digits in the “combination” (depending on which country’s database is being used), you can see the science requires a good amount of intact data and quality DNA to “get a match”. Further, the STRs are spread widely across the DNA and appear in most of the chromosomes, as can be seen in the bottom graphic.

Why is this so Important?

Look again at the EDNA data storage strategy above. The letters in the first file listed (A-T) could correspond to the first blue box in the blue section of 10 for the first person in the graphic. The other 9 base pairs would be stored in other files, mixed in with millions and millions of other base pairs, with no possible way to put them together as belonging to the same person without the private keys. Shown below is just how scattered across the chromosomes these STRs are in the human genome.

This is how EDNA keeps your identity and family relations safe while your data is on chain.