Well, DNA is really just a sequence of molecules called nucleotides, arranged into a particular shape (a double helix). Each nucleotide of DNA contains one of four different bases: adenine (A), cytosine (C), guanine (G), or thymine (T). Every human cell has billions of these nucleotides arranged in sequence. Some portions of this sequence (i.e., the genome) are the same, or at least very similar, across almost all humans, but other portions of the sequence have a higher genetic diversity and thus vary more across the population.
One place where DNA tends to have high genetic diversity is in Short Tandem Repeats (STRs). An STR is a short sequence of DNA bases that tends to be repeated back-to-back numerous times at specific locations in DNA. The number of times any particular STR repeats varies a lot among different people. In the DNA samples below, for example, Alice has the STR AGAT repeated four times in her DNA, while Bob has the same STR repeated five times.
Using multiple STRs, rather than just one, can improve the accuracy of DNA profiling. If the probability that two people have the same number of repeats for a single STR is 5%, and the analyst looks at 10 different STRs, then the probability that two DNA samples match purely by chance is 0.05 to the 10th power, or about 1 in 10 trillion (assuming all STRs are independent of each other). So if two DNA samples match in the number of repeats for each of the STRs, the analyst can be pretty confident they came from the same person. CODIS, the FBI’s DNA database, uses 20 different STRs as part of its DNA profiling process.
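The arithmetic behind this kind of estimate is a single multiplication rule: if the STRs are independent, the per-STR match probabilities multiply. The short sketch below uses the 5% and 10-STR figures from the paragraph above; the variable names are only illustrative.

```python
# Chance that two unrelated samples agree on every STR,
# assuming the STRs are independent of each other.
p_single = 0.05  # probability of a chance match at one STR (figure from the text)
n_strs = 10      # number of STRs the analyst compares

p_all = p_single ** n_strs  # independent events: probabilities multiply
one_in = 1 / p_all          # the same figure expressed as "1 in N"

print(f"chance of a full match: {p_all:.3e}")
print(f"about 1 in {one_in:,.0f}")
```

The result is on the order of 1e-13, which corresponds to roughly 1 in 10 trillion.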
What might such a DNA database look like? Well, in its simplest form, you could imagine formatting a DNA database as a two-dimensional list, wherein each row corresponds to an individual, and each column corresponds to a particular STR.
STR_sequences = ['AGAT', 'AATG', 'TATC']
DNA_database = [['Alice', 28, 42, 14], ['Bob', 17, 22, 19], ['Charlie', 36, 18, 25]]
The data in the above lists would suggest that Alice has the sequence AGAT repeated 28 times consecutively somewhere in her DNA, the sequence AATG repeated 42 times, and TATC repeated 14 times. Bob, meanwhile, has those same three STRs repeated 17 times, 22 times, and 19 times, respectively. And Charlie has those same three STRs repeated 36, 18, and 25 times, respectively.
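To make the row and column layout concrete, here is a small sketch of how one row might be read: the first element is the name, and the remaining elements line up position by position with STR_sequences. The helper name describe is hypothetical and not part of the assignment.

```python
STR_sequences = ['AGAT', 'AATG', 'TATC']
DNA_database = [
    ['Alice', 28, 42, 14],
    ['Bob', 17, 22, 19],
    ['Charlie', 36, 18, 25],
]


def describe(row, strs):
    """Return (name, {STR: count}) for one database row (hypothetical helper)."""
    name = row[0]     # column 0 holds the individual's name
    counts = row[1:]  # remaining columns hold one count per STR
    return name, dict(zip(strs, counts))


for row in DNA_database:
    name, profile = describe(row, STR_sequences)
    print(name, profile)
```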
So given a sequence of DNA, how might you identify to whom it belongs? Well, imagine that you looked through the DNA sequence for the longest consecutive sequence of repeated AGATs and found that the longest sequence was 17 repeats long. If you then found that the longest sequence of AATGs is 22 repeats long, and the longest sequence of TATC is 19 repeats long, that would provide pretty good evidence that the DNA was Bob’s. Of course, it’s also possible that once you take the counts for each of the STRs, it doesn’t match anyone in your DNA database, in which case you have no match.
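One possible way to find the longest run of consecutive repeats is to try every starting position in the DNA string and count how many copies of the STR follow back-to-back from there. The function below is a minimal sketch of that idea, not the only approach; the name longest_run and the sample string are made up for illustration.

```python
def longest_run(dna, str_seq):
    """Return the longest number of back-to-back repeats of str_seq in dna."""
    if not str_seq:
        return 0  # an empty STR cannot repeat; also avoids an endless loop
    best = 0
    step = len(str_seq)
    for start in range(len(dna)):
        count = 0
        pos = start
        # Keep stepping forward while the next chunk is another copy of the STR.
        while dna[pos:pos + step] == str_seq:
            count += 1
            pos += step
        best = max(best, count)
    return best


# Illustrative sample: three AGATs in a row, a break, then two more.
sample = "CC" + "AGAT" * 3 + "G" + "AGAT" * 2
print(longest_run(sample, "AGAT"))  # 3
print(longest_run(sample, "TATC"))  # 0
```

This scan is simple rather than fast; for the sequence lengths in this problem that trade-off is usually acceptable.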
In practice, since analysts know on which chromosome and at which location in the DNA an STR will be found, they can localize their search to just a narrow section of DNA. But we’ll ignore that detail for this problem.
Your task is to write a program that stores a two-dimensional list of STR counts for a list of individuals. It should take a sequence of DNA as input and then output to whom the DNA (most likely) belongs.
In profile.py, implement a program that identifies to whom a sequence of DNA belongs.
- Your program should print your name in an introductory line (e.g., “Name here’s DNA Program”).
- Your program should then ask the user for a DNA sequence and store it in a string variable.
- For each of the STRs (from the STR_sequences list), your program should compute the longest run of consecutive repeats of the STR in the DNA sequence to identify.
- If the STR counts match exactly with any of the individuals in the DNA_database, your program should print out the name of the matching individual.
- You may assume that the STR counts will not match more than one individual.
- If the STR counts do not match exactly with any of the individuals in the DNA_database, your program should print “No match”.
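The comparison described in the last three bullets can be sketched as follows, assuming the STR counts have already been computed into a list ordered the same way as STR_sequences. The function name find_match is hypothetical, and the database values are the ones from the larger example earlier in the problem.

```python
DNA_database = [
    ['Alice', 28, 42, 14],
    ['Bob', 17, 22, 19],
    ['Charlie', 36, 18, 25],
]


def find_match(counts, database):
    """Return the name whose STR counts equal counts exactly, else 'No match'."""
    for row in database:
        # row[0] is the name; row[1:] are that person's STR counts.
        if row[1:] == list(counts):
            return row[0]
    return "No match"


print(find_match([17, 22, 19], DNA_database))  # Bob
print(find_match([17, 22, 20], DNA_database))  # No match
```

Returning at the first exact match relies on the stated assumption that no two individuals share the same counts.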
To test your program, we will use a much simpler database with shorter STR repeat runs. Use the following lists:
STR_sequences = ['AGAT', 'AATG', 'TATC']
DNA_database = [['Alice', 5, 2, 8], ['Bob', 3, 7, 4], ['Charlie', 6, 1, 5]]
The following sequence should match with Alice:
The following sequence should match with Bob:
The following sequence should match with Charlie:
And the following sequence should have No match: