finding string matches

Will_Shank · November 16, 2013, 12:01pm

You can improved the time to 112 seconds by using a dictionary for mapping instead of IndexOf. And I think it might move down to 20 seconds by stepping over the dna 8 bytes at a time instead of 1, searching a larger pool of offset mers and then doing a more complicated full match test accounting for that offset.

Kem_Tekinay · November 16, 2013, 4:57pm

I think you have to time the load too. I have no doubt your code is faster, but that would be the fair comparison.

How do your results compare the ones I posted above? (I won’t have time to look myself until much later, but I’m curious now. )

Bob_Keeney · November 16, 2013, 6:45pm

We ran a test on our RTF code in the Formatted Text Control this morning. RegExMBS was 4000 times faster doing the same thing using the Xojo RegEx class.

Kem_Tekinay · November 16, 2013, 6:55pm

Yes, I have a Feedback report about that.

Robert_Gross · November 17, 2013, 3:52am

Hello all -

I have incorporated the new code into my program. I used two different source sequences (7SK_NR_001445.2 and Air_NR_027773.1_splice_variant_3) and looked at only chromosomes 1 and 2 to keep the test time manageable. There was about at 3x improvement (58.05/18.77 = 3.09) with Kem’s code. The 3-fold improvement will make a big difference for us. Thanks Kem. Here are the results:

ORIGINAL CODE:
lncRNA name chromosome # matches run time
7SK_NR_001445.2 chr01part1 0 0.00 mins
7SK_NR_001445.2 chr01part2 0 0.02 mins
7SK_NR_001445.2 chr01part3 0 0.15 mins
7SK_NR_001445.2 chr01part4 3 0.40 mins
7SK_NR_001445.2 chr01part5 16 4.82 mins
7SK_NR_001445.2 chr01part6 1 0.02 mins
7SK_NR_001445.2 chr01part7 0 0.02 mins
7SK_NR_001445.2 chr01part8 0 0.00 mins
7SK_NR_001445.2 chr01part9 0 0.02 mins
7SK_NR_001445.2 chr01part10 0 0.00 mins
7SK_NR_001445.2 chr01part11 0 0.00 mins
7SK_NR_001445.2 chr01part12 0 0.13 mins
7SK_NR_001445.2 chr01part13 0 0.00 mins
7SK_NR_001445.2 chr01part14 15 2.77 mins
7SK_NR_001445.2 chr01part15 11 1.88 mins
7SK_NR_001445.2 chr01part16 0 0.00 mins
7SK_NR_001445.2 chr02part1 0 0.17 mins
7SK_NR_001445.2 chr02part2 0 0.07 mins
7SK_NR_001445.2 chr02part3 2 0.48 mins
7SK_NR_001445.2 chr02part4 33 3.23 mins
7SK_NR_001445.2 chr02part5 0 0.03 mins
7SK_NR_001445.2 chr02part6 0 0.03 mins
7SK_NR_001445.2 chr02part7 38 6.08 mins
7SK_NR_001445.2 chr02part8 1 0.27 mins
7SK_NR_001445.2 chr02part9 0 0.15 mins
Air_NR_027773.1_splice_variant_3 chr01part1 0 0.02 mins
Air_NR_027773.1_splice_variant_3 chr01part2 0 0.00 mins
Air_NR_027773.1_splice_variant_3 chr01part3 3 0.27 mins
Air_NR_027773.1_splice_variant_3 chr01part4 3 0.72 mins
Air_NR_027773.1_splice_variant_3 chr01part5 22 8.67 mins
Air_NR_027773.1_splice_variant_3 chr01part6 0 0.03 mins
Air_NR_027773.1_splice_variant_3 chr01part7 0 0.03 mins
Air_NR_027773.1_splice_variant_3 chr01part8 0 0.00 mins
Air_NR_027773.1_splice_variant_3 chr01part9 0 0.02 mins
Air_NR_027773.1_splice_variant_3 chr01part10 0 0.00 mins
Air_NR_027773.1_splice_variant_3 chr01part11 0 0.03 mins
Air_NR_027773.1_splice_variant_3 chr01part12 0 0.22 mins
Air_NR_027773.1_splice_variant_3 chr01part13 0 0.02 mins
Air_NR_027773.1_splice_variant_3 chr01part14 16 4.97 mins
Air_NR_027773.1_splice_variant_3 chr01part15 10 3.40 mins
Air_NR_027773.1_splice_variant_3 chr01part16 0 0.02 mins
Air_NR_027773.1_splice_variant_3 chr02part1 2 0.27 mins
Air_NR_027773.1_splice_variant_3 chr02part2 0 0.12 mins
Air_NR_027773.1_splice_variant_3 chr02part3 4 0.90 mins
Air_NR_027773.1_splice_variant_3 chr02part4 17 5.80 mins
Air_NR_027773.1_splice_variant_3 chr02part5 0 0.05 mins
Air_NR_027773.1_splice_variant_3 chr02part6 0 0.05 mins
Air_NR_027773.1_splice_variant_3 chr02part7 27 10.97 mins
Air_NR_027773.1_splice_variant_3 chr02part8 1 0.48 mins
Air_NR_027773.1_splice_variant_3 chr02part9 0 0.25 mins
Totals 225 58.05 mins

IMPROVED CODE:
lncRNA name chromosome # matches run time
7SK_NR_001445.2 chr01part1 0 0.00 mins
7SK_NR_001445.2 chr01part2 0 0.00 mins
7SK_NR_001445.2 chr01part3 0 0.05 mins
7SK_NR_001445.2 chr01part4 3 0.13 mins
7SK_NR_001445.2 chr01part5 16 1.52 mins
7SK_NR_001445.2 chr01part6 1 0.02 mins
7SK_NR_001445.2 chr01part7 0 0.00 mins
7SK_NR_001445.2 chr01part8 0 0.00 mins
7SK_NR_001445.2 chr01part9 0 0.00 mins
7SK_NR_001445.2 chr01part10 0 0.00 mins
7SK_NR_001445.2 chr01part11 0 0.00 mins
7SK_NR_001445.2 chr01part12 0 0.05 mins
7SK_NR_001445.2 chr01part13 0 0.00 mins
7SK_NR_001445.2 chr01part14 15 0.85 mins
7SK_NR_001445.2 chr01part15 11 0.60 mins
7SK_NR_001445.2 chr01part16 0 0.00 mins
7SK_NR_001445.2 chr02part1 0 0.05 mins
7SK_NR_001445.2 chr02part2 0 0.02 mins
7SK_NR_001445.2 chr02part3 2 0.15 mins
7SK_NR_001445.2 chr02part4 33 1.02 mins
7SK_NR_001445.2 chr02part5 0 0.00 mins
7SK_NR_001445.2 chr02part6 0 0.02 mins
7SK_NR_001445.2 chr02part7 38 1.88 mins
7SK_NR_001445.2 chr02part8 1 0.08 mins
7SK_NR_001445.2 chr02part9 0 0.05 mins
Air_NR_027773.1_splice_variant_3 chr01part1 0 0.00 mins
Air_NR_027773.1_splice_variant_3 chr01part2 0 0.00 mins
Air_NR_027773.1_splice_variant_3 chr01part3 3 0.08 mins
Air_NR_027773.1_splice_variant_3 chr01part4 3 0.23 mins
Air_NR_027773.1_splice_variant_3 chr01part5 22 2.87 mins
Air_NR_027773.1_splice_variant_3 chr01part6 0 0.00 mins
Air_NR_027773.1_splice_variant_3 chr01part7 0 0.02 mins
Air_NR_027773.1_splice_variant_3 chr01part8 0 0.00 mins
Air_NR_027773.1_splice_variant_3 chr01part9 0 0.00 mins
Air_NR_027773.1_splice_variant_3 chr01part10 0 0.00 mins
Air_NR_027773.1_splice_variant_3 chr01part11 0 0.02 mins
Air_NR_027773.1_splice_variant_3 chr01part12 0 0.07 mins
Air_NR_027773.1_splice_variant_3 chr01part13 0 0.00 mins
Air_NR_027773.1_splice_variant_3 chr01part14 16 1.63 mins
Air_NR_027773.1_splice_variant_3 chr01part15 10 1.12 mins
Air_NR_027773.1_splice_variant_3 chr01part16 0 0.00 mins
Air_NR_027773.1_splice_variant_3 chr02part1 2 0.10 mins
Air_NR_027773.1_splice_variant_3 chr02part2 0 0.03 mins
Air_NR_027773.1_splice_variant_3 chr02part3 4 0.30 mins
Air_NR_027773.1_splice_variant_3 chr02part4 17 1.92 mins
Air_NR_027773.1_splice_variant_3 chr02part5 0 0.02 mins
Air_NR_027773.1_splice_variant_3 chr02part6 0 0.02 mins
Air_NR_027773.1_splice_variant_3 chr02part7 27 3.60 mins
Air_NR_027773.1_splice_variant_3 chr02part8 1 0.17 mins
Air_NR_027773.1_splice_variant_3 chr02part9 0 0.08 mins
Totals 225 18.77 mins

Robert_Gross · November 17, 2013, 3:56am

Will… I have looked at your code and am not sure I understand what you are doing. Could you provide a brief explanation in plain English?

Thanks.

Bob

Will_Shank · November 17, 2013, 9:17am

I implemented that speed up to step 8 bytes each time and it now runs in 17 seconds, in a built app with pragmas it’s 13 seconds. Loading time is about 0.4 seconds. I’ll try to explain it, the testing and setup is more convoluted in this version but certain parts are cleaner.

First here’s the faster version http://home.comcast.net/~trochoid/code/DNASearcher2.zip

The main issue has to do with endianess. I only have access to a mac and on this machine a Ptr returns values in LittleEndian format, even though I set it’s MemoryBlocks LittleEndian False. I think Windows machines are big endian and then maybe Ptr returns different ordered values, in which case several parts of the algorithm need to be flipped. Basically, if you’re not on a mac I’m not sure what this endian behavior might be. The fix would be knowing the behavior

Another small issue with it not running on non-macs is the way I request input files
dim fdna As FolderItem = GetOpenFolderItem("")
I seem to remember a Windows user had problems with “” passed to GetOpenFolderItem. You might need to fix that to be able to select your files.

Now on to the algorithm. It’s not that complex but there’s many parts to track, I needed to make a few sketches to organize it in my head. Try drawing out the bytes and it should make more sense. The whole process is split in 2 parts, loading in OpenButton and the actual search in SearchButton.

LOADING
Loading first requests dna and query files. The dna data is simply read into Property dnaMem, a MemoryBlock. The data in dnaMem is actually padded 8 bytes either side, this is needed for the algorithm which should make sense when we get there.

Then it loads query data. First the query data is read into temporary variable mersInput. Then there’s 3 main loops that process that into the rest of the properties.

Loop 1 scans over mersInput, byte by byte, extracting each 15mer span. One copy is stored in property ‘displayMers As String’ (for display purposes when a match is found) and another copy is put into a padded MemoryBlock and stored in array buildMatches. At the end of the loop property displayMers lists all the 15mers (940 for your example data) and buildMatches lists padded MemoryBlocks of the same 15mers.

Loop 2 scans over buildMatches to build 5 Property arrays. These arrays represent windowed versions of the 15mer at each offset, this is what allows the search loop to step by 8. In the search, 8 bytes of dna are read and then a dictionary is queried for a match. For those 8 bytes to match a 15mer, all possible offset 8 byte spans of the 15mer need to be listed. If you draw it out you’ll see there’s 8 possible offsets and the middle 6 have 15mer data extending either side.

So each 15mer listed in displayMers/buildMatches is expanded to 8 entries in these arrays…

mersStrIdx() As integer - index into displayMers to print result or otherwise id this 15mer
mersOffset() As integer - the offset this entry was windowed at

mersHead(), mersMatch(), mersTail() As UInt64 - These 3 encode the 15mer data. For offset 0, mersHead has no data, the first 8 bytes of the 15mer are in mersMatch (used for lookup) and mersTail has the last 7 bytes of the 15mer. For offset 1 everything shifts over 1, so mersHead has the first byte of the 15mer, mersMatch has the next 8, and mersTail has the last 6. Continue shifting til you reach offset 7 where mersHead has the first 7 bytes, mersMatch the last 8, and mersTail has no data. These 3 UInt64s allow for a quick full 15mer test when the mersMatch value is detected.

Also, this is where the little endian issue comes in play. The UInt64s are read as LittleEndian to match what Ptr returns in the dna reads.

Another Also, the loop is structured to load all offset 7s first, then 6s, on down to 0. This naturally causes results to be collected in sorted order, ie by position.

Loop 3 scans the mersMatch array to build Dictionary mermapDict. This maps each unique mersMatch value to an integer array containing indices into those 5 arrays above. When the Dictionary returns a match this indice array is looped over to fully test each windowed 15mer.

Then a little bit of info is printed and loading is done. This all takes ~0.4 seconds for Rogers example data. (a whole 130MB file is read into a MemoryBlock in that time, computers are amazing)

Whew. I hope this makes sense. You should already have a picture how searching works.

SEARCHING
A bunch of variables are declared, a Ptr to dnaMem is had. The main trick are the variables dnaHead, dnaMatch and dnaTail as UInt64. Easier to explain with code…

[code]dim p As Ptr = dnaMem //get Ptr for faster reads

dnaTail = p.UInt64(8) //preload dnaTail, gets shifted into dnaMatch on first pass

dim hit As boolean = false //flag when a full match occurs

for i = 16 to last step 8 //for each 8 byte span of dna

dnaHead = dnaMatch //roll these down
dnaMatch = dnaTail
dnaTail = p.UInt64(i) //and load new tail

if mermapDict.HasKey(dnaMatch) then //see if any windowed 15mers match this 8 bytes of dna

idxArr = mermapDict.Value(dnaMatch) //found hit, so retrieve indices of matching windowed 15mers
  
for j = 0 to idxArr.Ubound  //for each to test
  k = idxArr(j)               //get index to windowed 15mer
  offset = mersOffset(k)      //grab offset

  if offset = 0 then          //compare head and tail UInt64s, switching comparison based on offset
    if mersTail(k) = (dnaTail and &h00FFFFFFFFFFFFFF) then hit = true
  elseif offset = 1 then
    if mersHead(k) = (dnaHead and &hFF00000000000000) and mersTail(k) = .... then hit = true
    //.....
  elseif offset = 7 then
    if mersHead(k) = (dnaHead and &hFFFFFFFFFFFFFF00) then hit = true
  end        

  if hit then
    //add results: the displayMers at index mersStrIdx(k) matched at position i-offset-8-7
    hit = false
  end

next

end

next[/code]

dnaHead and dnaTail is masked according to offset before comparing with mers datas (these masking &h literals are in reverse order because of how littleendian flips the bytes). When masked dna = the corresponding mers values we have a full match and the hit bit is set. Then if there’s a hit the results are stored.

I haven’t fully thought out the algorithms behavior at the loop start and end and various other indices. For example, the resulting position is i-offset-8-7 because I first thought i-offset-8 would do it but saw it was 7 off of Kems data so I just added -7. And maybe dnaMem doesn’t need to be padded at the front. There’s several things like that that need to be carefully checked over.

I was poking around genBank, they have a lot of analysis tools I don’t understand the names of. Seems they should already have something that does this search. Is that what ‘Blast’ does?

One last little tidbit. Microseconds is retrieved through method getMicro because it used to only update it’s values once per method, ie t1 always equaled t0. Getting it through a method would update the value but it doesn’t seem necessary anymore. Anyways, it’s just habit and I left it in there.

Kem_Tekinay · November 17, 2013, 9:50am

Wow Will, nice work. I was just about to post that I got it down to two minutes by indexing the query (as a Dictionary) rather than the source, but now I’m going to have to study your code in detail.

Will_Shank · November 17, 2013, 11:01am

Thanks Kem. I have the vaguest idea how Regex works, but maybe “indexing the query (as a Dictionary)” is making a similar pattern. I self studied some comp-bio years ago and this algorithm feels like a memory. There’s like a recursive search in it. The full 954 byte query is windowed into 940 15mers. Then each of those are windowed into mersHead/Match/Tail. I wonder if ‘windowing’ at different sizes or depths could make it faster.

Measured the times around this part of the loop. The relative time split appears to be 1 part p.UInt64(i), 1 part (test matches) and 7 parts HasKey(dnaMatch). So a faster dictionary could really help. Is Xojos Dictionary pretty good? Could it be tuned to run faster?

[code]for i = 16 to last step 8

dnaTail = p.UInt64(i)

if mermapDict.HasKey(dnaMatch) then

(test matches)

end

next[/code]

Will_Shank · November 17, 2013, 12:28pm

Does ElfDataDictionary still work? The link says its 3-7x faster. If that’s still true the dna search might run in 4-7 seconds.

Kem_Tekinay · November 17, 2013, 4:48pm

I remember doing a comparison of dictionaries on the NUG some time ago, and my recollection is that the native dictionary did well.

Kem_Tekinay · November 17, 2013, 5:05pm

BTW, I tried a much simplified version of your concept (index the query, not the source I’m kicking myself for not thinking of this myself) by creating a Dictionary tree of sorts. At the top level, the key is the UInt64 value at each position of the query. Within that is another Dictionary of the UInt64 value at pos + 7. This cut the search time to about 93 seconds, which I was all excited about until I saw your post.

I still have to go through your code, but two things: You don’t have to worry about text encoding since it’s all ASCII values anyway, and you don’t have to worry about endiness either as long as it’s consistent. I doubt either of these will make a difference though.

Jeff_Tullin · November 17, 2013, 5:51pm

I confess some of the code/ideas here went way over my head.
But might I offer a ‘simple’ solution, just in case it held, and in case I didn’t miss the point?

I THINK you are looking for string of 15 letters.
And for every occurrence.

So for an example if you were looking for APPLE, you would want to know where that started, and there will be many answers.
If the think you sought was APPLEWORDAPPLE then it might be possible for the second APPLE to be the start of another hit.

)If it wasn’t, you could have taken the string and used SPLIT, to end up with ‘all the bits between the word APPLE’)

So the process in use seems to be one of using INSTR?
Or you have been working through one by one and looking for APPLEWORDAPPLE starting at position (n)?

Heres my idea: shoot it down if you like.
get the string into a memory block, as you have done.

Then throw the memory block at a structure which is a string of (x) characters followed by many strings of (15) characters.
First pass, x is 0
Loop through the strings , checking for equality. if string(36) = ‘APPLEWORDAPPLE’

record the start positions as you find them.

Then do it again with x = 1 so that the whole shebang moves by one character.
And repeat

Thats 14 (?) passes through, checking (string length / 15) items for equality.

Would that work/be faster?

Will_Shank · November 19, 2013, 9:57am

Yes, this sounds like the same idea. If I understand it rightly you step down the dna 1 at a time, search the top level dict and then check the last 7 of all search results. This approach took 112 seconds for me, I guess you have a faster computer:) Changing it to step 8 at a time is where it really speeds up.

Here’s a 3rd version with tidier code and a further 15% speed up. Built app with pragmas is running in 11.4 - 11.6 seconds.
http://home.comcast.net/~trochoid/code/DNASearcher3.zip

The speed up changes are
-While loops instead of For loops
-masks are put in an array and retrieved by offset instead of a big “if offset = 0 then” switch
-results are collected as indices instead of constructing a string

The DefineEncoding is only for the strings that get displayed in the TextArea. Without it my EndOfLines disappeared and replaced with diamond question mark characters.

The problem was the behavior of Ptrs, they always return UInts in little-endian format on my Mac but I suspected on Windows or Linux Ptrs may return big-endian, so my code wouldn’t work there.

I fixed part of that by building the 15mer UInt64s through a Ptr so they would match dna read through a Ptr (originally 15mer UInt64s were made through MemoryBlock methods). The other problem is that the byte masks need to be different depending on the Ptrs endian, luckily there’s TargetLittleEndian/TargetBigEndian already built in!

With these changes this version should run on big-endian machines (can’t really test it though), pretty sure the previous two didn’t.

Markus_Winter · April 23, 2014, 7:29pm

@Will: why do you window and check each position of the 15mers? If you have a match of the first 8mer anyway wouldn’t it be faster to simply move up 6 and check the last 8mer at position +6?

Should be the same result but only two lookups and smaller dictionaries.

Will_Shank · April 24, 2014, 11:04am

Hi Markus, I’m not clear what you’re suggesting. What is only two lookups and moving up by 6? Looking back over the code this is how I remember the algorithm…

First the 15mer query is read in and the Dictionary setup. Basically, Dictionary Keys are all 8 byte sequences present in the 15mer query and Dictionary Values are an array of indices to all 15mers sharing that 8 byte sequence.

Then to scan the DNA 8 bytes are read and checked with the Dictionary. Any indices returned point to possible overlapping 15mers, so all of them are checked for a full match and saved if so. Then the next 8 bytes of DNA are read. That was the main speed up, stepping 8 bytes at a time instead of 1.

The bottleneck is still the Dictionary, I think 7/8ths of the total. Recently I read how Lookup() can be faster because it’s 1 call vs 2 for HasKey() and Value() but for some reason it slows down here significantly.

[code]Current HasKey ~8.5 sec

if mermapDict.HasKey(dnaMatch) then 
  idxArr = mermapDict.Value(dnaMatch) 
  //...

Lookup 1 ~10.0 sec

dim v As Variant

v = mermapDict.Lookup(dnaMatch, nil)
if v <> nil then
  idxArr = v
  //...

Lookup 2 ~10.3 sec

dim t() As integer

idxArr = mermapDict.Lookup(dnaMatch, t)
if idxArr <> t then
  //...[/code]

Rich_H · April 24, 2014, 11:35am

just an idea

what if DNA sequences was stored in dictionary or sql table as:

SEQ1 + SEQ2
SEQ2 + SEQ3
SEQ3 + SEQ4
SEQ4 + SEQ5

and so on.

this would allow searches to occur when a match is in between sequences while keeping the string to search small.

Kem_Tekinay · April 24, 2014, 1:34pm

Will, what is you use if not ( v is nil ) then instead of <>? Perhaps it’s that comparison that’s slowing it down.

Markus_Winter · April 24, 2014, 6:56pm

@ Will:

I’m trying to get my head around it but am failing.

To clarify:

we are looking for a 15mer like TGATTGTACCACATG in a much longer DNA like

TTATTGTTTTGTTACAAAATATGTGCCGATAGTGTAATGTGAGCATAAAT
GAGCTTAATGTTTTAGTGTAAATACAAACATGTAAACCACAAAATGTATG
ACCAATGTTGCCAAGAAGAAAGTTATTGCATAGCACATATTTTAAATTTT
CAACAAATCACATTTGAAAAACAAGATATGAATGTTAATTAATTAGTAAT
ATATTTTAGCTAATTCAATTATCTAAATTATGAATTCCAATATGCAATCA
GTATAATTACTAAGATATGCTGTATCCATTTTTTACCAAGTCTTCAAGAT
CTATATGTAGCTGATACATACATATCTCAATTTAGCTGCTAAGTTTTATG
GCCTTAATTTAGACCATAAACTGTACAGCTGGAAAAGCATATCTACCTTC
CAACCTTTTCTAAGTACACGTAAGTGCTTAAGTAAAACAATCAGTTATGA

To make it easier to follow let’s make the 15mer: thisisa15mernow

Some questions I have about the code in the openButton Action event:

tempMem = new MemoryBlock(32)

Why 32? We need 8+15+8 = 31. Is it faster to use a 32 Byte block?

Not a big problem, just thought I mention it.

Now this is where I get lost:

The following code turns each 15mer into 00000000thisisa15mernow00000000 and add it to buildMatches()

for i = 0 to last new15merData = mersInput.StringValue(i, 15) // get a 15mer tempMem = new MemoryBlock(32) // make a 32byte memory block tempMem.StringValue(8, 15) = new15merData // stuff the 15mer into the memory block as 00000000thisisa15mernow00000000 buildMatches.Append(tempMem) // add it to the list displayMers.Append(new15merData.DefineEncoding(Encodings.UTF8)) next

Now each of those padded memory blocks in the form 00000000thisisa15mernow00000000 is processed

[code] dim p As Ptr
for offset = 7 downto 0
for i = 0 to last // last = mersInput.Size - 15

  p = buildMatches(i)   // each is of the form 00000000thisisa15mernow00000000

  mersHead.Append(p.UInt64(offset))     
  mersMatch.Append(p.UInt64(offset+8)) 
  mersTail.Append(p.UInt64(offset+16)) 
  
  mersOffset.Append(offset)
  mersStrIdx.Append(i)
next

next[/code]

I’m not sure how the pointer works (interestingly in the debugger the pointer doesn’t seem limited, it just goes on and on and on) - unfortunately the documentation is “sparse”. But I guess you use an UInt64 as this returns 8 bytes?

So 00000000thisisa15mernow00000000 becomes

offset mersHead

7 0thisisa
6 00thisis
5 000thisi
4 0000this
3 00000thi
2 000000th
1 0000000t
0 00000000

offset mersMatch

7 15mernow
6 a15merno
5 sa15mern
4 isa15mer
3 sisa15me
2 isisa15m
1 hisisa15
0 thisisa1

Now as far as I can see the values in mersHead (and mersTail) will never be found in the DNA so I’m not sure what their point is.

On the other hand if there is a match to thisisa15mernow then ALL the mersMatch() will match.

Or reduced:

if mersMatch(0) = thisisa1 is found

then a check if mersMatch(7) = 15mernow is found at position +7 (?)

means the whole 15mer thisisa15mernow has been found.

Therefore only two lookups.

But I could be completely wrong about the whole thing so some clarification would be much appreciated.

Will_Shank · April 25, 2014, 6:57am

[quote=82359:@Markus Winter]So 00000000thisisa15mernow00000000 becomes

offset mersHead

7 0thisisa
6 00thisis
5 000thisi[/quote]

ah yes, that’s correct. mersHead and mersTail are for making a full match, mersMatch is just the initial 8 byte matching sequence. Here’s a better example of its workings.

Each 15mer is exploded into 8 entries in these parallel arrays

mersHead() left side of windowed 15mer mersMatch() inner part of windowed 15mer mersTail() right side of windowed 15mer mersOffset() offset of this windowing mersStrIdx() index to the 15mer that represents this windowing (ala displayMers) \\-- same as this 15mers starting position in the 15mer-query string

So if thisisa15mernow is the 12th 15mer in a query it might show up like this in the parallel arrays…

index in
parallel
arrays    offset    head      match     tail     StrIdx
11          7     0thisisa  15mernow  00000000     11
25          6     00thisis  a15merno  w0000000     11
39          5     000thisi  sa15mern  ow000000     11
53          4     0000this  isa15mer  now00000     11
67          3     00000thi  sisa15me  rnow0000     11
81          2     000000th  isisa15m  ernow000     11
95          1     0000000t  hisisa15  mernow00     11
109         0     00000000  thisisa1  5mernow0     11

When the dictionary is built it’ll have a Key for every unique entry in the match column and Value is an array of every index sharing this unique entry. Let’s say “isa15mer” appears only once, then…

dim idxArr() As integer = mermapDict.Value(“isa15mer”)
//idxArr has 1 element with value 53

If “isa15mer” appeared multiple times idxArr would have more elements.

Now an example of scanning dna for matches…
use bytes 16-23 of dna to lookup in dictionary

[code]
0 1 2 3
0123456789012345678901234567890123456789
DNA fourscoreandthisisa15mernowsevenyearsago
\------/\------/\------/\------/\------/
/ | \
/ | \
eandthis isa15mer nowseven
|
v
dictionary returns index 53

test each mer index against current dna data

head match tail
eandthis isa15mer nowseven current dna
0000FFFF . FFF00000 masks for offset 4 ie, mersOffset(53) = 4
| . |
v v v
0000this isa15mer now00000 masked dna

0000this isa15mer now00000 mers data at index 53[/code]

If the masked dna equals mers data then theres a full match and its stored. In the diagram only head and tail are compared, the match column is, well, already matched. Once all the indices have been tested like this the dna frame is ‘rolled’. By this I mean dnaMatch is copied to dnaTail, dnaHead copied to dnaMatch and only dnaHead is newly read from the master dna Memoryblock/Ptr. This keeps reads of dna to a minimum.

so next pass after rolling; only “yearsago” is read (bytes 32-39) but “nowseven” (bytes 24-31) is what’s passed to the dictionary for matching.

      0         1         2         3
      0123456789012345678901234567890123456789
DNA   fourscoreandthisisa15mernowsevenyearsago
      \\------/\\------/\\------/\\------/\\------/
                         /        |       \\
                        /         |        \\
                    isa15mer  nowseven  yearsago
                                  |
                                  v

also, to accommodate this rolling, dna data is padded 8 bytes both sides

//mem 16 bytes bigger than the file dnaMem = new MemoryBlock(bs.Length+16) //data starts at 8 dnaMem.StringValue(8, bs.Length) = bs.Read(bs.Length).Uppercase

[quote]On the other hand if there is a match to thisisa15mernow then ALL the mersMatch() will match.
Or reduced:
if mersMatch(0) = thisisa1 is found
then a check if mersMatch(7) = 15mernow is found at position +7 (?)[/quote]

I see, in a way this is what the rolling, masking and head/tail is doing but without having to byte shift any data. Re-reading or shifting data for position+7 is costly, instead a couple UInt64s are masked and compared with a couple more predefined UInt64s. This is why all the windowed head/tail UInt64s are made, so there’s no need to shift anything for a full match. That these are 15mers seems conspicuous, 16mers wouldn’t work in this algorithm and 14mers wouldn’t be as efficient.

Here’s the scanning loop with more comments

[code] dnaTail = p.UInt64(8) //init the rolling dna data
i = 16
while i < last //scan dna 8 bytes at a time, stepping 8 bytes

//roll dna frame
dnaHead = dnaMatch
dnaMatch = dnaTail 
dnaTail = p.UInt64(i) 

//if the dnaMatch sequence is found in the dictionary then...
if mermapDict.HasKey(dnaMatch) then 
  
  idxArr = mermapDict.Value(dnaMatch) //retrieve array of indices to the 15mers with this sequence
  idxArrLast = idxArr.Ubound
  
  j = 0
  while j <= idxArrLast  //for each index in the array
    k = idxArr(j)           //get the index
    offset = mersOffset(k)  //get the offset of the 15mer stored at this index
    
    //if the dna head and tail (masked for offset) equals the mers head and tail then we have a full match
    if mersHead(k) = (dnaHead and maskHead(offset)) and mersTail(k) = (dnaTail and maskTail(offset)) then
      resultsID.Append(mersStrIdx(k))   //save the 15mers id
      resultsPos.Append(i-offset-15)    //save starting position in dna where it matched
    end
    
    j = j + 1 //step to next mers index
  wend
  
end
i = i + 8 //step 8 bytes up the dna

wend[/code]