Finding string matches

That’s much better but still a bit slower than HasKey.

    if mermapDict.HasKey(dnaMatch) then      //8.5
      idxArr = mermapDict.Value(dnaMatch) 


    v = mermapDict.Lookup(dnaMatch, nil)    //8.8
    if not (v is nil) then
      idxArr = v


    v = mermapDict.Lookup(dnaMatch, nil)    //8.9
    if v is nil then
    else
      idxArr = v


    v = mermapDict.Lookup(dnaMatch, nil)     //9.6
    if v <> nil then
      idxArr = v


    idxArr = mermapDict.Lookup(dnaMatch, t)    //10.1
    if idxArr <> t then
    

    idxArr = mermapDict.Lookup(dnaMatch, t)    //10.9
    if not (idxArr is t) then
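The same single-lookup pattern, sketched in Python for anyone following along outside Xojo (illustrative only; `mermap` and `SENTINEL` are made-up names, not from the project):

```python
# Python analogue of Xojo's Dictionary.Lookup with a default value.
# SENTINEL can never collide with a real stored value.
SENTINEL = object()

mermap = {"CAATGTAC": [0, 6, 12, 18, 24]}  # hypothetical dictionary contents

# Two hash lookups (like HasKey followed by Value):
if "CAATGTAC" in mermap:
    idx_arr = mermap["CAATGTAC"]

# One hash lookup (like Lookup with a default):
v = mermap.get("CAATGTAC", SENTINEL)
if v is not SENTINEL:
    idx_arr = v
```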

@ Will: thanks - will try to wrap my head around this.

You couldn’t by any chance write an article about pointers and memory blocks for XDev? :wink:

@ Will: Still trying to wrap my head around it but I >might< have found a potential problem:

There is a lot of repetition in DNA, some of it very important in medicine (like Huntington’s disease).

Let’s say one of your searches is for a simple repeat sequence like TAATAATAA (made shorter than a 15mer for clarity).

In the DNA you have …CCTAATAATAATAATAATAACC…

So you should have matches at

[code]
CC TAATAATAA TAATAATAACC
CCTAA TAATAATAA TAATAACC
CCTAATAA TAATAATAA TAACC
CCTAATAATAA TAATAATAA CC
[/code]

However, by stepping over the found match (the 8-byte step in the code), wouldn’t you miss those extra matches?
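The concern can be demonstrated with a quick sketch (Python for brevity; the skip branch models a scanner that jumps past each hit, which is an assumption about the code in question):

```python
dna = "CCTAATAATAATAATAATAACC"
pattern = "TAATAATAA"
n, k = len(dna), len(pattern)

# Overlapping scan: advance 1 byte after every position.
overlapping = [i for i in range(n - k + 1) if dna[i:i + k] == pattern]

# Skipping scan: jump past each hit, the way a stepping scanner would.
skipping, i = [], 0
while i <= n - k:
    if dna[i:i + k] == pattern:
        skipping.append(i)
        i += k          # step over the found match
    else:
        i += 1
```

The overlapping scan finds starts at 2, 5, 8 and 11; the skipping scan finds only 2 and 11, missing the two in between.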

P.S. The original sample file isn’t at the link anymore. If anyone could send me a link to it then I would appreciate it.

[quote=46647:@Will Shank]QuerySource is 954 bytes, so that’s a total of ~940 individual 15-mers that you want to find in chromosomeDNA right? Using Ptr scanning with UInt64s it took 11.8 seconds to search for 1 15-mer in that data. To search all 940 this way would be 3.2 hours.

I hadn’t realized the structure of the overlapping input before. Are all the 15-mer sequences in there unique?[/quote]

I think there has been a miscommunication right there.

Robert was talking about overlapping sequences in the very long Chromosome DNA file (which he split into overlapping sequences of 100,000 characters long).

I don’t have the query source but I’m pretty sure it is 40 15mers (=900 bytes) plus the 40 EndOfLines for a total of 940 bytes, like this:

ATGCAGTTGACCATG
TGACCACATTGACAC
TTGAACACATGACAG
…

The 15mers should all be handled separately; there is no overlap.

I put it on my ‘google drive’, been a long time since I used this, guessing it’ll work.
https://drive.google.com/file/d/0B97iUmOCTUYdTVJTdUdsV0lwVHM/edit?usp=sharing 39MB

it expands to a 134MB chromosome file and a 954byte query file, both are just text full of GAGATGCCATGCCAATTCAAGGC…

If you run my code choose chromosome first then the query. Other way around and it’ll lock up trying to build 15mers for all 134MB!

Oh no, there are 940 15mers in the 956-byte query (that might be off by 1). The 15mers are every 15-byte sequence in the query, stepping byte by byte, so there’s lots of overlap. The endofline thing Robert talked about isn’t necessary and I don’t remember whether they are in this sample data. I have to step out now but I can explain better later.

I wrote some visual sequencing software about 2 years ago and still have the source to pull code from. Briefly perusing this post: what exactly are you trying to do? I would be glad to share whatever you need. The original raw DNA was stored into txt files from the analyzer --> parsed into database format --> which allowed for rapid searching and visual DNA molecule display (the first version was canvas based, the 2nd used OpenGL) by search query, as well as all sorts of protein manipulation and analysis for disease and trait procedures. You may PM me a short description of what you need.

Thanks Matthew - pm is on its way shortly :slight_smile:

Thanks Will - am downloading it now.

P.S. Yup, I was wrong. The query file is a 954 bp fragment of chromosome 5 from humans.

That’s in me!? :slight_smile:

Maybe there’s a crossed wire; I meant the part about there being 40 15mers isn’t right. 15mers overlap: the first starts at byte 0 extending 15 bytes, the next starts at byte 1 extending 15, the third starts at byte 2, etc.

While working up a complete map of this algorithm and its data structures I printed out the dictionary and noticed a very regular pattern. Every 8mer Key had 8 (or 16) indices to test, and the indices were regularly spaced. Making sense of this leads to a reframing of the problem as matching 8mers, and a possible speed up.

But first here’s the old/original way I’ve been doing it with 15mers. This example uses a short input query of 19 bytes.

CCAGTACCAATGTACCCCA  <- input query

Pulled out of this query are 5 15mers stored in displayMers() at the given index. This index value is the same as the starting byte position of the 15mer.

[code]
01234 <- starting byte positions
CCAGTACCAATGTACCCCA <- input query

0 CCAGTACCAATGTAC <- displayMers/individual 15mers, index = start byte
1 CAGTACCAATGTACC
2 AGTACCAATGTACCC
3 GTACCAATGTACCCC
4 TACCAATGTACCCCA[/code]
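That extraction step, sketched in Python (illustrative only; `display_mers` plays the role of the displayMers() array):

```python
query = "CCAGTACCAATGTACCCCA"
K = 15  # mer length

# every K-byte window, stepping byte by byte; list index == start byte
display_mers = [query[i:i + K] for i in range(len(query) - K + 1)]
```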

Because each 15mer needs to be matched at any 8 byte span they’re windowed/expanded into the mersHead/Match/Tail/Offset/StrIdx parallel arrays. An offset of 0 starts the 15mer aligned with the start of match. An offset of 1 starts the 15mer 1 byte into the head, and so on until offset 7 starts the 15mer so it ends at the end of match.

In the end the match column contains every 8mer inside every 15mer (this forms the redundancy that can be optimized).

Also, they are listed with offset 7 down to 0 so the left-most versions of the 15mers will match first which causes the results to be collected in ascending order.

[code]
index offset head     match    tail     strIdx  <- parallel mer* arrays
 0    7      0CCAGTAC CAATGTAC 00000000 0
 1    7      0CAGTACC AATGTACC 00000000 1
 2    7      0AGTACCA ATGTACCC 00000000 2
 3    7      0GTACCAA TGTACCCC 00000000 3
 4    7      0TACCAAT GTACCCCA 00000000 4
 5    6      00CCAGTA CCAATGTA C0000000 0
 6    6      00CAGTAC CAATGTAC C0000000 1
 7    6      00AGTACC AATGTACC C0000000 2
 8    6      00GTACCA ATGTACCC C0000000 3
 9    6      00TACCAA TGTACCCC A0000000 4
10    5      000CCAGT ACCAATGT AC000000 0
11    5      000CAGTA CCAATGTA CC000000 1
12    5      000AGTAC CAATGTAC CC000000 2
13    5      000GTACC AATGTACC CC000000 3
14    5      000TACCA ATGTACCC CA000000 4
15    4      0000CCAG TACCAATG TAC00000 0
16    4      0000CAGT ACCAATGT ACC00000 1
17    4      0000AGTA CCAATGTA CCC00000 2
18    4      0000GTAC CAATGTAC CCC00000 3
19    4      0000TACC AATGTACC CCA00000 4
20    3      00000CCA GTACCAAT GTAC0000 0
21    3      00000CAG TACCAATG TACC0000 1
22    3      00000AGT ACCAATGT ACCC0000 2
23    3      00000GTA CCAATGTA CCCC0000 3
24    3      00000TAC CAATGTAC CCCA0000 4
25    2      000000CC AGTACCAA TGTAC000 0
26    2      000000CA GTACCAAT GTACC000 1
27    2      000000AG TACCAATG TACCC000 2
28    2      000000GT ACCAATGT ACCCC000 3
29    2      000000TA CCAATGTA CCCCA000 4
30    1      0000000C CAGTACCA ATGTAC00 0
31    1      0000000C AGTACCAA TGTACC00 1
32    1      0000000A GTACCAAT GTACCC00 2
33    1      0000000G TACCAATG TACCCC00 3
34    1      0000000T ACCAATGT ACCCCA00 4
35    0      00000000 CCAGTACC AATGTAC0 0
36    0      00000000 CAGTACCA ATGTACC0 1
37    0      00000000 AGTACCAA TGTACCC0 2
38    0      00000000 GTACCAAT GTACCCC0 3
39    0      00000000 TACCAATG TACCCCA0 4
[/code]
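A sketch of how that windowing could be built (Python for illustration; the real code uses Xojo parallel arrays, and the '0' characters stand in for masked-out bytes):

```python
query = "CCAGTACCAATGTACCCCA"
mers = [query[i:i + 15] for i in range(len(query) - 14)]

heads, matches, tails, offsets, str_idx = [], [], [], [], []
for off in range(7, -1, -1):   # offset 7 down to 0, so leftmost placements come first
    for i, mer in enumerate(mers):
        heads.append("0" * (8 - off) + mer[:off])      # masked head, 8 bytes
        matches.append(mer[off:off + 8])               # the 8mer inside the 15mer
        tails.append(mer[off + 8:] + "0" * (off + 1))  # masked tail, 8 bytes
        offsets.append(off)
        str_idx.append(i)
```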

Now the dictionary is built from the parallel arrays. The match column is used for Keys and the index added to the Value array, resulting in…

[code]
Keys     Values             <- mermapDict
CAATGTAC [0, 6, 12, 18, 24]
AATGTACC [1, 7, 13, 19]
ATGTACCC [2, 8, 14]
TGTACCCC [3, 9]
GTACCCCA [4]
CCAATGTA [5, 11, 17, 23, 29]
ACCAATGT [10, 16, 22, 28, 34]
TACCAATG [15, 21, 27, 33, 39]
GTACCAAT [20, 26, 32, 38]
AGTACCAA [25, 31, 37]
CAGTACCA [30, 36]
CCAGTACC [35]
[/code]
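Building that dictionary can be sketched like this (Python for illustration; `mermap` stands in for mermapDict):

```python
query = "CCAGTACCAATGTACCCCA"
mers = [query[i:i + 15] for i in range(len(query) - 14)]

# the match column: for offsets 7 down to 0, the 8mer at mer[off:off+8]
matches = [mer[off:off + 8] for off in range(7, -1, -1) for mer in mers]

# Keys are 8mers; Values collect the parallel-array indices that share them
mermap = {}
for idx, m in enumerate(matches):
    mermap.setdefault(m, []).append(idx)
```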

In the dna scanning loop if the dictionary has a Key then it tells you which 15mers at what offset have that Key and each is tested for a full match using the head and tail data.

That’s the way I have it now. Maybe this is what you’ve been suggesting, or it’s been mentioned before, but there’s a redundancy I’ve just noticed. Because each 15mer is offset 1 from the next 15mer, and because each 15mer is windowed, they end up sharing the same 8mers, just at different offsets.

What happens with a full size query is that most of the dictionary’s index arrays are 8 elements long, pointing sequentially to the 15mers that are windowed around the same 8mer.

The problem with this is that all 15mers have to be tested even though they represent the same matched 8mer region in the input query. Instead I’ve reframed my thinking around matching 8mers (that’s really what’s happening anyway). So for each 8mer in the input query, build the 15mers that’ll window around it. That’s it really. Instead of building all 15mers and then all 8mers of them, build all 8mers and then all 15mers around them.

What this allows is once you find an 8mer match you get a single index to an array of premasked heads and tails around the 8mer. These can be binary searched for the offset and full match.

If done right this may lead to a reduction in the number of masking and comparisons needed for a full match, maybe by half or more. Since match testing constitutes 12% of scan time this change might be a 6% or more speed up, though it will be tricky to implement.

Here’s an example how it might work.

[code]
012 8 <- 8mer starting position / index
AAAACCCCGGGGTTTTAAAATTTTCCCC <- input query

0 AAAACCCC <- expands to these 8mers
1 AAACCCCG
2 AACCCCGG

8 GGGGTTTT
…[/code]

then for each 8mer, window all the 15mers around it…

[code]
head match tail
AAAACCCC GGGGTTTT AAAATTTT <- the 8mer at index 8 has these before and after regions

                          offset

00000000 GGGGTTTT AAAATTT0 0 <- precompute head/tail masking at all offsets around this 8mer
0000000C GGGGTTTT AAAATT00 1
000000CC GGGGTTTT AAAAT000 2
00000CCC GGGGTTTT AAAA0000 3
0000CCCC GGGGTTTT AAA00000 4
000ACCCC GGGGTTTT AA000000 5
00AACCCC GGGGTTTT A0000000 6
0AAACCCC GGGGTTTT 00000000 7[/code]
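Precomputing those masked heads and tails could look like this (a Python sketch under the same layout; offset means how many bytes of the 15mer sit before the matched 8mer window):

```python
query = "AAAACCCCGGGGTTTTAAAATTTTCCCC"
i = 8                          # the 8mer starting at byte 8
before = query[i - 8:i]        # "AAAACCCC", region before the 8mer
mer8   = query[i:i + 8]        # "GGGGTTTT"
after  = query[i + 8:i + 16]   # "AAAATTTT", region after the 8mer

heads, tails = [], []
for off in range(8):
    # a 15mer at this offset keeps off bytes of head and 7-off bytes of tail
    heads.append("0" * (8 - off) + before[len(before) - off:])
    tails.append(after[:7 - off] + "0" * (off + 1))
```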

The dictionary gets filled with all 8mers as Keys but now each 8mer only needs 1 index in the Value array. That index points to the 8 element head and tail arrays for the 8mer.

[code]
Keys     Values <- mermapDict
...
GGGGTTTT [8]
...
[/code]

So passing GGGGTTTT into the dictionary returns index 8, which gives you those arrays. Mask and compare dna with those arrays to locate the offset in a binary-search fashion, i.e. first try offset 4; if it passes try offset 6, otherwise offset 2, on down to the answer. The search can be done on just the head portion to locate an offset, then test the tail at that offset for a full match. There are lots of edge cases to consider and I’m not sure which way the binary search should branch, but this could turn a constant 16 maskings and 16 comparisons per 8mer hit into maybe 3-6 maskings and 3-6 comparisons.

I thought that if you checked the first 8mer and it is there, then just checking the last 8mer (at offset +6 or +7?) would be enough to see whether the whole 15mer is there. No need to go over all the intermediate matches.

That’s the challenge: you don’t know what the offset is. The task is to identify any 15mers from the query in the dna. 8mers are used for initial matching because they are quick to pull and locate with a dictionary, but when you get a match you don’t know if or where a full 15mer match may be around it. The 15mer may be at offset 0, or 1, or 2… The old strategy was testing all offsets 0-7; this new one allows for a binary search.

Also, remember that the dna is pulled 8 non-overlapping bytes at a time. If this is the dna…

AAAAAAAACCCCCCCCGGGGGGGGTTTTTTTT

only these get sent to the dictionary for hits

AAAAAAAA CCCCCCCC GGGGGGGG TTTTTTTT

And if say GGGGGGGG matches an 8mer in the query then the surrounding area needs to be searched for a 15mer match.
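The non-overlapping pull can be sketched the same way (Python for illustration):

```python
dna = "AAAAAAAACCCCCCCCGGGGGGGGTTTTTTTT"

# only these whole, non-overlapping 8-byte words get sent to the dictionary
chunks = [dna[i:i + 8] for i in range(0, len(dna) - 7, 8)]
```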

That's the challenge, you don't know what the offset is.

Thanks - got it now.