That’s in me!?
Maybe there's a crossed wire; I meant the part about there being 40 15mers isn't right. 15mers overlap: the first starts at byte 0 and extends 15 bytes, the next starts at byte 1, the third starts at byte 2, and so on.
While working up a complete map of this algorithm and its data structures I printed out the dictionary and noticed a very regular pattern: every 8mer Key had 8 (or 16) indices to test, and the indices were regularly spaced. Making sense of this leads to a reframing of the problem as matching 8mers, and a possible speedup.
But first here’s the old/original way I’ve been doing it with 15mers. This example uses a short input query of 19 bytes.
CCAGTACCAATGTACCCCA <- input query
Five 15mers are pulled out of this query and stored in displayMers() at the given index. The index value is the same as the 15mer's starting byte position.
[code]
01234 ← starting byte positions
CCAGTACCAATGTACCCCA ← input query
0 CCAGTACCAATGTAC ← displayMers/individual 15mers, index = start byte
1 CAGTACCAATGTACC
2 AGTACCAATGTACCC
3 GTACCAATGTACCCC
4 TACCAATGTACCCCA[/code]
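In Python that extraction step looks roughly like this (a sketch for illustration; extract_mers is a made-up name, not the real displayMers code):

```python
# Sketch of the extraction step (extract_mers is a hypothetical name).
# Index i in the returned list equals the 15mer's starting byte position.
def extract_mers(query: str, k: int = 15) -> list[str]:
    return [query[i:i + k] for i in range(len(query) - k + 1)]

mers = extract_mers("CCAGTACCAATGTACCCCA")   # 19 - 15 + 1 = 5 overlapping 15mers
```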
Because each 15mer needs to be matched at any 8-byte span, they're windowed/expanded into the mersHead/Match/Tail/Offset/StrIdx parallel arrays. An offset of 0 starts the 15mer aligned with the start of match; an offset of 1 starts it 1 byte back into the head; and so on, until offset 7 starts it 7 bytes back into the head so the 15mer ends exactly at the end of match.
In the end the match column contains every 8mer inside every 15mer (this forms the redundancy that can be optimized).
Also, they are listed with offset 7 down to 0 so the left-most versions of the 15mers will match first, which causes the results to be collected in ascending order.
index
|
v offset head match tail strIdx <- parallel mer* arrays
0 7 0CCAGTAC CAATGTAC 00000000 0
1 7 0CAGTACC AATGTACC 00000000 1
2 7 0AGTACCA ATGTACCC 00000000 2
3 7 0GTACCAA TGTACCCC 00000000 3
4 7 0TACCAAT GTACCCCA 00000000 4
5 6 00CCAGTA CCAATGTA C0000000 0
6 6 00CAGTAC CAATGTAC C0000000 1
7 6 00AGTACC AATGTACC C0000000 2
8 6 00GTACCA ATGTACCC C0000000 3
9 6 00TACCAA TGTACCCC A0000000 4
10 5 000CCAGT ACCAATGT AC000000 0
11 5 000CAGTA CCAATGTA CC000000 1
12 5 000AGTAC CAATGTAC CC000000 2
13 5 000GTACC AATGTACC CC000000 3
14 5 000TACCA ATGTACCC CA000000 4
15 4 0000CCAG TACCAATG TAC00000 0
16 4 0000CAGT ACCAATGT ACC00000 1
17 4 0000AGTA CCAATGTA CCC00000 2
18 4 0000GTAC CAATGTAC CCC00000 3
19 4 0000TACC AATGTACC CCA00000 4
20 3 00000CCA GTACCAAT GTAC0000 0
21 3 00000CAG TACCAATG TACC0000 1
22 3 00000AGT ACCAATGT ACCC0000 2
23 3 00000GTA CCAATGTA CCCC0000 3
24 3 00000TAC CAATGTAC CCCA0000 4
25 2 000000CC AGTACCAA TGTAC000 0
26 2 000000CA GTACCAAT GTACC000 1
27 2 000000AG TACCAATG TACCC000 2
28 2 000000GT ACCAATGT ACCCC000 3
29 2 000000TA CCAATGTA CCCCA000 4
30 1 0000000C CAGTACCA ATGTAC00 0
31 1 0000000C AGTACCAA TGTACC00 1
32 1 0000000A GTACCAAT GTACCC00 2
33 1 0000000G TACCAATG TACCCC00 3
34 1 0000000T ACCAATGT ACCCCA00 4
35 0 00000000 CCAGTACC AATGTAC0 0
36 0 00000000 CAGTACCA ATGTACC0 1
37 0 00000000 AGTACCAA TGTACCC0 2
38 0 00000000 GTACCAAT GTACCCC0 3
39 0 00000000 TACCAATG TACCCCA0 4
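That windowing can be sketched roughly like this (Python for illustration; the list names mirror the mer* parallel arrays but everything here is an assumption, not the real code):

```python
# Illustrative sketch of the windowing, not the actual implementation.
# For each offset 7 down to 0, each 15mer is split into a zero-padded
# 8-byte head, an 8-byte match, and a zero-padded 8-byte tail.
query = "CCAGTACCAATGTACCCCA"
mers = [query[i:i + 15] for i in range(len(query) - 14)]

heads, matches, tails, offsets, str_idx = [], [], [], [], []
for off in range(7, -1, -1):                     # offset 7 down to 0
    for i, mer in enumerate(mers):
        heads.append("0" * (8 - off) + mer[:off])    # bytes before the match
        matches.append(mer[off:off + 8])             # the 8mer itself
        tails.append((mer[off + 8:] + "0" * 8)[:8])  # bytes after the match
        offsets.append(off)
        str_idx.append(i)
```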
Now the dictionary is built from the parallel arrays. The match column is used for Keys and each index is added to the Value array, resulting in…
Keys Values <- mermapDict
CAATGTAC [0, 6, 12, 18, 24]
AATGTACC [1, 7, 13, 19]
ATGTACCC [2, 8, 14]
TGTACCCC [3, 9]
GTACCCCA [4]
CCAATGTA [5, 11, 17, 23, 29]
ACCAATGT [10, 16, 22, 28, 34]
TACCAATG [15, 21, 27, 33, 39]
GTACCAAT [20, 26, 32, 38]
AGTACCAA [25, 31, 37]
CAGTACCA [30, 36]
CCAGTACC [35]
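A minimal sketch of that dictionary build (Python for illustration; mermap stands in for mermapDict):

```python
# Sketch of the dictionary build: the match column supplies the Keys and
# each row's index is appended to the Value array.
query = "CCAGTACCAATGTACCCCA"
mers = [query[i:i + 15] for i in range(len(query) - 14)]
matches = [mer[off:off + 8] for off in range(7, -1, -1) for mer in mers]

mermap = {}
for idx, key in enumerate(matches):
    mermap.setdefault(key, []).append(idx)
```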
In the dna scanning loop, if the dictionary contains the current 8mer as a Key, the Value tells you which 15mers (and at what offsets) contain that 8mer, and each candidate is tested for a full match using its head and tail data.
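Roughly, the old scheme fits together like this (a hedged end-to-end sketch in Python; scan is a made-up name, and comparing against '0'-padded strings stands in for whatever masking the real code does):

```python
# End-to-end sketch of the current 15mer scheme, for illustration only.
query = "CCAGTACCAATGTACCCCA"
mers = [query[i:i + 15] for i in range(len(query) - 14)]

heads, matches, tails, offsets = [], [], [], []
for off in range(7, -1, -1):          # offset 7 down to 0, as in the table
    for mer in mers:
        heads.append("0" * (8 - off) + mer[:off])
        matches.append(mer[off:off + 8])
        tails.append((mer[off + 8:] + "0" * 8)[:8])
        offsets.append(off)

mermap = {}
for idx, m in enumerate(matches):
    mermap.setdefault(m, []).append(idx)

def scan(dna):
    hits = []
    for pos in range(len(dna) - 7):
        for idx in mermap.get(dna[pos:pos + 8], ()):
            off = offsets[idx]
            if pos - off < 0 or pos + 15 - off > len(dna):
                continue                # 15mer would run off the end of dna
            masked_head = "0" * (8 - off) + dna[pos - off:pos]
            masked_tail = dna[pos + 8:pos + 15 - off] + "0" * (off + 1)
            if masked_head == heads[idx] and masked_tail == tails[idx]:
                hits.append(pos - off)  # start of the full 15mer match in dna
                break                   # leftmost candidate listed first
    return hits
```

Note how every 8-byte window inside one matched 15mer produces a separate dictionary hit that must be head/tail tested; that's the redundancy discussed next.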
That's the way I have it now. Maybe this is what you've been suggesting, or it's been mentioned before, but there's a redundancy I've just noticed: because each 15mer is offset 1 from the next, and because each 15mer is windowed, they end up sharing the same 8mers, just at different offsets.
What happens with a full-size query is that most of the dictionary's index arrays are 8 elements long, pointing sequentially to the 15mers that are windowed around the same 8mer.
The problem with this is that all 15mers have to be tested even though they represent the same matched 8mer region in the input query. Instead I've reframed my thinking around matching 8mers (that's really what's happening anyway). So for each 8mer in the input query, build the 15mers that'll window around it. That's it really. Instead of building all 15mers and then all 8mers of them, build all 8mers and then all 15mers around them.
What this allows: once you find an 8mer match, you get a single index to an array of premasked heads and tails around that 8mer. These can be binary-searched for the offset and the full match.
If done right this may reduce the number of maskings and comparisons needed for a full match, maybe by half or more. Since match testing constitutes 12% of scan time, this change might be a 6% or better speedup, though it will be tricky to implement.
Here's an example of how it might work.
[code]
012 8 ← 8mer starting position / index
AAAACCCCGGGGTTTTAAAATTTTCCCC ← input query
0 AAAACCCC ← expands to these 8mers
1 AAACCCCG
2 AACCCCGG
…
8 GGGGTTTT
…[/code]
then for each 8mer, window all the 15mers around it…
[code]
head match tail
AAAACCCC GGGGTTTT AAAATTTT ← the 8mer at index 8 has these before and after regions
offset
00000000 GGGGTTTT AAAATTT0 0 ← precompute head/tail masking at all offsets around this 8mer
0000000C GGGGTTTT AAAATT00 1
000000CC GGGGTTTT AAAAT000 2
00000CCC GGGGTTTT AAAA0000 3
0000CCCC GGGGTTTT AAA00000 4
000ACCCC GGGGTTTT AA000000 5
00AACCCC GGGGTTTT A0000000 6
0AAACCCC GGGGTTTT 00000000 7[/code]
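A sketch of that precompute step (window_around is a hypothetical name; it assumes an interior 8mer with a full 8 bytes of context on each side, so 8mers near the ends of the query would need shorter, zero-filled regions):

```python
# Hedged sketch: given an 8mer's position in the query, build the 8 masked
# head/tail pairs shown above, one per offset 0..7.
def window_around(query: str, pos: int):
    before = query[pos - 8:pos]       # 8 bytes preceding the 8mer
    after = query[pos + 8:pos + 16]   # 8 bytes following the 8mer
    heads = ["0" * (8 - o) + before[8 - o:] for o in range(8)]
    tails = [after[:7 - o] + "0" * (o + 1) for o in range(8)]
    return heads, tails

heads, tails = window_around("AAAACCCCGGGGTTTTAAAATTTTCCCC", 8)
```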
The dictionary gets filled with all 8mers as Keys, but now each 8mer only needs 1 index in the Value array. That index points to the 8-element head and tail arrays for the 8mer.
Keys Values <- mermapDict
...
GGGGTTTT [8]
...
So passing GGGGTTTT into the dictionary returns index 8, which gives you those arrays. Mask and compare dna with those arrays to locate the offset in a binary-search fashion, i.e. first try offset 4; if it passes try offset 6, otherwise offset 2, on down to the answer. The search can be done on just the head portion to locate an offset, then the tail is tested at that offset for a full match. There are lots of edge cases to consider and I'm not sure which way the binary search should branch, but this could turn a constant 16 maskings and 16 comparisons per 8mer hit into maybe 3-6 of each.
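One way the branching could go: since the head at offset o only checks the o bytes immediately before the hit, a matching head at offset o implies the heads at every smaller offset also match. So the matching heads form a prefix of 0..7, and the search can hunt for the largest matching offset, then verify the tail once. A hedged sketch (find_offset is a hypothetical name; handling hits near either end of dna is omitted):

```python
# Sketch of the binary-search offset lookup around an 8mer hit.
# Assumption about the branch direction: head matches are monotone (if
# offset o's head matches, so does every smaller offset's), so we search
# for the largest matching head offset, then test the tail at that offset.
query = "AAAACCCCGGGGTTTTAAAATTTTCCCC"
before, after = query[0:8], query[16:24]   # regions around the 8mer at pos 8
heads = ["0" * (8 - o) + before[8 - o:] for o in range(8)]
tails = [after[:7 - o] + "0" * (o + 1) for o in range(8)]

def find_offset(dna, pos, heads, tails):
    lo, hi = 0, 7
    while lo < hi:                          # 3 probes narrow 8 offsets to 1
        mid = (lo + hi + 1) // 2
        masked = "0" * (8 - mid) + dna[pos - mid:pos]
        if masked == heads[mid]:
            lo = mid                        # head matches: try a larger offset
        else:
            hi = mid - 1
    masked_tail = dna[pos + 8:pos + 15 - lo] + "0" * (lo + 1)
    return lo if masked_tail == tails[lo] else None   # offset of full match
```

That's 3 head maskings/comparisons plus 1 tail test per hit, in line with the 3-6 estimate above.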