finding string matches

Hello all -

I am searching DNA sequences with a short query string. The query is always 15 characters long, but the target DNA may be hundreds of millions of characters long. I need to find ALL the positions of matches in the DNA for that query string. Currently I am doing this using InStr but for some of my searches it is taking several days to run. I want to speed this up. I can index the DNA is some way like a dictionary, which I am creating right now but it seems that will take 502 days to complete! One can also use a suffix array, but that is way beyond my skill level to implement. I could also break the DNA into segments and search those segments, with code to look for matches that span two segments - my gut tells me this would help but I do not know if that is true.

I read that using a regular expression would speed things up compared to InStr. How do I search with a regular expression and get back an integer array with all the match positions in the DNA target? Note that I only have to run this analysis once, but there are tens of thousands of query strings I need to run.

Thanks in advance.

Bob

First, the native RegEx class is very slow for the type of search you’re describing. If you can use RegExMBS from the MBS plugins, it will be much faster, like potentially hundreds or even thousands of times faster.

Indexing seems to be the way to go in this case. I assume encoding will not be a factor so you can use a MemoryBlock to split the source into 15-character chunks. From there, you can append the position to an array that is assigned to a Dictionary where the key is the sequence.

My concern is that RAM will become factor so you might want to use a database instead. You’d only need one table with two fields: sequence and position, and just index sequence field.

Which one would you like to explore?

Hi Kem -

Indeed I can use MBS plugins so that seems the way to go. Note that breaking the DNA into 15 mers would not work because the match could be in DNA positions 1-15 or 2-16, or 3-17, etc. Also, it is often the case that a query might have many matching sites in the DNA… perhaps several thousand. I’m not sure how this works in a database. Do you put multiple locations stored in each database entry? I know creating a dictionary is painfully slow. Will creating a database be better?

This is something new for me, so I really don’t have much experience in it, but thank you for your willingness to help.

Bob

No, if anything, a database would be slower. But perhaps show the code for creating the Dictionary?

Or is is that your are crawling the source to create a value for every 15-character sequence at every position? In other words, if your source is ABCDEFGHIJKLMNOP, would you end up with two sequences, ABCDEFGHIJKLMNO and BCDEFGHIJKLMNOP?

Yes, I am “crawling the source” to create every possible 15-mer. I did not see any other way to do that.

Right, I have to think on that. My gut feeling is to put it into a MemoryBlock, then loop through it using StringValue to get every 15-byte sequence, but I can see how that might take a long time.

Is the source data something you can share? If not, can you post a snippet and tell me how many bytes it is exactly? I’d like to experiment.

I’d use something like one of the fast search algorithms
Some have initialization set up that could be slow so you might avoid one of those
Boyer-Moore string search algorithm
Knuth-Morris-Pratt string search algorithm
Boyer-Moore-Horspool string search algorithm
Apostolico-Giancarlo string search algorithm
Aho-Corasick multi-pattern string search algorithm
Rabin-Karp multi-pattern string search algorithm
Suffix trees

And a db isn’t a bad idea
You know each DNA character is at a position so you could simply write it into the db one character at a time in a table like
index | character

Then you can search for the first character of your sequence get its index then find the next X many and see if that matches your sequence) You already know how long the pattern is and if the indexes are sequential your can get them with a simple query (select * from table where index between X and Y)
Lather rinse & repeat

A DB based Boyer Moore

That might not be too bad

The DNA strings are actually parts of human chromosomes. They range in size from about 300,00 characters to about 200,000,000 characters (there are actually about 170 different strings that all together make up the human genome). I can put one of them on my Dropbox site if you want to access it. Do you want a long one or will any one do?

I’d love to play around with this.

Longest you have.

OK. Let me dig it out and compress it for you.

Norm -

I have heard of some of these, but I have been trying to avoid using them. I do not know what they are and would have to learn about them, and then I am not sure that implementing the code is within my capabilities. I know that some of my fellow bioinformaticians use suffix arrays for exactly this kind of searching.

The very first change to make is use InstrB instead of Instr. Should make a big difference.

Wouldn’t prebuilding a dictionary or prefix lookup take more memory than the DNA sequence itself? I guess 200MB will fit in ram but a lookup might take 4x as much space causing memory to be paged around. I don’t really know how much an issue this is, I try to keep things under 100MB.

Anyways, I did a quick test scanning a 50MB Memoryblock, it takes about 6 seconds and this might be faster with pragmas and a built app. First, convert the first 8 characters of the 15-mer into a UInt64. Then scan over the MemoryBlock using a Ptr which is faster. Step byte by byte but test 8 bytes at a time. When a match is found you’d have to check the remaining 7 for a full match.

[code]Property: mem As MemoryBlock

//pushbutton to create the memory
Sub Action()

dim last As integer = 1000000 * 50

mem = new MemoryBlock(last)

last = last - 9 //-9 because I’m too lazy to think about ending bounds :slight_smile:

dim p As Ptr = mem
for i As integer = 0 to last step 8
p.UInt64(i) = &h0121324484239299
next

End Sub

//pushbutton to scan
Sub Action()

dim p As Ptr = mem

dim last As integer = mem.Size - 9

dim match As integer = 0

for i As integer = 0 to last
if p.UInt64(i) = &h0121324484239299 then
match = match + 1
end
next

Msgbox "matches " + Str(match)

End Sub[/code]

I am having to regenerate the chromosome file because I had everything in a single “human genome file” that was 2.87 GB. I will post the files as soon as I can…

OK. I have everything set. You can get the files at: https://dl.dropboxusercontent.com/u/54596924/DNAstrings.zip

The actual problem I am working on is to start with a QuerySource string and look for every occurrence of a 15-mer within that QuerySource in a target DNA. Thus, I look for 1-15, 2-16, 3-17, etc from QueryString. The target DNA is stored in chromosomeDNA.txt. There are a series of lines of 100,020 characters representing the DNA sequence of part of human chromosome 2 in this case. Each line contains 100,000 characters from the DNA sequence + 20 characters from the next sequence line. In other words, line(0) = chars 1 thru 100,020 of the DNA; line(1) = chars 100,001 thu 200,020 of the DNA, etc. This allows one to search for queries that are up to 21 long without having to worry about wraparound to adjacent lines.

In practice, I read each DNA sequence line from the file into a string array and then do the searches one array element at a time.

I hope this makes sense. I am very open to ways to improve this process.

Bob

QuerySource is 954 bytes, so that’s a total of ~940 individual 15-mers that you want to find in chromosomeDNA right? Using Ptr scanning with UInt64s it took 11.8 seconds to search for 1 15-mer in that data. To search all 940 this way would be 3.2 hours.

I hadn’t realized the structure of the overlapping input before. Are all the 15-mer sequences in there unique?

In different sequences you can have any number of occurrences of a sequence. You can not predict what the biology is. Sometimes there are sequences of several hundred that might be present millions of times in our genome. In fact, there is one sequence of a few hundred in length that is repeated so many times that is constitutes 11% of our whole genome!

I initially did not have the overlapping sequence structure but had a single long sequence for the DNA target. That turned out o be very slow, though. Breaking the string into smaller pieces made things go faster.

Ooooooo, I just had a really good idea. Let me go see how it turns out.