Oops. I made a mistake in my test and now it looks like MemoryBlock is only about 3.1x faster than split/join. Processing the same 1000-amino-acid sequence with 100 fragments 86000 times gave these results (in seconds):
[code]              total       identify cleaves   collect subsequences
String Array: 68.767485   2.316124           65.501802
MemoryBlock:  22.387569   1.034649           21.295609[/code]
Identifying cleavage points takes minimal time, but of course it's not applying real rules. I found a list of rules here http://web.expasy.org/peptide_cutter/peptidecutter_enzymes.html#exceptions which I believe can be implemented by reading 8 bytes at a time from the sequence, then masking and comparing (or doing a lookup) for a match. The "not before amino X" exception rules complicate this, but I don't imagine it'd be much slower, or slower than split/replace.
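To make the idea concrete, here's a rough sketch of what I mean, assuming a trypsin-like rule ("cleave after K or R, but not before P"). It reads 8 bytes at once and masks down to the two-byte pair being tested. This assumes seq.LittleEndian = True, and cleaveIdx() As Integer is a hypothetical array for the results:
[code]Dim pair As UInt64
For i As Integer = 0 To seq.Size - 8
  pair = seq.UInt64Value(i) And &hFFFF        ' keep only bytes i and i+1
  Dim res As Integer = pair And &hFF          ' residue at position i
  Dim nextRes As Integer = pair \ &h100       ' residue at position i+1
  If (res = Asc("K") Or res = Asc("R")) And nextRes <> Asc("P") Then
    cleaveIdx.Append i + 1                    ' cleavage point falls after residue i
  End If
Next[/code]
With the full rule table, the comparison could become a lookup keyed on the masked pair instead of the If chain.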
Reading subsequences from the MemoryBlock takes the most time, and it consists of two parts: summing the fragment lengths, then reading with MemoryBlock.StringValue. The length summing I coded is naive, repeating for each peptide, and it's about 20% of the subsequence-collecting time…
sum lengths: 4.981102 read string: 20.825534
These times are exaggerated by the extra timing code. The length summing can be improved, but I don't see how MemoryBlock.StringValue can be.
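For the length summing, one improvement would be a one-time prefix-sum pass so that any span's length becomes a single subtraction instead of a repeated loop. A sketch, using the fragLen() array from my pseudo-code:
[code]' Build cumulative lengths once per protein.
Dim cumLen() As Integer
Redim cumLen(fragLen.Ubound + 1)
cumLen(0) = 0
For i As Integer = 0 To fragLen.Ubound
  cumLen(i + 1) = cumLen(i) + fragLen(i)
Next

' Then the total length of fragments i through j inclusive is just:
' lengthSum = cumLen(j + 1) - cumLen(i)[/code]
That should remove most of that 20% without touching the StringValue part.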
Since each protein is peptidized independently, Karen's idea of using helper apps to process 8 proteins in parallel may offer the biggest speed-up.
Also, while I've tried to replicate your scenario, my times are far faster than yours, even for split/join (about 1 minute vs. 2 hours). These times cover only the algorithm: the resulting peptide strings are passed to a method, but that method is empty so it doesn't affect the measurement. Creating instances, filling in values, appending to an array, and so on take significant time, about 5 times what just creating the peptide data does. I think your real bottleneck might be instantiating many objects.
To optimize, you really need to pin down where the time is going. You said a real-world example takes over 2 hours, with each protein taking 0.1 seconds. How does that 0.1 seconds break down? Add timing points around specific parts and triangulate on what's actually taking the most time, and by how much. This can be tricky, though. Continuing what I said above about instance creation adding time: appending to an array increased the cleave-identifying part from 1 second to 7, even though it has nothing to do with the peptide array. There can be knock-on effects like that which are difficult to predict and which you wouldn't notice unless measured.
Another thing is to test and quantify the possibilities. RegEx may not sound faster, but it might be, possibly much faster. If what you're after is speed and RegEx gives the best times, then you can work out how to translate user settings into a RegEx pattern. Also test whether you even need this data at all, as Norman suggested: all you really need are the start and length values of each peptide; the string can be retrieved from those and the original sequence when needed.
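A sketch of that idea (hypothetical pepStart()/pepLen() arrays in place of my storePep call), deferring the StringValue work until a peptide's text is actually needed:
[code]Dim pepStart() As Integer
Dim pepLen() As Integer

' Instead of storePep( seq.StringValue(fragIdx(i), lengthSum) ):
pepStart.Append fragIdx(i)
pepLen.Append lengthSum

' Later, only when peptide k's text is actually required:
Dim s As String = seq.StringValue(pepStart(k), pepLen(k))[/code]
That skips both the StringValue calls and the string allocations in the hot loop, which is where 80% of my collect time went.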
Here’s pseudo-code of how my test was structured with timing points. Uncommenting the storePep code greatly increases time.
[code]Sub Action()
  Dim seq As MemoryBlock = getSequenceMem
  sumCleaveTime = 0
  sumCollectTime = 0
  t1 = Microseconds
  For i As Integer = 1 To 86000
    testMem(seq)
  Next
  t2 = Microseconds
  printTimes
End Sub

Private Sub testMem(seq As MemoryBlock)
  t1 = Microseconds
  //=================== identify cleaves (5%)
  Dim fragIdx() As Integer  //start byte of fragment
  Dim fragLen() As Integer  //byte length of fragment
  scanSeqAndBuildFragArrays
  t2 = Microseconds
  //=================== collect peptides (95%)
  'Redim pepList(-1)
  For Each cleave size
    For Each fragment start
      calcLengthSum  //20% of the 95%
      storePep( seq.StringValue(fragIdx(i), lengthSum) )  //80% of the 95%
    Next
  Next
  t3 = Microseconds
  sumCleaveTime = sumCleaveTime + (t2 - t1)
  sumCollectTime = sumCollectTime + (t3 - t2)
End Sub

Property pepList() As Peptide

Private Sub storePep(s As String)
  'Dim c As New Peptide
  'c.data = s
  'c.prop1 = 5
  'c.prop2 = 7.2
  'pepList.Append c
End Sub[/code]