Fastest way to split & concatenate strings?

To clarify:

So ABCDEFG is the sequence of a protein.

You have literally billions of proteins with that sequence that are digested (aka split up)

A B C D E F G AB BC CD DE EF FG ABC BCD CDE DEF EFG ABCD BCDE CDEF DEFG are the different ways in which that split can happen

So you end up with all the possible fragments.
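To make the enumeration concrete, here is a small sketch (in Python, just to illustrate the combinatorics; the thread's actual code is Xojo) that produces every contiguous fragment of a sequence:

```python
def all_fragments(seq):
    # every contiguous substring of seq, i.e. every possible fragment
    # regardless of how many cleavages were missed
    return [seq[i:j] for i in range(len(seq))
                     for j in range(i + 1, len(seq) + 1)]

frags = all_fragments("ABCDEFG")
# 7*8/2 = 28 fragments: "A", "AB", ..., up to the full "ABCDEFG"
```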

Each fragment of the protein is then turned into its own object (of type Peptide), e.g. the fragment EFG becomes

[code]pep.sequence = "EFG"
pep.Length = 3
pep.mass = 312.856242
pep.hydrophobicity = 2.7
pep.charge = 0
pep.Charge_LCMS = 1[/code]

etc.
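For comparison, the same record could be sketched as a Python dataclass (the field names follow the post; the default values are placeholders, not computed from real residue masses):

```python
from dataclasses import dataclass

@dataclass
class Peptide:
    sequence: str
    mass: float = 0.0            # placeholder; real masses come from residue mass tables
    hydrophobicity: float = 0.0  # placeholder
    charge: int = 0
    charge_lcms: int = 1

    @property
    def length(self) -> int:
        return len(self.sequence)

pep = Peptide(sequence="EFG", mass=312.856242, hydrophobicity=2.7)
# pep.length -> 3
```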

You can set that to ASCII encoding to make certain you have one byte per character. But normally, if there are no characters other than ABCDEFG and space, this is already the case.

In support of Michel’s response, the beauty of UTF-8 is that any character below code point 128 is single-byte and can be treated like a traditional ASCII-encoded string. In Markus’ case, encoding won’t come into play at all.

Thanks Kem for clearing this up. I thought that would be the case but wasn’t sure.

I don’t know how this equates but I tested split/join vs memoryblock reads using a 1000 letter protein sequence containing ~100 R’s and K’s, in a built app with all pragmas on and found memoryblock 60x faster.

In the test, the Memoryblock algorithm's input sequence doesn't contain underscores. Does your input sequence automatically have those underscores or are you adding them? Still, even if you have to make a version without underscores, it'd probably be faster.

In your example sequence there’s a K without a trailing _. Is that a mistake or are _'s only after certain R’s and K’s?

[code]seq = "MLK_DEHTR_WASR_QLCK_ILTVADESSAKPDK_ASSR_"
                                     ^ K[/code]

He is trying to predict what peptides will be observed by a mass spectrometer after the protein is digested, basically to see if the data produced is consistent with a particular protein.

in looking at all the possible peptides formed when you digest a protein with an enzyme, the digestion is not perfect… Some cleavages can be missed for several reasons.

In some cases it depends on what other amino acids are around the amino acid where cleavage normally occurs. Because it is an enzyme doing the cleavage, think about the lock-and-key model you heard of in school: if the cleavage site has certain amino acids right around it, they affect the 3D shape of the molecule at that point and the key does not fit (and the software knows about that).

In some cases it is simply pure chance… This type of software typically lets the user specify how many missed cleavages to consider when cutting up the protein

Depending on the protease used there can be quite a number of rules to apply, e.g. “After K but not when followed by P” (like in this case)

I basically apply each rule by adding/removing underscores using ReplaceAll. Probably not the smartest choice. Might be better to just make a list of positions …

P.S. Remembered one thing to consider: the end user has to be able to specify / modify the rules. That's why I stuck with simple ReplaceAll rules where an underscore denotes the cut site. Originally I used spaces, but then the users couldn't see the site.
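The ReplaceAll approach can be sketched like this (Python's `str.replace` standing in for Xojo's ReplaceAll; the two-pass trick handles the "not before P" exception using literal replaces only):

```python
def apply_rule(seq):
    # pass 1: mark a cut after every K
    s = seq.replace("K", "K_")
    # pass 2: un-mark the exception, i.e. K followed by P
    s = s.replace("K_P", "KP")
    return s

apply_rule("AKPBKC")  # -> "AKPBK_C"
```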

the fastest way to split & concat is to NOT do it at all - ever

create a bunch of tabular look ups into the original data - start point, length etc that define a “chunk” from the original string

shoving it all in a memory block might give you some speed ups for byte searching etc but I’d seriously rethink splitting to create many many repetitions of the same strings over & over with only minor differences (one extra char)
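Norman's suggestion amounts to something like this (a Python sketch; the (start, length) pairs index into the one original string, so no fragment strings are copied until they are actually needed):

```python
def fragment_spans(seq, max_len=None):
    # (start, length) pairs describing every chunk of seq;
    # no substring copies are made here
    n = len(seq)
    max_len = max_len or n
    return [(i, l) for i in range(n)
                   for l in range(1, min(max_len, n - i) + 1)]

seq = "ABCDEFG"
spans = fragment_spans(seq)
# materialize a fragment only when it is actually needed:
start, length = spans[0]
text = seq[start:start + length]
```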

but thats just my 2 cents

It is not just one extra char. Each letter stands for a fragment.

For example: a simple rule for Trypsin (the real one is more complicated) is that it cuts after every K and R, but not if that residue is followed by a P.

A protein could be like this

MADTRASHLTVCNLADRDSAAWQPLLRRRPTGVVYSAWFLVGTPKRPTTKAANR

So for each protein you have to find the fragments and then also combinations for “missed cleavage sites”
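A sketch of that digestion logic (in Python; the simplified trypsin rule from the post, with `missed` as a made-up parameter name for the number of missed cleavages to allow):

```python
def cut_points(seq):
    # cut after K or R, but not when the next residue is P (simplified trypsin rule)
    return [i + 1 for i, aa in enumerate(seq)
            if aa in "KR" and not (i + 1 < len(seq) and seq[i + 1] == "P")]

def digest(seq, missed=1):
    # fragment boundaries, including start and end of the sequence
    bounds = sorted(set([0] + cut_points(seq) + [len(seq)]))
    peps = []
    for i in range(len(bounds) - 1):
        # join up to missed+1 consecutive fragments into one peptide
        for j in range(i + 1, min(i + 2 + missed, len(bounds))):
            peps.append(seq[bounds[i]:bounds[j]])
    return peps
```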

While I “could” define lookup tables the problem is that a user can specify his own replacement rules, so then I need a general way to translate his replacement rules into lookup tables.

Hmm… maybe it makes sense to use RegEx as the way to do that. Your rule would correspond to a search pattern of “([KR])([^P])” and a replacement pattern of “$1_$2” as an example.

Yes, I thought about that too. Problem again is that the end user is competent with computers but not a programmer in any sense of the word. Might still go for it though, or at least have it as an option.
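For what it's worth, the rule translates directly (Python spells the replacement \1_\2 rather than $1_$2; a lookahead variant also catches back-to-back cut sites such as RK, which the two-group pattern skips because the second character gets consumed by the match):

```python
import re

def mark_cuts(seq):
    # two-group version from the thread: re.sub(r"([KR])([^P])", r"\1_\2", seq)
    # lookahead version: the following character isn't consumed,
    # so consecutive sites like "RK" both get checked
    return re.sub(r"([KR])(?=[^P])", r"\1_", seq)

mark_cuts("MLKDEHTRWASRKPK")  # -> "MLK_DEHTR_WASR_KPK"
```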

Oops. I made a mistake in my test and now it looks like Memoryblock is only about 3.1x faster than split/join. Processing the same 1000 amino sequence having 100 fragments 86000 times gave these results (in seconds)

[code]              total      id cleaves (sum1)  collect subsequences (sum2)
String Array  68.767485  2.316124           65.501802
Memoryblock   22.387569  1.034649           21.295609[/code]

Identifying cleavage points takes minimal time but of course it’s not applying real rules. I found a list of rules here http://web.expasy.org/peptide_cutter/peptidecutter_enzymes.html#exceptions which I believe can be implemented by reading 8 bytes at a time from the sequence then masking and comparing/lookup for a match. The “not amino” rules complicate this but I don’t imagine it’d be much slower, or slower than split/replace.
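The 8-byte mask-and-compare idea could look roughly like this (a Python sketch with invented names; a compiled language would do the same with a single 64-bit load):

```python
def window_matches(buf, pos, pattern, mask):
    # read 8 bytes at pos (zero-padded near the end of the buffer),
    # blank out the "don't care" bytes with the mask,
    # then compare against the pattern in one shot
    window = int.from_bytes(buf[pos:pos + 8].ljust(8, b"\x00"), "little")
    return (window & mask) == (pattern & mask)

buf = b"KDEHTRWA"
# care only about byte 0 being "K": mask keeps the low byte, ignores the rest
window_matches(buf, 0, ord("K"), 0xFF)   # True
window_matches(buf, 1, ord("K"), 0xFF)   # False ("D" is at that position)
```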

Reading subsequences from the memoryblock takes the most time and this consists of 2 parts: summing the fragment lengths then read with Memoryblock.StringValue. The length summing I coded is naive, repeating for each peptide, and it’s about 20% of the subsequence collecting time…

sum lengths: 4.981102    read string: 20.825534

These times are exaggerated because of extra timing code, the length summing can be improved but I don’t see how Memoryblock.StringValue can be.
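The naive per-peptide summing can be replaced with a one-time prefix sum, which makes each peptide's byte range an O(1) lookup (a Python sketch with made-up fragment lengths):

```python
from itertools import accumulate

frag_len = [2, 3, 1, 4]                     # byte length of each fragment
offset = [0] + list(accumulate(frag_len))   # start byte of each fragment, computed once

def span(first_frag, n_frags):
    # byte range of a peptide built from n_frags consecutive fragments,
    # O(1) per peptide instead of re-summing lengths each time
    start = offset[first_frag]
    length = offset[first_frag + n_frags] - start
    return start, length

span(1, 2)  # -> (2, 4): fragments 1 and 2 cover bytes 2..5
```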

Since each protein is peptidized independently, Karen's idea of helper apps so you can parallel-process 8 proteins at a time may offer the biggest speed-up.

Also, while I've tried to replicate your scenario, my times are way faster than yours, even for split/join (1m vs 2h). These times are just of the algorithm; the resulting peptide strings are passed to a method, but that method is empty so it doesn't impact the measure. Creating instances, shoving in values, appending to an array… take significant time, 5 times what just creating the peptide data takes. I think your real bottleneck might be instantiating many things.

To optimize you really need to pin down where the time is coming from. You said it takes over 2 hours in a real-world example, with each protein taking 0.1 seconds. How does that 0.1 seconds break down? Add timing points around certain parts and triangulate in on what's actually taking the most time, and by how much. This can be tricky though. As I said above about creating instances adding time: appending to an array increased the cleave-identifying part from 1 second to 7, even though it has nothing to do with the peptide array. There can be knock-on effects like that which are difficult to predict and you wouldn't notice unless measured.
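One way to set up such timing points (a Python sketch; the thread's Microseconds calls play the same role in Xojo):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label):
    # accumulate wall-clock time per labelled section across many calls
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + (time.perf_counter() - t0)

with timed("identify cleaves"):
    pass  # ...cleave-finding code goes here...
with timed("collect peptides"):
    pass  # ...StringValue reads, instance creation, appends...
```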

Another thing is to test and quantify the possibilities. RegEx may not sound faster, but it might be; it might be way faster. If what you're after is speed and RegEx is giving the best times, then you can work on how to translate user settings into a RegEx query. Also test whether you even need this data at all, which is what Norman suggested: all you really need are the start and length values of each peptide; the actual data can be retrieved from those and the original sequence when needed.

Here’s pseudo-code of how my test was structured with timing points. Uncommenting the storePep code greatly increases time.

[code]Sub Action()

dim seq As Memoryblock = getSequenceMem

sumCleaveTime = 0
sumCollectTime = 0

t1 = Microseconds
for i = 1 to 86000 // repeat the whole digest 86000 times
testMem(seq)
next
t2 = Microseconds

printTimes

End Sub

Private Sub testMem(seq As MemoryBlock)

t1 = Microseconds
//===================identify cleaves 5%
dim fragIdx() As integer //start byte of fragment
dim fragLen() As integer //byte length of fragment
scanSeqAndBuildFragArrays

t2 = Microseconds
//========================== collect peptides 95%
'redim pepList(-1)
for each cleave size
for each fragment start
calcLengthSum //20% of 95%
storePep( seq.StringValue(fragIdx(i), lengthSum) ) //80% of 95%
next
next

t3 = Microseconds
sumCleaveTime = sumCleaveTime + (t2-t1)
sumCollectTime = sumCollectTime + (t3-t2)
End Sub

Property pepList() As Peptide

Private Sub storePep(s As String)
'dim c As new Peptide
'c.data = s
'c.prop1 = 5
'c.prop2 = 7.2
'pepList.Append c
End Sub[/code]