Fastest way to split & concatenate strings?

Markus_Winter · October 10, 2014, 2:01pm

Hi,

I have a part of my app where long strings are split (using RepalceAllB) according to user set rules (e.g. after each R)

Then the strings are concatenated (again according to user set parameters) and added to a list for further processing, e.g.

ABCDEFG -> A B C D E F G -> A B C D E F G AB BC CD DE EF FG ABC BCD CDE DEF EFG ABCD BCDE CDEF DEFG

Each string is (on average) processed in 0.1 sec which sounds fine until you try a real world example and you need to process 80,000 strings. Then it takes over 2 hours.

Not good.

What would be the fastest way to do something like this?

TiA

Markus

P.S. tips involving MBS etc are fine

KarenA · October 10, 2014, 2:13pm

Assuming ASCII or UTF8:

DigestionArray() = ProteinString.SplitB(AminoAcidLetter)
Create the desired Array of Peptides then
ProteinSequenceString = Join(PeptideArray,"")

I don’t think you can get faster than that with pure Xojo code.

BTW If you want MBS solutions you really should mention platform as much of MBS is platform specific.

Markus_Winter · October 10, 2014, 2:51pm

Thanks Karen. I need it for MacOS X and Windows.

Going from seq = frag1 + frag2 + frag3

to using a frag() and doing a joint speeds it up slightly:

This is with 486 “protein” strings:
Uniprot_MYCGE_20120626
ImportFASTAfile: 0,003 Seconds
Process_FASTAfile: 1,058 Seconds
Process_ProteinList: 19,872 Seconds <- using seq = frag1 + frag2
Update_OccurencesInDigest_in_PeptideTable: 0,726 Seconds
Update_ProteinProperties_in_ProteinTable: 0,240 Seconds
Deal with Error or commit: 0,000 Seconds
DisplayNewDatabaseRecords: 2,844 Seconds
DisplayTableStructure: 0,002 Seconds
LoadAndDisplaySettings: 0,151 Seconds

Uniprot_MYCGE_20120626
ImportFASTAfile: 0,003 Seconds
Process_FASTAfile: 0,685 Seconds
Process_ProteinList: 17,967 Seconds <- using join
Update_OccurencesInDigest_in_PeptideTable: 0,698 Seconds
Update_ProteinProperties_in_ProteinTable: 0,236 Seconds
Deal with Error or commit: 0,000 Seconds
DisplayNewDatabaseRecords: 2,844 Seconds
DisplayTableStructure: 0,002 Seconds
LoadAndDisplaySettings: 0,152 Seconds

KarenA · October 10, 2014, 3:01pm

[quote=134618:@Markus Winter]Process_ProteinList: 17,967 Seconds <- using join
[/quote]

How is your RAM usage… Is paging a bottle neck?

If you are CPU bound have you thought about using helper apps to do that processing so you can use multiple cores simultaneously?

Karen

Markus_Winter · October 10, 2014, 3:13pm

Not yet as when I run it I make sure I have about 4 GB of RAM free.

It works with my Mycoplasma proteomes however when going to the human proteome 32bit is more of a problem as RAM usage is then going up to 3 GB (using an in-memory database for speed)

Yes. Looking into Thomas Tempelman’s MP template for that, but want to optimise the code before I do. Looking into the MBS goodies at the moment

Antonio_Rinaldi · October 10, 2014, 3:32pm

Your doing something wrong
join is really much more faster than simple concatenation

Markus_Winter · October 10, 2014, 3:37pm

[quote=134629:@Antonio Rinaldi]Your doing something wrong
join is really much more faster than simple concatenation[/quote]

Well, I’m always open for suggestions but going from this

[code] // if a seq end in “R” or “K” then it is replaced with “R_”, then split, so the last array entry would be empty
// seq = “MLK_DEHTR_WASR_QLCK_ILTVADESSAKPDK_ASSR_”

fragments = split( seq, “" ) // return array
if fragments( fragments.Ubound ) = "” then fragments.Remove( fragments.Ubound )

// so now we have all the fragments like in a complete digest but nicely lined up one after the other in the fragments array

Prot.NumberOfPeptidesInCompleteDigest = fragments.Ubound + 1 // number of fragments for complete digest

// now turn the fragments into peptides with 0, 1, or 2 missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound // create peptides with no missed dcleavage

seq = fragments( fragmentNumber )

CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 0, fragmentNumber, Prot, fragments )

// ******************************************************************
// to implement: this general formula:
'// Prot.NumberOfPeptidesInDigest = CalculateTotalNumberOfFragments( NumberOfCleavageSites, MaxNumberOfMissedCleavages )
’
'Prot.NumberOfPeptidesInDigest = CalculateTotalNumberOfFragments( fragments.Ubound, DigestSettings.MaxNumberMissedCleavages )

if DigestSettings.MaxNumberMissedCleavages > 0 then // so create peptides with one missed cleavage

for fragmentNumber as integer = 0 to fragments.Ubound - 1   // make peptides for ONE missed cleavage
  
  seq = fragments( fragmentNumber ) + fragments( fragmentNumber + 1)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 1, fragmentNumber, Prot, fragments )
  
next

Prot.NumberOfPeptidesInDigest = 2* ( fragments.Ubound + 1 ) - 1  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 1 then // so create peptides with two missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 2   // make peptides for TWO missed cleavage
  
  seq = fragments( fragmentNumber ) + fragments( fragmentNumber + 1)  + fragments( fragmentNumber + 2)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 2, fragmentNumber, Prot, fragments )
  
next

Prot.NumberOfPeptidesInDigest =  3 * ( fragments.Ubound + 1 ) - 3  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 2 then // so create peptides with three missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 3   // make peptides for TWO missed cleavage
  
  seq = fragments( fragmentNumber ) + fragments( fragmentNumber + 1)  + fragments( fragmentNumber + 2) + fragments( fragmentNumber + 3)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 3, fragmentNumber, Prot, fragments )
  
next

Prot.NumberOfPeptidesInDigest =  4 * ( fragments.Ubound + 1 ) - 6  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 3 then // so create peptides with four missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 4   // make peptides for TWO missed cleavage
  
  seq = fragments( fragmentNumber ) + fragments( fragmentNumber + 1)  + fragments( fragmentNumber + 2) + fragments( fragmentNumber + 3) + fragments( fragmentNumber + 4)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 4, fragmentNumber, Prot, fragments )
  
next

Prot.NumberOfPeptidesInDigest =  5 * ( fragments.Ubound + 1 ) - 10  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 4 then // 5 is the max so create peptides with five missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 5   // make peptides for TWO missed cleavage
  
  seq = fragments( fragmentNumber ) + fragments( fragmentNumber + 1)  + fragments( fragmentNumber + 2) + fragments( fragmentNumber + 3) + fragments( fragmentNumber + 4) + fragments( fragmentNumber + 5)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 5, fragmentNumber, Prot, fragments )
  
next

Prot.NumberOfPeptidesInDigest =  6 * ( fragments.Ubound + 1 ) - 15  // number of fragments for digest

end if[/code]

to this

[code] // if a seq end in “R” or “K” then it is replaced with “R_”, then split, so the last array entry would be empty
// seq = “MLK_DEHTR_WASR_QLCK_ILTVADESSAKPDK_ASSR_”

fragments = split( seq, “" ) // return array
if fragments( fragments.Ubound ) = "” then fragments.Remove( fragments.Ubound )

// so now we have all the fragments like in a complete digest but nicely lined up one after the other in the fragments array

Prot.NumberOfPeptidesInCompleteDigest = fragments.Ubound + 1 // number of fragments for complete digest

// now turn the fragments into peptides with 0, 1, or 2 missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound // create peptides with no missed dcleavage

seq = fragments( fragmentNumber )   

CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 0, fragmentNumber, Prot, fragments )

// ******************************************************************
// to implement: this general formula:
'// Prot.NumberOfPeptidesInDigest = CalculateTotalNumberOfFragments( NumberOfCleavageSites, MaxNumberOfMissedCleavages )
’
'Prot.NumberOfPeptidesInDigest = CalculateTotalNumberOfFragments( fragments.Ubound, DigestSettings.MaxNumberMissedCleavages )

if DigestSettings.MaxNumberMissedCleavages > 0 then // so create peptides with one missed cleavage

for fragmentNumber as integer = 0 to fragments.Ubound - 1   // make peptides for ONE missed cleavage
  
  fragArray.Append fragments( fragmentNumber ) 
  fragArray.Append fragments( fragmentNumber  + 1)
  
  seq = join( fragArray, "" )
  
  ReDim fragArray(-1)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 1, fragmentNumber, Prot, fragments ) 
  
next

Prot.NumberOfPeptidesInDigest = 2* ( fragments.Ubound + 1 ) - 1  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 1 then // so create peptides with two missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 2   // make peptides for TWO missed cleavage
  
  fragArray.Append fragments( fragmentNumber )
  fragArray.Append fragments( fragmentNumber  + 1)
  fragArray.Append fragments( fragmentNumber  + 2)
  
  seq = join( fragArray, "" )
  
  ReDim fragArray(-1)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 2, fragmentNumber, Prot, fragments ) 
  
next

Prot.NumberOfPeptidesInDigest =  3 * ( fragments.Ubound + 1 ) - 3  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 2 then // so create peptides with three missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 3   // make peptides for TWO missed cleavage
  
  fragArray.Append fragments( fragmentNumber )
  fragArray.Append fragments( fragmentNumber  + 1)
  fragArray.Append fragments( fragmentNumber  + 2)
  fragArray.Append fragments( fragmentNumber  + 3)
  
  seq = join( fragArray, "" )
  
  ReDim fragArray(-1)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 3, fragmentNumber, Prot, fragments ) 
  
next

Prot.NumberOfPeptidesInDigest =  4 * ( fragments.Ubound + 1 ) - 6  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 3 then // so create peptides with four missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 4   // make peptides for TWO missed cleavage
  
  fragArray.Append fragments( fragmentNumber )
  fragArray.Append fragments( fragmentNumber  + 1)
  fragArray.Append fragments( fragmentNumber  + 2)
  fragArray.Append fragments( fragmentNumber  + 3)
  fragArray.Append fragments( fragmentNumber  + 4)
  
  seq = join( fragArray, "" )
  
  ReDim fragArray(-1)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 4, fragmentNumber, Prot, fragments ) 
  
next

Prot.NumberOfPeptidesInDigest =  5 * ( fragments.Ubound + 1 ) - 10  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 4 then // 5 is the max so create peptides with five missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 5   // make peptides for TWO missed cleavage
  
  fragArray.Append fragments( fragmentNumber )
  fragArray.Append fragments( fragmentNumber  + 1)
  fragArray.Append fragments( fragmentNumber  + 2)
  fragArray.Append fragments( fragmentNumber  + 3)
  fragArray.Append fragments( fragmentNumber  + 4)
  fragArray.Append fragments( fragmentNumber  + 5)
  
  seq = join( fragArray, "" )
  
  ReDim fragArray(-1)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 5, fragmentNumber, Prot, fragments ) 
  
next

Prot.NumberOfPeptidesInDigest =  6 * ( fragments.Ubound + 1 ) - 15  // number of fragments for digest

end if[/code]

doesn’t make much difference.

Btw I found that string concatenation in Xojo is about twice as fast as it was in REAL.studio.

Antonio_Rinaldi · October 10, 2014, 3:47pm

Have you tried to profile your code?
how long it takes every CreatePeptidesFromFragmentsAndAddToPeptideList call?

Harrie_Westphal · October 10, 2014, 3:50pm

Markus, have you ever thought of eliminating the huge number of fragments.Ubound occurrences in that code by assigning that value to a property like numberOfFragments. Each of those references requires the invocation of a function and I am assuming that it is being called an awful lot of times. I know that you rebuild the fragments array in a couple of places in this code; but, if it were me, I would follow each of those with numberOfFramgments - fragments.Ubound and then use the numberOfFragments property, eliminating a huge number of function invocations.

Harrie_Westphal · October 10, 2014, 3:51pm

That was weird…that it double posted that reply.

Ulrich_Bogun · October 10, 2014, 4:15pm

Extending Harries proposal, I guess it could help too if you would nest your checks for DigestSettings.MaxNumberMissedCleavages. If the result isnt > 0, theres no need to check if its >1, >2 and so on or is that a dynamic variable that is influenced by what you do in the if end brackets?

Kem_Tekinay · October 10, 2014, 8:50pm

Markus, I didn’t take time to see how I would redo it, but I’d expect this code to roughly the same speed or even slower than your original. The speed benefit of Join comes into play when you can use it to avoid a large number of concatenations, but you are using it to make small concatenations that also have to manage the array size.

Can you work strictly in arrays? In other words, refactor your methods to take and return arrays instead of strings?

Jeff_Tullin · October 11, 2014, 6:53am

I haven’t spent much time reading the whole code, but it looks like you do the split so you have an array of fragments.
Then you loop through the array once
Then you loop again
Then you loop again
and so on.

I may have missed something, but while the conversation has focussed on the method of concatenation, Im wondering if at least part of the problem is the number of times you traverse the array of fragments

Why not loop once only, and assemble the two, three, four and five fragment sequences inside one loop, where the first fragment is ‘the current one’

(I’d pad the end with 5 dummies so that it avoids checking if you are near the end in every loop)

eg in a sequence of A B C D E F G X X X X X

You loop through from a to G like this:

[code]
for x = 1 to NumberOfFragments
twos= frag(x) + frag(x+1)
threes = twos + frag(x+2)
fours = threes + frag(x+3)

twosarray.append twos
threesarray.append threes
foursarray.append fours

next[/code]
And inside the loop, the first pass gets

AB
ABC
ABCD

The second pass gets
BC
BCD
BCDE

and so on.

You can use join at the end to glue them all together…

Antonio_Rinaldi · October 11, 2014, 8:23am

Without knowing anything about what is executed in CreatePeptidesFromFragmentsAndAddToPeptideList
I did a comparison commenting the CreatePeptidesFromFragmentsAndAddToPeptideList call, and with only minor changes just to make your code testable
The result is that the array version is nearly three times faster than the concatenation version.
Given that you’ve found an improvement of about 10%, it is obvious that you are optimizing the wrong function.

It is CreatePeptidesFromFragmentsAndAddToPeptideList that slows down the execution.

Eric_Gibbon · October 11, 2014, 11:17am

Another little performance tip is to store the maximum value for a loop in a local variable. This helps as the maximum number is calculated only once and does not have to be recalculated each time through the loop. e.g.

Dim n as Integer = fragments.Ubound

for fragmentNumber as integer = 0 to n // create peptides with no missed dcleavage

seq = fragments( fragmentNumber )

CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 0, fragmentNumber, Prot, fragments )

Kem_Tekinay · October 11, 2014, 2:15pm

Another idea: Instead of arrays, use MemoryBlocks and peek/poke directly into them. I don’t know if that concept will work for you but I’ve used it with great success in speeding up string manipulation.

Naturally it’s only an option when dealing with single byte encodings (unless you want to deal with Unicode encodings yourself).

KarenA · October 11, 2014, 2:27pm

[quote=134756:@Kem Tekinay]
Naturally it’s only an option when dealing with single byte encodings (unless you want to deal with Unicode encodings yourself).[/quote]

I think the files he is parsing for the data should only contain ASCII characters so are probably in ASCII encoding. as well … I have edited them in simple text editors but I have never written code to process them.

see:
http://en.wikipedia.org/wiki/FASTA_format

J_Andrew_Lipscomb · October 11, 2014, 6:48pm

I was thinking that myself. Also, MemoryBlocks really ought to have .Peek and .Poke methods for history’s sake…

Markus_Winter · October 13, 2014, 8:17am

Thanks, done.

Thanks.

Nice! Have to try this.

Thanks all. Will report back how I get on

Markus_Winter · October 13, 2014, 8:20am

[quote=134756:@Kem Tekinay]Another idea: Instead of arrays, use MemoryBlocks and peek/poke directly into them. I don’t know if that concept will work for you but I’ve used it with great success in speeding up string manipulation.

Naturally it’s only an option when dealing with single byte encodings (unless you want to deal with Unicode encodings yourself).[/quote]
I was thinking about memory blocks

Yes, only 20 amino acid characters, but I think encoding is UTF8 (Xojo’s internal encoding).