Fastest way to split & concatenate strings?

Hi,

I have a part of my app where long strings are split (using RepalceAllB) according to user set rules (e.g. after each R)

Then the strings are concatenated (again according to user set parameters) and added to a list for further processing, e.g.

ABCDEFG -> A B C D E F G -> A B C D E F G AB BC CD DE EF FG ABC BCD CDE DEF EFG ABCD BCDE CDEF DEFG

Each string is (on average) processed in 0.1 sec which sounds fine until you try a real world example and you need to process 80,000 strings. Then it takes over 2 hours.

Not good.

What would be the fastest way to do something like this?

TiA

Markus

P.S. tips involving MBS etc are fine :wink:

Assuming ASCII or UTF8:

  1. DigestionArray() = ProteinString.SplitB(AminoAcidLetter)

  2. Create the desired Array of Peptides then
    ProteinSequenceString = Join(PeptideArray,"")

I don’t think you can get faster than that with pure Xojo code.

BTW If you want MBS solutions you really should mention platform as much of MBS is platform specific.

Thanks Karen. I need it for MacOS X and Windows.

Going from seq = frag1 + frag2 + frag3

to using a frag() and doing a joint speeds it up slightly:

This is with 486 “protein” strings:
Uniprot_MYCGE_20120626
ImportFASTAfile: 0,003 Seconds
Process_FASTAfile: 1,058 Seconds
Process_ProteinList: 19,872 Seconds <- using seq = frag1 + frag2
Update_OccurencesInDigest_in_PeptideTable: 0,726 Seconds
Update_ProteinProperties_in_ProteinTable: 0,240 Seconds
Deal with Error or commit: 0,000 Seconds
DisplayNewDatabaseRecords: 2,844 Seconds
DisplayTableStructure: 0,002 Seconds
LoadAndDisplaySettings: 0,151 Seconds

Uniprot_MYCGE_20120626
ImportFASTAfile: 0,003 Seconds
Process_FASTAfile: 0,685 Seconds
Process_ProteinList: 17,967 Seconds <- using join
Update_OccurencesInDigest_in_PeptideTable: 0,698 Seconds
Update_ProteinProperties_in_ProteinTable: 0,236 Seconds
Deal with Error or commit: 0,000 Seconds
DisplayNewDatabaseRecords: 2,844 Seconds
DisplayTableStructure: 0,002 Seconds
LoadAndDisplaySettings: 0,152 Seconds

[quote=134618:@Markus Winter]Process_ProteinList: 17,967 Seconds <- using join
[/quote]

How is your RAM usage… Is paging a bottle neck?

If you are CPU bound have you thought about using helper apps to do that processing so you can use multiple cores simultaneously?

  • Karen

Not yet as when I run it I make sure I have about 4 GB of RAM free.

It works with my Mycoplasma proteomes however when going to the human proteome 32bit is more of a problem as RAM usage is then going up to 3 GB (using an in-memory database for speed) …

Yes. Looking into Thomas Tempelman’s MP template for that, but want to optimise the code before I do. Looking into the MBS goodies at the moment …

Your doing something wrong
join is really much more faster than simple concatenation

[quote=134629:@Antonio Rinaldi]Your doing something wrong
join is really much more faster than simple concatenation[/quote]
:slight_smile:

Well, I’m always open for suggestions but going from this

[code] // if a seq end in “R” or “K” then it is replaced with “R_”, then split, so the last array entry would be empty
// seq = “MLK_DEHTR_WASR_QLCK_ILTVADESSAKPDK_ASSR_”

fragments = split( seq, “" ) // return array
if fragments( fragments.Ubound ) = "
” then fragments.Remove( fragments.Ubound )

// so now we have all the fragments like in a complete digest but nicely lined up one after the other in the fragments array

Prot.NumberOfPeptidesInCompleteDigest = fragments.Ubound + 1 // number of fragments for complete digest

// now turn the fragments into peptides with 0, 1, or 2 missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound // create peptides with no missed dcleavage

seq = fragments( fragmentNumber )

CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 0, fragmentNumber, Prot, fragments )

next

Prot.NumberOfPeptidesInDigest = fragments.Ubound + 1 // number of fragments for digest

// ******************************************************************
// to implement: this general formula:
'// Prot.NumberOfPeptidesInDigest = CalculateTotalNumberOfFragments( NumberOfCleavageSites, MaxNumberOfMissedCleavages )

'Prot.NumberOfPeptidesInDigest = CalculateTotalNumberOfFragments( fragments.Ubound, DigestSettings.MaxNumberMissedCleavages )

if DigestSettings.MaxNumberMissedCleavages > 0 then // so create peptides with one missed cleavage

for fragmentNumber as integer = 0 to fragments.Ubound - 1   // make peptides for ONE missed cleavage
  
  seq = fragments( fragmentNumber ) + fragments( fragmentNumber + 1)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 1, fragmentNumber, Prot, fragments )
  
next

Prot.NumberOfPeptidesInDigest = 2* ( fragments.Ubound + 1 ) - 1  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 1 then // so create peptides with two missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 2   // make peptides for TWO missed cleavage
  
  seq = fragments( fragmentNumber ) + fragments( fragmentNumber + 1)  + fragments( fragmentNumber + 2)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 2, fragmentNumber, Prot, fragments )
  
next

Prot.NumberOfPeptidesInDigest =  3 * ( fragments.Ubound + 1 ) - 3  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 2 then // so create peptides with three missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 3   // make peptides for TWO missed cleavage
  
  seq = fragments( fragmentNumber ) + fragments( fragmentNumber + 1)  + fragments( fragmentNumber + 2) + fragments( fragmentNumber + 3)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 3, fragmentNumber, Prot, fragments )
  
next

Prot.NumberOfPeptidesInDigest =  4 * ( fragments.Ubound + 1 ) - 6  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 3 then // so create peptides with four missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 4   // make peptides for TWO missed cleavage
  
  seq = fragments( fragmentNumber ) + fragments( fragmentNumber + 1)  + fragments( fragmentNumber + 2) + fragments( fragmentNumber + 3) + fragments( fragmentNumber + 4)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 4, fragmentNumber, Prot, fragments )
  
next

Prot.NumberOfPeptidesInDigest =  5 * ( fragments.Ubound + 1 ) - 10  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 4 then // 5 is the max so create peptides with five missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 5   // make peptides for TWO missed cleavage
  
  seq = fragments( fragmentNumber ) + fragments( fragmentNumber + 1)  + fragments( fragmentNumber + 2) + fragments( fragmentNumber + 3) + fragments( fragmentNumber + 4) + fragments( fragmentNumber + 5)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 5, fragmentNumber, Prot, fragments )
  
next

Prot.NumberOfPeptidesInDigest =  6 * ( fragments.Ubound + 1 ) - 15  // number of fragments for digest

end if[/code]

to this

[code] // if a seq end in “R” or “K” then it is replaced with “R_”, then split, so the last array entry would be empty
// seq = “MLK_DEHTR_WASR_QLCK_ILTVADESSAKPDK_ASSR_”

fragments = split( seq, “" ) // return array
if fragments( fragments.Ubound ) = "
” then fragments.Remove( fragments.Ubound )

// so now we have all the fragments like in a complete digest but nicely lined up one after the other in the fragments array

Prot.NumberOfPeptidesInCompleteDigest = fragments.Ubound + 1 // number of fragments for complete digest

// now turn the fragments into peptides with 0, 1, or 2 missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound // create peptides with no missed dcleavage

seq = fragments( fragmentNumber )   

CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 0, fragmentNumber, Prot, fragments )

next

Prot.NumberOfPeptidesInDigest = fragments.Ubound + 1 // number of fragments for digest

// ******************************************************************
// to implement: this general formula:
'// Prot.NumberOfPeptidesInDigest = CalculateTotalNumberOfFragments( NumberOfCleavageSites, MaxNumberOfMissedCleavages )

'Prot.NumberOfPeptidesInDigest = CalculateTotalNumberOfFragments( fragments.Ubound, DigestSettings.MaxNumberMissedCleavages )

if DigestSettings.MaxNumberMissedCleavages > 0 then // so create peptides with one missed cleavage

for fragmentNumber as integer = 0 to fragments.Ubound - 1   // make peptides for ONE missed cleavage
  
  fragArray.Append fragments( fragmentNumber ) 
  fragArray.Append fragments( fragmentNumber  + 1)
  
  seq = join( fragArray, "" )
  
  ReDim fragArray(-1)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 1, fragmentNumber, Prot, fragments ) 
  
next

Prot.NumberOfPeptidesInDigest = 2* ( fragments.Ubound + 1 ) - 1  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 1 then // so create peptides with two missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 2   // make peptides for TWO missed cleavage
  
  fragArray.Append fragments( fragmentNumber )
  fragArray.Append fragments( fragmentNumber  + 1)
  fragArray.Append fragments( fragmentNumber  + 2)
  
  seq = join( fragArray, "" )
  
  ReDim fragArray(-1)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 2, fragmentNumber, Prot, fragments ) 
  
next

Prot.NumberOfPeptidesInDigest =  3 * ( fragments.Ubound + 1 ) - 3  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 2 then // so create peptides with three missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 3   // make peptides for TWO missed cleavage
  
  fragArray.Append fragments( fragmentNumber )
  fragArray.Append fragments( fragmentNumber  + 1)
  fragArray.Append fragments( fragmentNumber  + 2)
  fragArray.Append fragments( fragmentNumber  + 3)
  
  seq = join( fragArray, "" )
  
  ReDim fragArray(-1)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 3, fragmentNumber, Prot, fragments ) 
  
next

Prot.NumberOfPeptidesInDigest =  4 * ( fragments.Ubound + 1 ) - 6  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 3 then // so create peptides with four missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 4   // make peptides for TWO missed cleavage
  
  fragArray.Append fragments( fragmentNumber )
  fragArray.Append fragments( fragmentNumber  + 1)
  fragArray.Append fragments( fragmentNumber  + 2)
  fragArray.Append fragments( fragmentNumber  + 3)
  fragArray.Append fragments( fragmentNumber  + 4)
  
  seq = join( fragArray, "" )
  
  ReDim fragArray(-1)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 4, fragmentNumber, Prot, fragments ) 
  
next

Prot.NumberOfPeptidesInDigest =  5 * ( fragments.Ubound + 1 ) - 10  // number of fragments for digest

end if

if DigestSettings.MaxNumberMissedCleavages > 4 then // 5 is the max so create peptides with five missed cleavages

for fragmentNumber as integer = 0 to fragments.Ubound - 5   // make peptides for TWO missed cleavage
  
  fragArray.Append fragments( fragmentNumber )
  fragArray.Append fragments( fragmentNumber  + 1)
  fragArray.Append fragments( fragmentNumber  + 2)
  fragArray.Append fragments( fragmentNumber  + 3)
  fragArray.Append fragments( fragmentNumber  + 4)
  fragArray.Append fragments( fragmentNumber  + 5)
  
  seq = join( fragArray, "" )
  
  ReDim fragArray(-1)
  
  CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 5, fragmentNumber, Prot, fragments ) 
  
next

Prot.NumberOfPeptidesInDigest =  6 * ( fragments.Ubound + 1 ) - 15  // number of fragments for digest

end if[/code]

doesn’t make much difference.

Btw I found that string concatenation in Xojo is about twice as fast as it was in REAL.studio.

Have you tried to profile your code?
how long it takes every CreatePeptidesFromFragmentsAndAddToPeptideList call?

Markus, have you ever thought of eliminating the huge number of fragments.Ubound occurrences in that code by assigning that value to a property like numberOfFragments. Each of those references requires the invocation of a function and I am assuming that it is being called an awful lot of times. I know that you rebuild the fragments array in a couple of places in this code; but, if it were me, I would follow each of those with numberOfFramgments - fragments.Ubound and then use the numberOfFragments property, eliminating a huge number of function invocations.

That was weird…that it double posted that reply.

Extending Harrie‘s proposal, I guess it could help too if you would nest your checks for DigestSettings.MaxNumberMissedCleavages. If the result isn‘t > 0, there‘s no need to check if it‘s >1, >2 and so on – or is that a dynamic variable that is influenced by what you do in the if … end brackets?

Markus, I didn’t take time to see how I would redo it, but I’d expect this code to roughly the same speed or even slower than your original. The speed benefit of Join comes into play when you can use it to avoid a large number of concatenations, but you are using it to make small concatenations that also have to manage the array size.

Can you work strictly in arrays? In other words, refactor your methods to take and return arrays instead of strings?

I haven’t spent much time reading the whole code, but it looks like you do the split so you have an array of fragments.
Then you loop through the array once
Then you loop again
Then you loop again
and so on.

I may have missed something, but while the conversation has focussed on the method of concatenation, Im wondering if at least part of the problem is the number of times you traverse the array of fragments

Why not loop once only, and assemble the two, three, four and five fragment sequences inside one loop, where the first fragment is ‘the current one’

(I’d pad the end with 5 dummies so that it avoids checking if you are near the end in every loop)

eg in a sequence of A B C D E F G X X X X X

You loop through from a to G like this:

[code]
for x = 1 to NumberOfFragments
twos= frag(x) + frag(x+1)
threes = twos + frag(x+2)
fours = threes + frag(x+3)

twosarray.append twos
threesarray.append threes
foursarray.append fours

next[/code]
And inside the loop, the first pass gets

AB
ABC
ABCD

The second pass gets
BC
BCD
BCDE

and so on.

You can use join at the end to glue them all together…

Without knowing anything about what is executed in CreatePeptidesFromFragmentsAndAddToPeptideList
I did a comparison commenting the CreatePeptidesFromFragmentsAndAddToPeptideList call, and with only minor changes just to make your code testable
The result is that the array version is nearly three times faster than the concatenation version.
Given that you’ve found an improvement of about 10%, it is obvious that you are optimizing the wrong function.

It is CreatePeptidesFromFragmentsAndAddToPeptideList that slows down the execution.

Another little performance tip is to store the maximum value for a loop in a local variable. This helps as the maximum number is calculated only once and does not have to be recalculated each time through the loop. e.g.

Dim n as Integer = fragments.Ubound

for fragmentNumber as integer = 0 to n // create peptides with no missed dcleavage

seq = fragments( fragmentNumber )

CreatePeptidesFromFragmentsAndAddToPeptideList( seq, 0, fragmentNumber, Prot, fragments )

next

Another idea: Instead of arrays, use MemoryBlocks and peek/poke directly into them. I don’t know if that concept will work for you but I’ve used it with great success in speeding up string manipulation.

Naturally it’s only an option when dealing with single byte encodings (unless you want to deal with Unicode encodings yourself).

[quote=134756:@Kem Tekinay]
Naturally it’s only an option when dealing with single byte encodings (unless you want to deal with Unicode encodings yourself).[/quote]

I think the files he is parsing for the data should only contain ASCII characters so are probably in ASCII encoding. as well … I have edited them in simple text editors but I have never written code to process them.

see:
http://en.wikipedia.org/wiki/FASTA_format

I was thinking that myself. Also, MemoryBlocks really ought to have .Peek and .Poke methods for history’s sake…

Thanks, done.

Thanks.

Nice! Have to try this.

Thanks all. Will report back how I get on :slight_smile:

[quote=134756:@Kem Tekinay]Another idea: Instead of arrays, use MemoryBlocks and peek/poke directly into them. I don’t know if that concept will work for you but I’ve used it with great success in speeding up string manipulation.

Naturally it’s only an option when dealing with single byte encodings (unless you want to deal with Unicode encodings yourself).[/quote]
I was thinking about memory blocks …

Yes, only 20 amino acid characters, but I think encoding is UTF8 (Xojo’s internal encoding).