Help optimize this code [formerly Why is Ptr...]

Kem_Tekinay · November 22, 2018, 6:41am

I am optimizing my Bcrypt code trying to get it closer in performance to, say. PHP, and have discovered something surprising. In the main time-sucking method, if I comment out the lines that write back to a MemoryBlock via Ptr.UInt32( index ), my test code runs (and fails) in about 18 ms. Once I put those lines back, it takes about 250 ms. Mind you, the MemoryBlock is 4,096 bytes and I have to cycle through every one of them several times, but the code reads from that MemoryBlock too (several times on each pass, in fact), and reading doesn’t suffer the same performance issues as writing.

Why, and how might I speed this up?

Christian_Schmitz · November 22, 2018, 8:20am

memoryblock.StringValue() would be faster reading in a block.

And #pragma disablebackgroundtasks would avoid the yield on each next.

Kem_Tekinay · November 22, 2018, 11:47am

Reading isn’t the issue, and it’s a new framework MemoryBlock anyway.

All the relevant pragmas are in place. Without them, processing time jumps to over a second.

Christian_Schmitz · November 22, 2018, 11:58am

it may be faster to grab the ptr and use UInt32 there.

Christian_Schmitz · November 22, 2018, 12:16pm

Looks like it’s doing

NilObjectCheck
Calls Xojo.Core.MemoryBlock.UInt32 setter
this Calls MemoryBlock_Int32Value in framework
which does OutOfBounds checking
sets value

That is a lot of overhead, if you can make sure the bounds are valid and ptr is not nil, using ptr should be really faster.

Christian_Schmitz · November 22, 2018, 12:16pm

With Ptr you get:

0xee4c0 <+305>: movl   %eax, (%ecx)
0xee4c2 <+307>: movl   $0x567, %ecx              ; imm = 0x567 
0xee4c7 <+312>: movl   -0x40(%ebp), %edx

for

p.UInt32(&h124) = &h567

which is just three instructions!

Kem_Tekinay · November 22, 2018, 12:20pm

But I am using a ptr, which was the point of the original question.

Someone mentioned that perhaps commenting those lines is skewing my results if the compiler realizes the calculations aren’t needed and eliminates the loop entirely. I’ll test that in a bit.

Christian_Schmitz · November 22, 2018, 12:23pm

Can you post test code?

Kem_Tekinay · November 22, 2018, 12:28pm

Yes, I’ll do that once I’m back at my machine.

Mike_D · November 22, 2018, 3:03pm

Beware that #pragma BackgroundTasks OFF is broken: <https://xojo.com/issue/53396> - this confused me when doing some performance testing. #Pragma BackgroundTasks False works.

Sascha_S · November 22, 2018, 3:45pm

It’s not broken. It’s simply not supported, but will be maybe implemented one day.

PS: I’d prefer to remove the support for ON/OFF with BreakOnExceptions.

Kem_Tekinay · November 22, 2018, 5:09pm

Well, this is embarrassing. It turns out that @Jeremy Cowgar was right, the compiler optimized away my loop when I commented out the part where the values are written back to the MemoryBlock, so it’s not that the writes are particularly slow at all!

This is the function if anyone wants to make suggestions. Note that this has been inlined intentionally to avoid the overhead of function calls.

Private Sub Expand0State(repetitions As Integer, ParamArray keys() As Xojo.Core.MutableMemoryBlock)
  #if not DebugBuild
    #pragma BackgroundTasks False
    #pragma BoundsChecking False
    #pragma NilObjectChecking False
    #pragma StackOverflowChecking False
  #endif
  
  const kLastIndex as integer = BLF_N + 1 // = 17
  const kShift3 as UInt32 = 256 ^ 3
  const kShift2 as UInt32 = 256 ^ 2
  const kShift1 as UInt32 = 256 ^ 1
  
  const kMask0 as UInt32 = &hFF000000
  const kMask1 as UInt32 = &h00FF0000
  const kMask2 as UInt32 = &h0000FF00
  const kMask3 as UInt32 = &h000000FF
  
  dim myPPtr as ptr = PPtr
  dim mySPtr as ptr = SPtr
  
  //
  // Create the stream keys
  //
  for keyIndex as integer = 0 to keys.Ubound
    dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
    dim keySize as integer = key.Size
    
    if keySize = 0 then
      RaiseErrorIf( true, kErrorKeyCannotBeEmpty )
    end if
    WasKeySet = true
    
    dim streamKey as Xojo.Core.MutableMemoryBlock
    dim streamKeySize as integer = keySize
    if ( keySize mod 4 ) = 0 then
      streamKey = new Xojo.Core.MutableMemoryBlock( keySize )
      streamKey.Left( keySize ) = key
    else
      streamKeySize = streamKeySize * 4
      
      streamKey = new Xojo.Core.MutableMemoryBlock( streamKeySize )
      
      streamKey.Left( keySize ) = key
      streamKey.Mid( keySize, keySize ) = key
      streamKey.Mid( keySize + keySize, keySize ) = key
      streamKey.Right( keySize ) = key
    end if
    
    if IsLittleEndian then
      //
      // Swap the bytes
      //
      dim streamKeyPtr as ptr = streamKey.Data
      dim swapIndex as integer
      while swapIndex < streamKeySize
        dim temp as UInt32 = streamKeyPtr.UInt32( swapIndex )
        streamKeyPtr.UInt32( swapIndex ) = _
        ( temp \\ kShift3 ) or _
        ( ( temp and kMask1 ) \\ kShift1 ) or _
        ( ( temp and kMask2 ) * kShift1 ) or _
        ( temp * kShift3 )
        
        swapIndex = swapIndex + 4
      wend
    end if
    
    keys( keyIndex ) = streamKey
  next
  
  for rep as integer = 1 to repetitions
    
    for keyIndex as integer = 0 to keys.Ubound
      
      dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
      dim keyPtr as ptr = key.Data
      dim keySize as integer = key.Size
      
      dim j as integer
      dim i, k as integer, arrIndex as integer
      dim temp as UInt32
      dim d0, d1 as UInt32
      
      for i = 0 to kLastIndex
        'temp = Stream2Word( key, j, streamBuffer, streamBufferPtr )
        
        if j = keySize then
          j = 0
        end if
        
        temp = keyPtr.UInt32( j )
        j = j + 4
        
        
        arrIndex = i * 4
        myPPtr.UInt32( arrIndex ) = myPPtr.UInt32( arrIndex ) xor temp
      next i
      
      dim a, b, c, d as integer // Used as indexes
      dim inner as integer
      dim xl as UInt32 
      dim xr as UInt32 
      dim j1 as UInt32
      
      j = 0
      arrIndex = 0
      for i = 0 to kLastIndex step 2
        'self.Encipher( d0, d1 )
        
        xl = d0
        xr = d1
        
        xl = xl xor myPPtr.UInt32( 0 )
        
        for inner = 1 to 16 step 2
          j1 = xl
          
          a = ( j1 \\ kShift3 )
          b = ( j1 \\ kShift2 ) and kMask3
          c = ( j1 \\ kShift1 ) and kMask3
          d = j1 and kMask3
          
          j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
          xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
          + mySPtr.UInt32( ( 768 + d ) * 4 )
          
          xr = xr xor ( j1 xor myPPtr.UInt32( inner * 4 ) )
          
          j1 = xr
          
          a = ( j1 \\ kShift3 ) 
          b = ( j1 \\ kShift2 ) and kMask3
          c = ( j1 \\ kShift1 ) and kMask3
          d = j1 and kMask3
          
          j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
          xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
          + mySPtr.UInt32( ( 768 + d ) * 4 )
          
          xl = xl xor ( j1 xor myPPtr.UInt32( ( inner + 1 ) * 4 ) )
        next inner
        
        xr = xr xor myPPtr.UInt32( 17 * 4 )
        
        d0 = xr
        d1 = xl
        
        myPPtr.UInt32( arrIndex ) = d0
        arrIndex = arrIndex + 4
        myPPtr.UInt32( arrIndex ) = d1
        arrIndex = arrIndex + 4
      next i
      
      dim firstPPtr as UInt32 = myPPtr.UInt32( 0 )
      arrIndex = 0
      for i = 0 to 3
        for k = 0 to 255 step 2
          'self.Encipher( d0, d1 )
          
          xl = d0
          xr = d1
          
          xl = xl xor firstPPtr
          
          for inner = 1 to 16 step 2
            j1 = xl
            
            a = ( j1 \\ kShift3 )
            b = ( j1 \\ kShift2 ) and kMask3
            c = ( j1 \\ kShift1 ) and kMask3
            d = j1 and kMask3
            
            j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
            xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
            + mySPtr.UInt32( ( 768 + d ) * 4 )
            
            xr = xr xor ( j1 xor myPPtr.UInt32( inner * 4 ) )
            
            j1 = xr
            
            a = ( j1 \\ kShift3 ) 
            b = ( j1 \\ kShift2 ) and kMask3
            c = ( j1 \\ kShift1 ) and kMask3
            d = j1 and kMask3
            
            j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
            xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
            + mySPtr.UInt32( ( 768 + d ) * 4 )
            
            xl = xl xor ( j1 xor myPPtr.UInt32( ( inner + 1 ) * 4 ) )
          next inner
          
          xr = xr xor myPPtr.UInt32( 17 * 4 )
          
          d0 = xr
          d1 = xl
          
          
          mySPtr.UInt32( arrIndex ) = d0
          arrIndex = arrIndex + 4
          mySPtr.UInt32( arrIndex ) = d1
          arrIndex = arrIndex + 4
        next k
        
      next i
    next keyIndex
    
  next rep
End Sub

Kem_Tekinay · November 22, 2018, 5:17pm

With the same settings and aggressive compile, my Bcrypt code takes about 250 ms, whereas Bcrypt through PHP takes about 210 ms. I’m trying to find that other 40 ms.

Kem_Tekinay · November 22, 2018, 5:32pm

If you want to see this in context, the code is in this branch:

https://github.com/ktekinay/M_Crypto/tree/feature/Bcrypt_optimizations

Greg_O_Lone · November 25, 2018, 12:36pm

Just curious about your php config…

Do you use Zend? Does PHP stay in memory all the time? When you checked the speed, had php cached or precompiled the script? Is PHP set up to use multiple cores?

What Im getting at is that youre probably not comparing apples to apples here.

Kem_Tekinay · November 25, 2018, 12:44pm

I’m running it through the command line with a simple php -r '<script>', but you bring up an interesting point about caching issues as I’m just looping over the same password 20 times and taking the average. I’ll think about how to update that so it will have to calculate each time.

Kem_Tekinay · November 25, 2018, 1:33pm

Welp, providing an array of 20 passwords certainly made PHP take a bit longer, but it also made my class take longer too, so …

Kem_Tekinay · November 25, 2018, 2:35pm

The method as it stands today:

Private Sub Expand0State(repetitions As Integer, ParamArray keys() As Xojo.Core.MutableMemoryBlock)
  #if not DebugBuild
    #pragma BackgroundTasks False
    #pragma BoundsChecking False
    #pragma NilObjectChecking False
    #pragma StackOverflowChecking False
  #endif
  
  const kLastIndex as integer = BLF_N + 1
  
  const kShift3 as UInt32 = 256 ^ 3
  const kShift2 as UInt32 = 256 ^ 2
  const kShift1 as UInt32 = 256 ^ 1
  
  const kMask0 as UInt32 = &hFF000000
  const kMask1 as UInt32 = &h00FF0000
  const kMask2 as UInt32 = &h0000FF00
  const kMask3 as UInt32 = &h000000FF
  
  const kSLastByte as integer = 4095
  
  dim myPPtr as ptr = PPtr
  dim mySPtr as ptr = SPtr
  dim isLittleEndian as boolean = self.IsLittleEndian
  
  //
  // Create the stream keys
  //
  const kStreamWordSize as integer = 8
  
  #if kDebug then
    dim startMs, elapsedMs as Double
    dim logPrefix as string = CurrentMethodName + ": "
  #endif
  
  #if kDebug then
    startMs = Microseconds
  #endif
  for keyIndex as integer = 0 to keys.Ubound
    dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
    dim keySize as integer = key.Size
    
    if keySize = 0 then
      RaiseErrorIf( true, kErrorKeyCannotBeEmpty )
    end if
    WasKeySet = true
    
    dim streamKey as Xojo.Core.MutableMemoryBlock
    dim streamKeySize as integer = keySize
    if ( keySize mod kStreamWordSize ) = 0 then
      streamKey = new Xojo.Core.MutableMemoryBlock( keySize )
      streamKey.Left( keySize ) = key
    else
      streamKeySize = streamKeySize * kStreamWordSize
      
      streamKey = new Xojo.Core.MutableMemoryBlock( streamKeySize )
      dim lastByte as integer = streamKeySize - 1
      for thisByte as integer = 0 to lastByte step keySize
        streamKey.Mid( thisByte, keySize ) = key
      next
    end if
    
    if isLittleEndian then
      //
      // Swap the bytes
      //
      dim streamKeyPtr as ptr = streamKey.Data
      dim swapIndex as integer
      while swapIndex < streamKeySize
        dim temp as UInt32 = streamKeyPtr.UInt32( swapIndex )
        streamKeyPtr.UInt32( swapIndex ) = _
        ( temp \\ kShift3 ) or _
        ( ( temp and kMask1 ) \\ kShift1 ) or _
        ( ( temp and kMask2 ) * kShift1 ) or _
        ( temp * kShift3 )
        
        swapIndex = swapIndex + 4
      wend
    end if
    
    keys( keyIndex ) = streamKey
  next
  
  #if kDebug then
    elapsedMs = Microseconds - startMs
    System.DebugLog logPrefix + "Stream keys took " + format( elapsedMs, "#,0.0##" ) + " µs"
  #endif
  
  for rep as integer = 1 to repetitions
    
    for keyIndex as integer = 0 to keys.Ubound
      
      dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
      dim keyPtr as ptr = key.Data
      dim keySize as integer = key.Size
      
      dim j as integer
      dim i as integer, arrIndex as integer
      dim d0, d1 as UInt32
      
      arrIndex = 0
      dim barrier as integer = keySize - kStreamWordSize
      for i = 0 to kLastIndex step 2 // Two words at a time
        'temp = Stream2Word( key, j, streamBuffer, streamBufferPtr )
        
        myPPtr.UInt64( arrIndex ) = myPPtr.UInt64( arrIndex ) xor keyPtr.UInt64( j )
        if j = barrier then
          j = 0
        else
          j = j + kStreamWordSize
        end if
        
        arrIndex = arrIndex + kStreamWordSize
      next i
      
      dim a, b, c, d as integer // Used as indexes
      dim inner as integer
      dim xl as UInt32 
      dim xr as UInt32 
      dim j1 as UInt32
      dim pptrEncoderIndex as integer
      
      j = 0
      arrIndex = 0
      for i = 0 to kLastIndex step 2
        'self.Encipher( d0, d1 )
        
        xl = d0 xor myPPtr.UInt32( 0 )
        xr = d1
        
        pptrEncoderIndex = 4
        for inner = 1 to 16 step 2
          a = ( xl \\ kShift3 )
          b = ( xl \\ kShift2 ) and kMask3
          c = ( xl \\ kShift1 ) and kMask3
          d = xl and kMask3
          
          j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
          xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
          + mySPtr.UInt32( ( 768 + d ) * 4 )
          
          xr = xr xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
          pptrEncoderIndex = pptrEncoderIndex + 4
          
          a = ( xr \\ kShift3 ) 
          b = ( xr \\ kShift2 ) and kMask3
          c = ( xr \\ kShift1 ) and kMask3
          d = xr and kMask3
          
          j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
          xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
          + mySPtr.UInt32( ( 768 + d ) * 4 )
          
          xl = xl xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
          pptrEncoderIndex = pptrEncoderIndex + 4
        next inner
        
        xr = xr xor myPPtr.UInt32( pptrEncoderIndex )
        
        d0 = xr
        d1 = xl
        
        
        myPPtr.UInt32( arrIndex ) = d0
        arrIndex = arrIndex + 4
        myPPtr.UInt32( arrIndex ) = d1
        arrIndex = arrIndex + 4
      next i
      
      dim firstPPtr as UInt32 = myPPtr.UInt32( 0 )
      for arrIndex = 0 to kSLastByte step 8
        'self.Encipher( d0, d1 )
        
        xl = d0 xor firstPPtr
        xr = d1
        
        pptrEncoderIndex = 4
        for inner = 1 to 16 step 2
          a = ( xl \\ kShift3 )
          b = ( xl \\ kShift2 ) and kMask3
          c = ( xl \\ kShift1 ) and kMask3
          d = xl and kMask3
          
          j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
          xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
          + mySPtr.UInt32( ( 768 + d ) * 4 )
          
          xr = xr xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
          pptrEncoderIndex = pptrEncoderIndex + 4
          
          a = ( xr \\ kShift3 ) 
          b = ( xr \\ kShift2 ) and kMask3
          c = ( xr \\ kShift1 ) and kMask3
          d = xr and kMask3
          
          j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
          xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
          + mySPtr.UInt32( ( 768 + d ) * 4 )
          
          xl = xl xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
          pptrEncoderIndex = pptrEncoderIndex + 4
        next inner
        
        xr = xr xor myPPtr.UInt32( pptrEncoderIndex )
        
        d0 = xr
        d1 = xl
        
        
        mySPtr.UInt32( arrIndex ) = d0
        mySPtr.UInt32( arrIndex + 4 ) = d1
      next arrIndex
      
    next keyIndex
  next rep
End Sub

Kem_Tekinay · November 25, 2018, 10:17pm

My next thought is to convert everything to use UInt64 instead of UInt32. That would make some difference, no?

KarenA · November 25, 2018, 10:57pm

Maybe… only testing would tell for sure.

But even if there is speed increase on a 64 bit OS, is there a chance the code will need to run on a 32 bit OS? (maybe on Windows?) If so, do you care about that?

karen