Help optimize this code [formerly Why is Ptr...]

  1. 3 weeks ago

    Kem T

    Nov 21 Pre-Release Testers, Xojo Pro, XDC Speakers New York

    I am optimizing my Bcrypt code trying to get it closer in performance to, say. PHP, and have discovered something surprising. In the main time-sucking method, if I comment out the lines that write back to a MemoryBlock via Ptr.UInt32( index ), my test code runs (and fails) in about 18 ms. Once I put those lines back, it takes about 250 ms. Mind you, the MemoryBlock is 4,096 bytes and I have to cycle through every one of them several times, but the code reads from that MemoryBlock too (several times on each pass, in fact), and reading doesn't suffer the same performance issues as writing.

    Why, and how might I speed this up?

  2. Christian S

    Nov 22 Pre-Release Testers, Xojo Pro, XDC Speakers Germany

    memoryblock.StringValue() would be faster reading in a block.

    And #pragma disablebackgroundtasks would avoid the yield on each next.

  3. Kem T

    Nov 22 Pre-Release Testers, Xojo Pro, XDC Speakers New York

    Reading isn't the issue, and it's a new framework MemoryBlock anyway.

    All the relevant pragmas are in place. Without them, processing time jumps to over a second.

  4. Christian S

    Nov 22 Pre-Release Testers, Xojo Pro, XDC Speakers Germany

    it may be faster to grab the ptr and use UInt32 there.

  5. Christian S

    Nov 22 Pre-Release Testers, Xojo Pro, XDC Speakers Germany

    Looks like it's doing

    • NilObjectCheck
    • Calls Xojo.Core.MemoryBlock.UInt32 setter
    • this Calls MemoryBlock_Int32Value in framework
    • which does OutOfBounds checking
    • sets value

    That is a lot of overhead, if you can make sure the bounds are valid and ptr is not nil, using ptr should be really faster.

  6. Christian S

    Nov 22 Pre-Release Testers, Xojo Pro, XDC Speakers Germany

    With Ptr you get:

    0xee4c0 <+305>: movl %eax, (%ecx)
    0xee4c2 <+307>: movl $0x567, %ecx ; imm = 0x567
    0xee4c7 <+312>: movl -0x40(%ebp), %edx

    for

    p.UInt32(&h124) = &h567

    which is just three instructions!

  7. Kem T

    Nov 22 Pre-Release Testers, Xojo Pro, XDC Speakers New York

    But I am using a ptr, which was the point of the original question.

    Someone mentioned that perhaps commenting those lines is skewing my results if the compiler realizes the calculations aren't needed and eliminates the loop entirely. I'll test that in a bit.

  8. Christian S

    Nov 22 Pre-Release Testers, Xojo Pro, XDC Speakers Germany

    Can you post test code?

  9. Kem T

    Nov 22 Pre-Release Testers, Xojo Pro, XDC Speakers New York

    Yes, I'll do that once I'm back at my machine.

  10. Michael D

    Nov 22 Pre-Release Testers, Xojo Pro

    Beware that #pragma BackgroundTasks OFF is broken: Feedback Case #53396 - this confused me when doing some performance testing. #Pragma BackgroundTasks False works.

  11. Sascha S

    Nov 22 Pre-Release Testers, Xojo Pro Germany
    Edited 3 weeks ago

    @Michael D Beware that #pragma BackgroundTasks OFF is broken

    It's not broken. It's simply not supported, but will be maybe implemented one day. :)

    PS: I'd prefer to remove the support for ON/OFF with BreakOnExceptions. ;)

  12. Kem T

    Nov 22 Pre-Release Testers, Xojo Pro, XDC Speakers New York

    Well, this is embarrassing. It turns out that @Jeremy C was right, the compiler optimized away my loop when I commented out the part where the values are written back to the MemoryBlock, so it's not that the writes are particularly slow at all!

    This is the function if anyone wants to make suggestions. Note that this has been inlined intentionally to avoid the overhead of function calls.

    Private Sub Expand0State(repetitions As Integer, ParamArray keys() As Xojo.Core.MutableMemoryBlock)
      #if not DebugBuild
        #pragma BackgroundTasks False
        #pragma BoundsChecking False
        #pragma NilObjectChecking False
        #pragma StackOverflowChecking False
      #endif
      
      const kLastIndex as integer = BLF_N + 1 // = 17
      const kShift3 as UInt32 = 256 ^ 3
      const kShift2 as UInt32 = 256 ^ 2
      const kShift1 as UInt32 = 256 ^ 1
      
      const kMask0 as UInt32 = &hFF000000
      const kMask1 as UInt32 = &h00FF0000
      const kMask2 as UInt32 = &h0000FF00
      const kMask3 as UInt32 = &h000000FF
      
      dim myPPtr as ptr = PPtr
      dim mySPtr as ptr = SPtr
      
      //
      // Create the stream keys
      //
      for keyIndex as integer = 0 to keys.Ubound
        dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
        dim keySize as integer = key.Size
        
        if keySize = 0 then
          RaiseErrorIf( true, kErrorKeyCannotBeEmpty )
        end if
        WasKeySet = true
        
        dim streamKey as Xojo.Core.MutableMemoryBlock
        dim streamKeySize as integer = keySize
        if ( keySize mod 4 ) = 0 then
          streamKey = new Xojo.Core.MutableMemoryBlock( keySize )
          streamKey.Left( keySize ) = key
        else
          streamKeySize = streamKeySize * 4
          
          streamKey = new Xojo.Core.MutableMemoryBlock( streamKeySize )
          
          streamKey.Left( keySize ) = key
          streamKey.Mid( keySize, keySize ) = key
          streamKey.Mid( keySize + keySize, keySize ) = key
          streamKey.Right( keySize ) = key
        end if
        
        if IsLittleEndian then
          //
          // Swap the bytes
          //
          dim streamKeyPtr as ptr = streamKey.Data
          dim swapIndex as integer
          while swapIndex < streamKeySize
            dim temp as UInt32 = streamKeyPtr.UInt32( swapIndex )
            streamKeyPtr.UInt32( swapIndex ) = _
            ( temp \ kShift3 ) or _
            ( ( temp and kMask1 ) \ kShift1 ) or _
            ( ( temp and kMask2 ) * kShift1 ) or _
            ( temp * kShift3 )
            
            swapIndex = swapIndex + 4
          wend
        end if
        
        keys( keyIndex ) = streamKey
      next
      
      for rep as integer = 1 to repetitions
        
        for keyIndex as integer = 0 to keys.Ubound
          
          dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
          dim keyPtr as ptr = key.Data
          dim keySize as integer = key.Size
          
          dim j as integer
          dim i, k as integer, arrIndex as integer
          dim temp as UInt32
          dim d0, d1 as UInt32
          
          for i = 0 to kLastIndex
            'temp = Stream2Word( key, j, streamBuffer, streamBufferPtr )
            
            if j = keySize then
              j = 0
            end if
            
            temp = keyPtr.UInt32( j )
            j = j + 4
            
            
            arrIndex = i * 4
            myPPtr.UInt32( arrIndex ) = myPPtr.UInt32( arrIndex ) xor temp
          next i
          
          dim a, b, c, d as integer // Used as indexes
          dim inner as integer
          dim xl as UInt32 
          dim xr as UInt32 
          dim j1 as UInt32
          
          j = 0
          arrIndex = 0
          for i = 0 to kLastIndex step 2
            'self.Encipher( d0, d1 )
            
            xl = d0
            xr = d1
            
            xl = xl xor myPPtr.UInt32( 0 )
            
            for inner = 1 to 16 step 2
              j1 = xl
              
              a = ( j1 \ kShift3 )
              b = ( j1 \ kShift2 ) and kMask3
              c = ( j1 \ kShift1 ) and kMask3
              d = j1 and kMask3
              
              j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
              xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
              + mySPtr.UInt32( ( 768 + d ) * 4 )
              
              xr = xr xor ( j1 xor myPPtr.UInt32( inner * 4 ) )
              
              j1 = xr
              
              a = ( j1 \ kShift3 ) 
              b = ( j1 \ kShift2 ) and kMask3
              c = ( j1 \ kShift1 ) and kMask3
              d = j1 and kMask3
              
              j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
              xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
              + mySPtr.UInt32( ( 768 + d ) * 4 )
              
              xl = xl xor ( j1 xor myPPtr.UInt32( ( inner + 1 ) * 4 ) )
            next inner
            
            xr = xr xor myPPtr.UInt32( 17 * 4 )
            
            d0 = xr
            d1 = xl
            
            myPPtr.UInt32( arrIndex ) = d0
            arrIndex = arrIndex + 4
            myPPtr.UInt32( arrIndex ) = d1
            arrIndex = arrIndex + 4
          next i
          
          dim firstPPtr as UInt32 = myPPtr.UInt32( 0 )
          arrIndex = 0
          for i = 0 to 3
            for k = 0 to 255 step 2
              'self.Encipher( d0, d1 )
              
              xl = d0
              xr = d1
              
              xl = xl xor firstPPtr
              
              for inner = 1 to 16 step 2
                j1 = xl
                
                a = ( j1 \ kShift3 )
                b = ( j1 \ kShift2 ) and kMask3
                c = ( j1 \ kShift1 ) and kMask3
                d = j1 and kMask3
                
                j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
                xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
                + mySPtr.UInt32( ( 768 + d ) * 4 )
                
                xr = xr xor ( j1 xor myPPtr.UInt32( inner * 4 ) )
                
                j1 = xr
                
                a = ( j1 \ kShift3 ) 
                b = ( j1 \ kShift2 ) and kMask3
                c = ( j1 \ kShift1 ) and kMask3
                d = j1 and kMask3
                
                j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
                xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
                + mySPtr.UInt32( ( 768 + d ) * 4 )
                
                xl = xl xor ( j1 xor myPPtr.UInt32( ( inner + 1 ) * 4 ) )
              next inner
              
              xr = xr xor myPPtr.UInt32( 17 * 4 )
              
              d0 = xr
              d1 = xl
              
              
              mySPtr.UInt32( arrIndex ) = d0
              arrIndex = arrIndex + 4
              mySPtr.UInt32( arrIndex ) = d1
              arrIndex = arrIndex + 4
            next k
            
          next i
        next keyIndex
        
      next rep
    End Sub
  13. Kem T

    Nov 22 Pre-Release Testers, Xojo Pro, XDC Speakers New York

    With the same settings and aggressive compile, my Bcrypt code takes about 250 ms, whereas Bcrypt through PHP takes about 210 ms. I'm trying to find that other 40 ms. :)

  14. Kem T

    Nov 22 Pre-Release Testers, Xojo Pro, XDC Speakers New York

    If you want to see this in context, the code is in this branch:

    https://github.com/ktekinay/M_Crypto/tree/feature/Bcrypt_optimizations

  15. 2 weeks ago

    Greg O

    Nov 25 Xojo Inc
    Edited 2 weeks ago

    @Kem T With the same settings and aggressive compile, my Bcrypt code takes about 250 ms, whereas Bcrypt through PHP takes about 210 ms. I'm trying to find that other 40 ms. :)

    Just curious about your php config...

    Do you use Zend? Does PHP stay in memory all the time? When you checked the speed, had php cached or precompiled the script? Is PHP set up to use multiple cores?

    What I’m getting at is that you’re probably not comparing apples to apples here.

  16. Kem T

    Nov 25 Pre-Release Testers, Xojo Pro, XDC Speakers New York

    I'm running it through the command line with a simple php -r '<script>', but you bring up an interesting point about caching issues as I'm just looping over the same password 20 times and taking the average. I'll think about how to update that so it will have to calculate each time.

  17. Kem T

    Nov 25 Pre-Release Testers, Xojo Pro, XDC Speakers New York

    Welp, providing an array of 20 passwords certainly made PHP take a bit longer, but it also made my class take longer too, so ...

  18. Kem T

    Nov 25 Pre-Release Testers, Xojo Pro, XDC Speakers New York

    The method as it stands today:

    Private Sub Expand0State(repetitions As Integer, ParamArray keys() As Xojo.Core.MutableMemoryBlock)
      #if not DebugBuild
        #pragma BackgroundTasks False
        #pragma BoundsChecking False
        #pragma NilObjectChecking False
        #pragma StackOverflowChecking False
      #endif
      
      const kLastIndex as integer = BLF_N + 1
      
      const kShift3 as UInt32 = 256 ^ 3
      const kShift2 as UInt32 = 256 ^ 2
      const kShift1 as UInt32 = 256 ^ 1
      
      const kMask0 as UInt32 = &hFF000000
      const kMask1 as UInt32 = &h00FF0000
      const kMask2 as UInt32 = &h0000FF00
      const kMask3 as UInt32 = &h000000FF
      
      const kSLastByte as integer = 4095
      
      dim myPPtr as ptr = PPtr
      dim mySPtr as ptr = SPtr
      dim isLittleEndian as boolean = self.IsLittleEndian
      
      //
      // Create the stream keys
      //
      const kStreamWordSize as integer = 8
      
      #if kDebug then
        dim startMs, elapsedMs as Double
        dim logPrefix as string = CurrentMethodName + ": "
      #endif
      
      #if kDebug then
        startMs = Microseconds
      #endif
      for keyIndex as integer = 0 to keys.Ubound
        dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
        dim keySize as integer = key.Size
        
        if keySize = 0 then
          RaiseErrorIf( true, kErrorKeyCannotBeEmpty )
        end if
        WasKeySet = true
        
        dim streamKey as Xojo.Core.MutableMemoryBlock
        dim streamKeySize as integer = keySize
        if ( keySize mod kStreamWordSize ) = 0 then
          streamKey = new Xojo.Core.MutableMemoryBlock( keySize )
          streamKey.Left( keySize ) = key
        else
          streamKeySize = streamKeySize * kStreamWordSize
          
          streamKey = new Xojo.Core.MutableMemoryBlock( streamKeySize )
          dim lastByte as integer = streamKeySize - 1
          for thisByte as integer = 0 to lastByte step keySize
            streamKey.Mid( thisByte, keySize ) = key
          next
        end if
        
        if isLittleEndian then
          //
          // Swap the bytes
          //
          dim streamKeyPtr as ptr = streamKey.Data
          dim swapIndex as integer
          while swapIndex < streamKeySize
            dim temp as UInt32 = streamKeyPtr.UInt32( swapIndex )
            streamKeyPtr.UInt32( swapIndex ) = _
            ( temp \ kShift3 ) or _
            ( ( temp and kMask1 ) \ kShift1 ) or _
            ( ( temp and kMask2 ) * kShift1 ) or _
            ( temp * kShift3 )
            
            swapIndex = swapIndex + 4
          wend
        end if
        
        keys( keyIndex ) = streamKey
      next
      
      #if kDebug then
        elapsedMs = Microseconds - startMs
        System.DebugLog logPrefix + "Stream keys took " + format( elapsedMs, "#,0.0##" ) + " µs"
      #endif
      
      for rep as integer = 1 to repetitions
        
        for keyIndex as integer = 0 to keys.Ubound
          
          dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
          dim keyPtr as ptr = key.Data
          dim keySize as integer = key.Size
          
          dim j as integer
          dim i as integer, arrIndex as integer
          dim d0, d1 as UInt32
          
          arrIndex = 0
          dim barrier as integer = keySize - kStreamWordSize
          for i = 0 to kLastIndex step 2 // Two words at a time
            'temp = Stream2Word( key, j, streamBuffer, streamBufferPtr )
            
            myPPtr.UInt64( arrIndex ) = myPPtr.UInt64( arrIndex ) xor keyPtr.UInt64( j )
            if j = barrier then
              j = 0
            else
              j = j + kStreamWordSize
            end if
            
            arrIndex = arrIndex + kStreamWordSize
          next i
          
          dim a, b, c, d as integer // Used as indexes
          dim inner as integer
          dim xl as UInt32 
          dim xr as UInt32 
          dim j1 as UInt32
          dim pptrEncoderIndex as integer
          
          j = 0
          arrIndex = 0
          for i = 0 to kLastIndex step 2
            'self.Encipher( d0, d1 )
            
            xl = d0 xor myPPtr.UInt32( 0 )
            xr = d1
            
            pptrEncoderIndex = 4
            for inner = 1 to 16 step 2
              a = ( xl \ kShift3 )
              b = ( xl \ kShift2 ) and kMask3
              c = ( xl \ kShift1 ) and kMask3
              d = xl and kMask3
              
              j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
              xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
              + mySPtr.UInt32( ( 768 + d ) * 4 )
              
              xr = xr xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
              pptrEncoderIndex = pptrEncoderIndex + 4
              
              a = ( xr \ kShift3 ) 
              b = ( xr \ kShift2 ) and kMask3
              c = ( xr \ kShift1 ) and kMask3
              d = xr and kMask3
              
              j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
              xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
              + mySPtr.UInt32( ( 768 + d ) * 4 )
              
              xl = xl xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
              pptrEncoderIndex = pptrEncoderIndex + 4
            next inner
            
            xr = xr xor myPPtr.UInt32( pptrEncoderIndex )
            
            d0 = xr
            d1 = xl
            
            
            myPPtr.UInt32( arrIndex ) = d0
            arrIndex = arrIndex + 4
            myPPtr.UInt32( arrIndex ) = d1
            arrIndex = arrIndex + 4
          next i
          
          dim firstPPtr as UInt32 = myPPtr.UInt32( 0 )
          for arrIndex = 0 to kSLastByte step 8
            'self.Encipher( d0, d1 )
            
            xl = d0 xor firstPPtr
            xr = d1
            
            pptrEncoderIndex = 4
            for inner = 1 to 16 step 2
              a = ( xl \ kShift3 )
              b = ( xl \ kShift2 ) and kMask3
              c = ( xl \ kShift1 ) and kMask3
              d = xl and kMask3
              
              j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
              xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
              + mySPtr.UInt32( ( 768 + d ) * 4 )
              
              xr = xr xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
              pptrEncoderIndex = pptrEncoderIndex + 4
              
              a = ( xr \ kShift3 ) 
              b = ( xr \ kShift2 ) and kMask3
              c = ( xr \ kShift1 ) and kMask3
              d = xr and kMask3
              
              j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
              xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
              + mySPtr.UInt32( ( 768 + d ) * 4 )
              
              xl = xl xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
              pptrEncoderIndex = pptrEncoderIndex + 4
            next inner
            
            xr = xr xor myPPtr.UInt32( pptrEncoderIndex )
            
            d0 = xr
            d1 = xl
            
            
            mySPtr.UInt32( arrIndex ) = d0
            mySPtr.UInt32( arrIndex + 4 ) = d1
          next arrIndex
          
        next keyIndex
      next rep
    End Sub
  19. Kem T

    Nov 25 Pre-Release Testers, Xojo Pro, XDC Speakers New York

    My next thought is to convert everything to use UInt64 instead of UInt32. That would make some difference, no?

  20. Karen A

    Nov 25 Pre-Release Testers

    @Kem T My next thought is to convert everything to use UInt64 instead of UInt32. That would make some difference, no?

    Maybe... only testing would tell for sure.

    But even if there is speed increase on a 64 bit OS, is there a chance the code will need to run on a 32 bit OS? (maybe on Windows?) If so, do you care about that?

    - karen

  21. Newer ›

or Sign Up to reply!