Help optimize this code [formerly Why is Ptr...]

  2. Kem T

    Nov 25 Pre-Release Testers, Xojo Pro, XDC Speakers New York

    I'll go back and try it again, but I also have to account for big-endian vs. little-endian when using Ptrs so "a" might be the first byte or it might be the last.

    So far switching everything to UInteger has not shown a difference. I'm rather disappointed.

  3. Kem T

    Nov 25

    Just to complete that loop, I did a substitution like this and the time jumped from around 270 ms to over 300 ms.

    mb4Ptr.UInt32( 0 ) = xr
    a = mb4Ptr.Byte( aIndex )
    b = mb4Ptr.Byte( bIndex )
    c = mb4Ptr.Byte( cIndex )
    d = mb4Ptr.Byte( dIndex )
    'a = ( xr \ kShift3 ) 
    'b = ( xr \ kShift2 ) and kMask3
    'c = ( xr \ kShift1 ) and kMask3
    'd = xr and kMask3

    The mb4Ptr points to a MemoryBlock that is created before the loops and reused.

  4. Kem T

    Nov 26

    A couple of points I missed:

    @Philippe S About Ubound: In this case, there will only be one or two keys, so I judged the use of the array an acceptable compromise. But in general, I've found Ubound to be harmless to performance. Many years ago I did a test and found that, yes, storing the Ubound value in a variable and using that as the loop bound instead of calling Ubound each time does make a difference, but one that amounts to microseconds over *millions* of iterations.

    @Michael D I think you were referring to the Bitwise module and its functions, but those are function calls with their associated overhead. Doing the math directly by multiplying or dividing by powers of 2 is somewhat less readable, but does the same thing without the cost of a function call.

    @Tim H Your example used a MemoryBlock function, which is convenient since it will honor the LittleEndian property, but comes with the overhead of any function call.

  5. Kem T

    Nov 26

    BTW, I didn't know that shifting left or right was a matter of multiplying or dividing by a power of two until @Norman P mentioned it here some time ago. This forum is a wonderful resource and I'm always grateful for the tidbits I pick up from its members.

    To clarify, if you want to shift bits x places to the left or right, you would multiply or (integer) divide by 2^x.

    shifted = value * ( 2 ^ x ) // Shift left by x bits
    shifted = value \ ( 2 ^ x ) // Shift right by x bits
  6. Kem T

    Nov 26

    For those playing at home, this is the latest iteration. It has better named variables and uses byte indexes directly in the loops, but doesn't really move the performance needle. I'm afraid this might be the best I can do unless someone who can read the compiled code can tell me where I might eliminate unneeded instructions.

    Private Sub Expand0State(repetitions As Integer, ParamArray keys() As Xojo.Core.MutableMemoryBlock)
      #if not DebugBuild
        #pragma BackgroundTasks False
        #pragma BoundsChecking False
        #pragma NilObjectChecking False
        #pragma StackOverflowChecking False
      #endif
      
      const kZero as UInt32 = 0
      
      const kShift3 as UInt32 = 256 ^ 3
      const kShift2 as UInt32 = 256 ^ 2
      const kShift1 as UInt32 = 256 ^ 1
      
      const kMask0 as UInt32 = &hFF000000
      const kMask1 as UInt32 = &h00FF0000
      const kMask2 as UInt32 = &h0000FF00
      const kMask3 as UInt32 = &h000000FF
      
      const kSLastByte as integer = 4095
      const kPLastByte as integer = ( 18 * 4 ) - 1
      const kPLastInnerByte as integer = kPLastByte - 7
      
      dim myPPtr as ptr = PPtr
      dim mySPtr as ptr = SPtr
      dim isLittleEndian as boolean = self.IsLittleEndian
      
      //
      // Create the stream keys
      //
      const kStreamWordSize as integer = 8
      
      #if kDebug then
        dim startMs, elapsedMs as Double
        dim logPrefix as string = CurrentMethodName + ": "
      #endif
      
      #if kDebug then
        startMs = Microseconds
      #endif
      for keyIndex as integer = 0 to keys.Ubound
        dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
        dim keySize as integer = key.Size
        
        if keySize = 0 then
          RaiseErrorIf( true, kErrorKeyCannotBeEmpty )
        end if
        WasKeySet = true
        
        dim streamKey as Xojo.Core.MutableMemoryBlock
        dim streamKeySize as integer = keySize
        if ( keySize mod kStreamWordSize ) = 0 then
          streamKey = new Xojo.Core.MutableMemoryBlock( keySize )
          streamKey.Left( keySize ) = key
        else
          streamKeySize = streamKeySize * kStreamWordSize
          
          streamKey = new Xojo.Core.MutableMemoryBlock( streamKeySize )
          dim lastByte as integer = streamKeySize - 1
          for thisByte as integer = 0 to lastByte step keySize
            streamKey.Mid( thisByte, keySize ) = key
          next
        end if
        
        if isLittleEndian then
          //
          // Swap the bytes
          //
          dim streamKeyPtr as ptr = streamKey.Data
          dim swapIndex as integer
          while swapIndex < streamKeySize
            dim temp as UInt32 = streamKeyPtr.UInt32( swapIndex )
            streamKeyPtr.UInt32( swapIndex ) = _
            ( temp \ kShift3 ) or _
            ( ( temp and kMask1 ) \ kShift1 ) or _
            ( ( temp and kMask2 ) * kShift1 ) or _
            ( temp * kShift3 )
            
            swapIndex = swapIndex + 4
          wend
        end if
        
        keys( keyIndex ) = streamKey
      next
      
      #if kDebug then
        elapsedMs = Microseconds - startMs
        System.DebugLog logPrefix + "Stream keys took " + format( elapsedMs, "#,0.0##" ) + " µs"
      #endif
      
      //
      // Encoding starts here
      //
      
      dim a, b, c, d as integer // Used as indexes
      dim xl as UInt32 
      dim xr as UInt32 
      dim j1 as UInt32
      dim d0, d1 as UInt32
      dim barrier as integer
      dim pptrEncoderIndex as integer
      
      dim key as Xojo.Core.MutableMemoryBlock
      dim keyPtr as ptr
      dim keySize as integer
      
      dim streamIndex as integer
      dim sByteIndex, pByteIndex as integer
      dim keyIndex as integer
      
      for rep as integer = 1 to repetitions
        
        for keyIndex = 0 to keys.Ubound
          
          key = keys( keyIndex )
          keyPtr = key.Data
          keySize = key.Size
          
          barrier = keySize - kStreamWordSize
          streamIndex = 0
          for pByteIndex = 0 to kPLastByte step kStreamWordSize // Two words at a time
            'temp = Stream2Word( key, streamIndex, streamBuffer, streamBufferPtr )
            
            myPPtr.UInt64( pByteIndex ) = myPPtr.UInt64( pByteIndex ) xor keyPtr.UInt64( streamIndex )
            if streamIndex = barrier then
              streamIndex = 0
            else
              streamIndex = streamIndex + kStreamWordSize
            end if
          next pByteIndex
          
          d0 = kZero
          d1 = kZero
          
          for pByteIndex = 0 to kPLastByte step 8
            'self.Encipher( d0, d1 )
            
            xl = d0 xor myPPtr.UInt32( 0 )
            xr = d1
            
            for pptrEncoderIndex = 4 to kPLastInnerByte step 8
              a = ( xl \ kShift3 )
              b = ( xl \ kShift2 ) and kMask3
              c = ( xl \ kShift1 ) and kMask3
              d = xl and kMask3
              
              j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
              xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
              + mySPtr.UInt32( ( 768 + d ) * 4 )
              
              xr = xr xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
              
              a = ( xr \ kShift3 ) 
              b = ( xr \ kShift2 ) and kMask3
              c = ( xr \ kShift1 ) and kMask3
              d = xr and kMask3
              
              j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
              xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
              + mySPtr.UInt32( ( 768 + d ) * 4 )
              
              xl = xl xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex + 4 ) )
            next pptrEncoderIndex
            
            xr = xr xor myPPtr.UInt32( kPLastByte - 3 )
            
            d0 = xr
            d1 = xl
            
            
            myPPtr.UInt32( pByteIndex ) = d0
            myPPtr.UInt32( pByteIndex + 4 ) = d1
          next pByteIndex
          
          dim firstPPtr as UInt32 = myPPtr.UInt32( 0 )
          for sByteIndex = 0 to kSLastByte step 8
            'self.Encipher( d0, d1 )
            
            xl = d0 xor firstPPtr
            xr = d1
            
            for pptrEncoderIndex = 4 to kPLastInnerByte step 8
              a = ( xl \ kShift3 )
              b = ( xl \ kShift2 ) and kMask3
              c = ( xl \ kShift1 ) and kMask3
              d = xl and kMask3
              
              j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
              xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
              + mySPtr.UInt32( ( 768 + d ) * 4 )
              
              xr = xr xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
              
              a = ( xr \ kShift3 ) 
              b = ( xr \ kShift2 ) and kMask3
              c = ( xr \ kShift1 ) and kMask3
              d = xr and kMask3
              
              j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
              xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
              + mySPtr.UInt32( ( 768 + d ) * 4 )
              
              xl = xl xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex + 4 ) )
            next pptrEncoderIndex
            
            xr = xr xor myPPtr.UInt32( kPLastByte - 3 )
            
            d0 = xr
            d1 = xl
            
            
            mySPtr.UInt32( sByteIndex ) = d0
            mySPtr.UInt32( sByteIndex + 4 ) = d1
          next sByteIndex
          
        next keyIndex
      next rep
    End Sub
  7. Kem T

    Nov 27

    I think I've hit a wall here. I'm no assembly guru but when I disassemble my app, it looks like it's using instructions to right shift instead of division, but I'm sure there is an opportunity to optimize at that level. Hopefully one of the engineers is inspired to look and it leads to something, but I can't see anything else at the language level.

  8. Tim H

    Nov 27 Pre-Release Testers Portland, OR USA

    If it's right shifting, that's about as optimized as you can get. I'm impressed that it would do that.

  9. @Kem T

    Division is, from a hardware perspective, the worst-performing of the basic arithmetic operations, as it has to be implemented by a rather slow iterative algorithm, not unlike the long division taught in school.

    In some cases, such as division by positive integer powers of 2, a clever compiler can, as seems to have happened in your case, replace the divide instruction with a shift-right, which is much faster (implemented directly in hardware or microcode).

    (
    Although later implementations of the x86 processors still have to carry the overhead of instructions from eons ago, which limits the effectiveness of many frequently used instructions, shift instructions are still useful there.

    Unfortunately, Apple decided to abandon the PowerPC series of processors, which had an almost clean-sheet design without the baggage of Intel's past, such as the very complex protected-mode system (four rings and more).
    )

  10. Christian S

    Nov 29 Pre-Release Testers, Xojo Pro, XDC Speakers Germany

    I bet soon we get the first Intel CPUs with 64-bit only, so they can kill all the 32-bit legacy.

  11. @ChristianSchmitz

    That would be nice, but the overhead of the complex, layered protection model would still be there.

    Also, Intel CPUs have always been CISC, not RISC-like as the Motorola/IBM/Apple PowerPC was.

    It is not just 32-bit legacy, but even 16-bit legacy ...

  12. Greg O

    Nov 29 Xojo Inc

    Kem,

    You’re comparing Xojo to PHP, right? PHP does not have scoped variables. You might want to try declaring all of the variables up front for that first loop.

  13. Kem T

    Nov 29

    I'll do that and report back, thanks.

  14. Kem T

    Nov 29

    No difference. I even unrolled the encipher loop to emulate the C code I copied.

  15. Greg O

    Nov 29

    Another thing to consider is that the Xojo framework includes a bit of code for exception handling and variable allocation/deallocation/cleanup that other languages (like C) don't include automatically.

  16. Kem T

    Nov 29

    I don't understand assembly well enough to know if that's it, but the pseudo-code generated by Hopper when compared to the same function written in C seems a lot more verbose.

  17. Kem T

    Nov 30

    This is the pseudo-code from the C-compiled utility:

        var_14 = *(int32_t *)arg1;
        var_18 = *(int32_t *)(arg1 + 0x4);
        var_14 = *(int32_t *)(arg0 + 0x1000) ^ var_14;
        var_18 = (*(int32_t *)(arg0 + (var_14 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_14 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_14 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_14 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x1004) ^ var_18;
        var_14 = (*(int32_t *)(arg0 + (var_18 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_18 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_18 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_18 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x1008) ^ var_14;
        var_18 = (*(int32_t *)(arg0 + (var_14 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_14 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_14 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_14 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x100c) ^ var_18;
        var_14 = (*(int32_t *)(arg0 + (var_18 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_18 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_18 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_18 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x1010) ^ var_14;
        var_18 = (*(int32_t *)(arg0 + (var_14 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_14 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_14 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_14 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x1014) ^ var_18;
        var_14 = (*(int32_t *)(arg0 + (var_18 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_18 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_18 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_18 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x1018) ^ var_14;
        var_18 = (*(int32_t *)(arg0 + (var_14 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_14 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_14 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_14 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x101c) ^ var_18;
        var_14 = (*(int32_t *)(arg0 + (var_18 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_18 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_18 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_18 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x1020) ^ var_14;
        var_18 = (*(int32_t *)(arg0 + (var_14 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_14 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_14 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_14 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x1024) ^ var_18;
        var_14 = (*(int32_t *)(arg0 + (var_18 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_18 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_18 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_18 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x1028) ^ var_14;
        var_18 = (*(int32_t *)(arg0 + (var_14 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_14 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_14 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_14 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x102c) ^ var_18;
        var_14 = (*(int32_t *)(arg0 + (var_18 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_18 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_18 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_18 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x1030) ^ var_14;
        var_18 = (*(int32_t *)(arg0 + (var_14 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_14 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_14 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_14 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x1034) ^ var_18;
        var_14 = (*(int32_t *)(arg0 + (var_18 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_18 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_18 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_18 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x1038) ^ var_14;
        var_18 = (*(int32_t *)(arg0 + (var_14 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_14 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_14 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_14 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x103c) ^ var_18;
        var_14 = (*(int32_t *)(arg0 + (var_18 >> 0x18 & 0xff) * 0x4) + *(int32_t *)(arg0 + ((var_18 >> 0x10 & 0xff) + 0x100) * 0x4) ^ *(int32_t *)(arg0 + ((var_18 >> 0x8 & 0xff) + 0x200) * 0x4)) + *(int32_t *)(arg0 + ((var_18 & 0xff) + 0x300) * 0x4) ^ *(int32_t *)(arg0 + 0x1040) ^ var_14;
        *(int32_t *)arg1 = var_18 ^ *(int32_t *)(arg0 + 0x1044);
  18. Kem T

    Nov 30

    After manually unrolling the loop and rewriting the calculations to match what was done in C, this is how my app appeared:

                rcx = rax ^ *(int32_t *)r10;
                rbx = *(int32_t *)(r15 + (rcx >> 0xe & 0x3fc | 0x400));
                rbx = rbx + *(int32_t *)(r15 + (rcx >> 0x18) * 0x4);
                rbx = rbx ^ *(int32_t *)(r15 + (HIBYTE(rcx) & 0xff) * 0x4 + 0x800);
                rbx = rbx + *(int32_t *)(r15 + (rcx & 0xff) * 0x4 + 0xc00);
                rdx = rdx ^ *(int32_t *)(r10 + 0x4);
                rdi = *(int32_t *)(r15 + ((rdx ^ rbx) >> 0xe & 0x3fc | 0x400));
                rdi = rdi + *(int32_t *)(r15 + ((rdx ^ rbx) >> 0x18) * 0x4);
                rdi = rdi ^ *(int32_t *)(r15 + (HIBYTE(rdx ^ rbx) & 0xff) * 0x4 + 0x800);
                rdi = rdi + *(int32_t *)(r15 + ((rdx ^ rbx) & 0xff) * 0x4 + 0xc00);
                rcx = rcx ^ *(int32_t *)(r10 + 0x8);
                rsi = *(int32_t *)(r15 + ((rcx ^ rdi) >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + ((rcx ^ rdi) >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rcx ^ rdi) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + ((rcx ^ rdi) & 0xff) * 0x4 + 0xc00);
                rdx = rdx ^ rbx ^ *(int32_t *)(r10 + 0xc);
                rdx = rdx ^ rsi;
                rsi = *(int32_t *)(r15 + (rdx >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + (rdx >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rdx) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + (rdx & 0xff) * 0x4 + 0xc00);
                rcx = rcx ^ rdi ^ *(int32_t *)(r10 + 0x10);
                rcx = rcx ^ rsi;
                rsi = *(int32_t *)(r15 + (rcx >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + (rcx >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rcx) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + (rcx & 0xff) * 0x4 + 0xc00);
                rdx = rdx ^ *(int32_t *)(r10 + 0x14);
                rdx = rdx ^ rsi;
                rsi = *(int32_t *)(r15 + (rdx >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + (rdx >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rdx) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + (rdx & 0xff) * 0x4 + 0xc00);
                rcx = rcx ^ *(int32_t *)(r10 + 0x18);
                rcx = rcx ^ rsi;
                rsi = *(int32_t *)(r15 + (rcx >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + (rcx >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rcx) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + (rcx & 0xff) * 0x4 + 0xc00);
                rdx = rdx ^ *(int32_t *)(r10 + 0x1c);
                rdx = rdx ^ rsi;
                rsi = *(int32_t *)(r15 + (rdx >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + (rdx >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rdx) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + (rdx & 0xff) * 0x4 + 0xc00);
                rcx = rcx ^ *(int32_t *)(r10 + 0x20);
                rcx = rcx ^ rsi;
                rsi = *(int32_t *)(r15 + (rcx >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + (rcx >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rcx) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + (rcx & 0xff) * 0x4 + 0xc00);
                rdx = rdx ^ *(int32_t *)(r10 + 0x24);
                rdx = rdx ^ rsi;
                rsi = *(int32_t *)(r15 + (rdx >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + (rdx >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rdx) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + (rdx & 0xff) * 0x4 + 0xc00);
                rcx = rcx ^ *(int32_t *)(r10 + 0x28);
                rcx = rcx ^ rsi;
                rsi = *(int32_t *)(r15 + (rcx >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + (rcx >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rcx) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + (rcx & 0xff) * 0x4 + 0xc00);
                rdx = rdx ^ *(int32_t *)(r10 + 0x2c);
                rdx = rdx ^ rsi;
                rsi = *(int32_t *)(r15 + (rdx >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + (rdx >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rdx) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + (rdx & 0xff) * 0x4 + 0xc00);
                rcx = rcx ^ *(int32_t *)(r10 + 0x30);
                rcx = rcx ^ rsi;
                rsi = *(int32_t *)(r15 + (rcx >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + (rcx >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rcx) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + (rcx & 0xff) * 0x4 + 0xc00);
                rdx = rdx ^ *(int32_t *)(r10 + 0x34);
                rdx = rdx ^ rsi;
                rsi = *(int32_t *)(r15 + (rdx >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + (rdx >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rdx) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + (rdx & 0xff) * 0x4 + 0xc00);
                rcx = rcx ^ *(int32_t *)(r10 + 0x38);
                rcx = rcx ^ rsi;
                rsi = *(int32_t *)(r15 + (rcx >> 0xe & 0x3fc | 0x400));
                rsi = rsi + *(int32_t *)(r15 + (rcx >> 0x18) * 0x4);
                rsi = rsi ^ *(int32_t *)(r15 + (HIBYTE(rcx) & 0xff) * 0x4 + 0x800);
                rsi = rsi + *(int32_t *)(r15 + (rcx & 0xff) * 0x4 + 0xc00);
                rdx = rdx ^ *(int32_t *)(r10 + 0x3c);
                rbx = rcx ^ *(int32_t *)(r10 + 0x40) ^ (*(int32_t *)(r15 + ((rdx ^ rsi) >> 0xe & 0x3fc | 0x400)) + *(int32_t *)(r15 + ((rdx ^ rsi) >> 0x18) * 0x4) ^ *(int32_t *)(r15 + (HIBYTE(rdx ^ rsi) & 0xff) * 0x4 + 0x800)) + *(int32_t *)(r15 + ((rdx ^ rsi) & 0xff) * 0x4 + 0xc00);
                rcx = rdx ^ rsi ^ *(int32_t *)(r10 + 0x44);
  19. Kem T

    Nov 30

    This is what the Xojo code looks like with constants and variables typed as UInt32:

    temp = xl
    xl = xr xor myPPtr.UInt32( 0 )
    xr = temp
    
    xr = xr xor ( ( ( ( ( mySPtr.UInt32( ( xl \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xl \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xl \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xl and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 1 * 4 ) )
    xl = xl xor ( ( ( ( ( mySPtr.UInt32( ( xr \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xr \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xr \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xr and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 2 * 4 ) )
    xr = xr xor ( ( ( ( ( mySPtr.UInt32( ( xl \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xl \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xl \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xl and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 3 * 4 ) )
    xl = xl xor ( ( ( ( ( mySPtr.UInt32( ( xr \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xr \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xr \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xr and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 4 * 4 ) )
    xr = xr xor ( ( ( ( ( mySPtr.UInt32( ( xl \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xl \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xl \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xl and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 5 * 4 ) )
    xl = xl xor ( ( ( ( ( mySPtr.UInt32( ( xr \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xr \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xr \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xr and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 6 * 4 ) )
    xr = xr xor ( ( ( ( ( mySPtr.UInt32( ( xl \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xl \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xl \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xl and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 7 * 4 ) )
    xl = xl xor ( ( ( ( ( mySPtr.UInt32( ( xr \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xr \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xr \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xr and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 8 * 4 ) )
    xr = xr xor ( ( ( ( ( mySPtr.UInt32( ( xl \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xl \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xl \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xl and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 9 * 4 ) )
    xl = xl xor ( ( ( ( ( mySPtr.UInt32( ( xr \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xr \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xr \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xr and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 10 * 4 ) )
    xr = xr xor ( ( ( ( ( mySPtr.UInt32( ( xl \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xl \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xl \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xl and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 11 * 4 ) )
    xl = xl xor ( ( ( ( ( mySPtr.UInt32( ( xr \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xr \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xr \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xr and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 12 * 4 ) )
    xr = xr xor ( ( ( ( ( mySPtr.UInt32( ( xl \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xl \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xl \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xl and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 13 * 4 ) )
    xl = xl xor ( ( ( ( ( mySPtr.UInt32( ( xr \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xr \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xr \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xr and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 14 * 4 ) )
    xr = xr xor ( ( ( ( ( mySPtr.UInt32( ( xl \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xl \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xl \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xl and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 15 * 4 ) )
    xl = xl xor ( ( ( ( ( mySPtr.UInt32( ( xr \ kShift3 ) * CType( 4, UInt32 ) ) + _
        mySPtr.UInt32( ( ( xr \ kShift2 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 1024, UInt32 ) ) ) xor _
        mySPtr.UInt32( ( ( xr \ kShift1 ) and kMask3 ) * CType( 4, UInt32 ) + CType( 2048, UInt32 ) ) ) + _
        mySPtr.UInt32( ( xr and kMask3 ) * CType( 4, UInt32 ) + CType( 3072, UInt32 ) ) ) ) xor _
        myPPtr.UInt32( 16 * 4 ) )
    
    xr = xr xor myPPtr.UInt32( 17 * 4 )
  20. Kem T

    Nov 30

    I wonder if all the switching from UInt32 to Integer for the indexes is playing into this...

  21. Kem T

    Nov 30

    I think I'm going to have to bite the bullet and write a second 64-bit version of the code that uses UInt64 and Integer rather than UInt32. I'm just afraid I'll do all that work and it won't make a difference.
