Help optimize this code [formerly Why is Ptr...]

Kem_Tekinay · November 25, 2018, 11:01pm

I will make sure it still runs in 32 bit, but that’s it. It might be slow as heck.

Philippe_Schmid · November 25, 2018, 11:12pm

May be remove the Ubound calls and do you need to reallocate new memoryblocks each time (streamkey) ?

Kem_Tekinay · November 25, 2018, 11:24pm

Yes, but that that l takes around a microsecond.

Tim_Hare · November 25, 2018, 11:52pm

Change

a = ( xl \\ kShift3 )
b = ( xl \\ kShift2 ) and kMask3
c = ( xl \\ kShift1 ) and kMask3
d = xl and kMask3

to

mb.UInt32Value(0)= xl     // earlier dim mb as new MemoryBlock(4)
a= mb.byte(0)
b= mb.byte(1)
c= mb.byte(2)
d= mb.byte(3)

Kem_Tekinay · November 26, 2018, 12:02am

I thought the same thing and that’s how it was originally, but splitting it this way is faster.

Mike_D · November 26, 2018, 12:11am

It was broken and still is broken - it was documented as working, and it compiles, but does nothing. They have changed the documentation, but it still compiles and does nothing.

I agree - in the sense that “True/False” is fine - we don’t need another way to express booleans using On/Off. Just adds confusion.

Mike_D · November 26, 2018, 12:13am

Nice job on the optimizations. Between these three options (accessing bytes from a memoryblock, dividing by 2**N bits, or shifting N bits), one would think that the bitShift would be the fastest, but this is likely not the case.

Tim_Hare · November 26, 2018, 6:18am

It must be a compiler difference, then. Using a memoryblock was twice as fast, but I wasn’t using the optimizing compiler.

Kem_Tekinay · November 26, 2018, 6:23am

I’ll go back and try it again, but I also have to account for big-endian vs. little-endian when using Ptrs so “a” might be the first byte or it might be the last.

So far switching everything to UInteger has not shown a difference. I’m rather disappointed.

Kem_Tekinay · November 26, 2018, 7:13am

Just to complete that loop, I did a substitution like this and the time jumped from around 270 ms to over 300 ms.

mb4Ptr.UInt32( 0 ) = xr
a = mb4Ptr.Byte( aIndex )
b = mb4Ptr.Byte( bIndex )
c = mb4Ptr.Byte( cIndex )
d = mb4Ptr.Byte( dIndex )
'a = ( xr \\ kShift3 ) 
'b = ( xr \\ kShift2 ) and kMask3
'c = ( xr \\ kShift1 ) and kMask3
'd = xr and kMask3

The mb4Ptr points to a MemoryBlock that is created before the loops and reused.

Kem_Tekinay · November 26, 2018, 3:14pm

A couple of points I missed:

@Philippe Schmid About Ubound: In this case, there will only be one or two keys, so I judged the use of the array an acceptable compromise. But in general, I’ve found Ubound to be harmless to performance. Many years ago I did a test and found that, yes, storing the Ubound value in a variable and using that in the loop over Ubound does make a difference but one that amounts to microseconds over millions of iterations.

@Michael Diehr I think you were referring to the Bitwise module and it’s functions, but those are function calls with their associated overhead. Doing the math directly by multiplying or dividing by powers of 2 is somewhat less readable, but does the same thing without the cost of calling a function.

@Tim Hare Your example used a MemoryBlock function, which is convenient since it will honor the LittleEndian property, but comes with the overhead of any function call.

Kem_Tekinay · November 26, 2018, 3:56pm

BTW, I didn’t know that shifting left or right was a matter of multiplying or dividing by a power of two until @Norman Palardy mentioned it here some time ago. This forum is a wonderful resource and I’m always grateful for the tidbits I pick up from its members.

To clarify, if you want to shift bits x places to the left or right, you would multiply or (integer) divide by 2^x.

shifted = value * ( 2 ^ x ) // Shift left by x bits
shifted = value \\ ( 2 ^ x ) // Shift right by x bits

Kem_Tekinay · November 26, 2018, 6:44pm

For those playing at home, this is the latest iteration. It has better named variables and uses byte indexes directly in the loops, but doesn’t really move the performance needle. I’m afraid this might be the best I can do unless someone who can read the compiled code can tell me where I might eliminate unneeded instructions.

Private Sub Expand0State(repetitions As Integer, ParamArray keys() As Xojo.Core.MutableMemoryBlock)
  #if not DebugBuild
    #pragma BackgroundTasks False
    #pragma BoundsChecking False
    #pragma NilObjectChecking False
    #pragma StackOverflowChecking False
  #endif
  
  const kZero as UInt32 = 0
  
  const kShift3 as UInt32 = 256 ^ 3
  const kShift2 as UInt32 = 256 ^ 2
  const kShift1 as UInt32 = 256 ^ 1
  
  const kMask0 as UInt32 = &hFF000000
  const kMask1 as UInt32 = &h00FF0000
  const kMask2 as UInt32 = &h0000FF00
  const kMask3 as UInt32 = &h000000FF
  
  const kSLastByte as integer = 4095
  const kPLastByte as integer = ( 18 * 4 ) - 1
  const kPLastInnerByte as integer = kPLastByte - 7
  
  dim myPPtr as ptr = PPtr
  dim mySPtr as ptr = SPtr
  dim isLittleEndian as boolean = self.IsLittleEndian
  
  //
  // Create the stream keys
  //
  const kStreamWordSize as integer = 8
  
  #if kDebug then
    dim startMs, elapsedMs as Double
    dim logPrefix as string = CurrentMethodName + ": "
  #endif
  
  #if kDebug then
    startMs = Microseconds
  #endif
  for keyIndex as integer = 0 to keys.Ubound
    dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
    dim keySize as integer = key.Size
    
    if keySize = 0 then
      RaiseErrorIf( true, kErrorKeyCannotBeEmpty )
    end if
    WasKeySet = true
    
    dim streamKey as Xojo.Core.MutableMemoryBlock
    dim streamKeySize as integer = keySize
    if ( keySize mod kStreamWordSize ) = 0 then
      streamKey = new Xojo.Core.MutableMemoryBlock( keySize )
      streamKey.Left( keySize ) = key
    else
      streamKeySize = streamKeySize * kStreamWordSize
      
      streamKey = new Xojo.Core.MutableMemoryBlock( streamKeySize )
      dim lastByte as integer = streamKeySize - 1
      for thisByte as integer = 0 to lastByte step keySize
        streamKey.Mid( thisByte, keySize ) = key
      next
    end if
    
    if isLittleEndian then
      //
      // Swap the bytes
      //
      dim streamKeyPtr as ptr = streamKey.Data
      dim swapIndex as integer
      while swapIndex < streamKeySize
        dim temp as UInt32 = streamKeyPtr.UInt32( swapIndex )
        streamKeyPtr.UInt32( swapIndex ) = _
        ( temp \\ kShift3 ) or _
        ( ( temp and kMask1 ) \\ kShift1 ) or _
        ( ( temp and kMask2 ) * kShift1 ) or _
        ( temp * kShift3 )
        
        swapIndex = swapIndex + 4
      wend
    end if
    
    keys( keyIndex ) = streamKey
  next
  
  #if kDebug then
    elapsedMs = Microseconds - startMs
    System.DebugLog logPrefix + "Stream keys took " + format( elapsedMs, "#,0.0##" ) + " µs"
  #endif
  
  //
  // Encoding starts here
  //
  
  dim a, b, c, d as integer // Used as indexes
  dim xl as UInt32 
  dim xr as UInt32 
  dim j1 as UInt32
  dim d0, d1 as UInt32
  dim barrier as integer
  dim pptrEncoderIndex as integer
  
  dim key as Xojo.Core.MutableMemoryBlock
  dim keyPtr as ptr
  dim keySize as integer
  
  dim streamIndex as integer
  dim sByteIndex, pByteIndex as integer
  dim keyIndex as integer
  
  for rep as integer = 1 to repetitions
    
    for keyIndex = 0 to keys.Ubound
      
      key = keys( keyIndex )
      keyPtr = key.Data
      keySize = key.Size
      
      barrier = keySize - kStreamWordSize
      streamIndex = 0
      for pByteIndex = 0 to kPLastByte step kStreamWordSize // Two words at a time
        'temp = Stream2Word( key, streamIndex, streamBuffer, streamBufferPtr )
        
        myPPtr.UInt64( pByteIndex ) = myPPtr.UInt64( pByteIndex ) xor keyPtr.UInt64( streamIndex )
        if streamIndex = barrier then
          streamIndex = 0
        else
          streamIndex = streamIndex + kStreamWordSize
        end if
      next pByteIndex
      
      d0 = kZero
      d1 = kZero
      
      for pByteIndex = 0 to kPLastByte step 8
        'self.Encipher( d0, d1 )
        
        xl = d0 xor myPPtr.UInt32( 0 )
        xr = d1
        
        for pptrEncoderIndex = 4 to kPLastInnerByte step 8
          a = ( xl \\ kShift3 )
          b = ( xl \\ kShift2 ) and kMask3
          c = ( xl \\ kShift1 ) and kMask3
          d = xl and kMask3
          
          j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
          xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
          + mySPtr.UInt32( ( 768 + d ) * 4 )
          
          xr = xr xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
          
          a = ( xr \\ kShift3 ) 
          b = ( xr \\ kShift2 ) and kMask3
          c = ( xr \\ kShift1 ) and kMask3
          d = xr and kMask3
          
          j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
          xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
          + mySPtr.UInt32( ( 768 + d ) * 4 )
          
          xl = xl xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex + 4 ) )
        next pptrEncoderIndex
        
        xr = xr xor myPPtr.UInt32( kPLastByte - 3 )
        
        d0 = xr
        d1 = xl
        
        
        myPPtr.UInt32( pByteIndex ) = d0
        myPPtr.UInt32( pByteIndex + 4 ) = d1
      next pByteIndex
      
      dim firstPPtr as UInt32 = myPPtr.UInt32( 0 )
      for sByteIndex = 0 to kSLastByte step 8
        'self.Encipher( d0, d1 )
        
        xl = d0 xor firstPPtr
        xr = d1
        
        for pptrEncoderIndex = 4 to kPLastInnerByte step 8
          a = ( xl \\ kShift3 )
          b = ( xl \\ kShift2 ) and kMask3
          c = ( xl \\ kShift1 ) and kMask3
          d = xl and kMask3
          
          j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
          xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
          + mySPtr.UInt32( ( 768 + d ) * 4 )
          
          xr = xr xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
          
          a = ( xr \\ kShift3 ) 
          b = ( xr \\ kShift2 ) and kMask3
          c = ( xr \\ kShift1 ) and kMask3
          d = xr and kMask3
          
          j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
          xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
          + mySPtr.UInt32( ( 768 + d ) * 4 )
          
          xl = xl xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex + 4 ) )
        next pptrEncoderIndex
        
        xr = xr xor myPPtr.UInt32( kPLastByte - 3 )
        
        d0 = xr
        d1 = xl
        
        
        mySPtr.UInt32( sByteIndex ) = d0
        mySPtr.UInt32( sByteIndex + 4 ) = d1
      next sByteIndex
      
    next keyIndex
  next rep
End Sub

Kem_Tekinay · November 27, 2018, 6:30pm

I think I’ve hit a wall here. I’m no assembly guru but when I disassemble my app, it looks like it’s using instructions to right shift instead of division, but I’m sure there is an opportunity to optimize at that level. Hopefully one of the engineers is inspired to look and it leads to something, but I can’t see anything else at the language level.

Tim_Hare · November 27, 2018, 7:59pm

If it’s right shifting, that’s about as optimized as you can get. I’m impressed that it would do that.

J_H_Timmerman · November 29, 2018, 1:17pm

@Kem T

Division is, from a hardware perspective, the operation with the worst performance,
as it needs to be implemented by a rather slow algorithm, not much unlike the one
used in education.

In some cases, like division by positive integer powers of 2, a clever compiler might be,
as seems to be in your case, able to replace the divide instruction by a shift-right, which
is much faster (“direct in hardware or microprogrammed”).

(
Although the later implementations of the x86 processors still have to implement all
the overhead of instructions of eons ago, limiting the effectiveness of many often
used instructions, shift-instructions are still usefull there.

Unfortunately, Apple decided to stop with the PowerPC series of processors, that had
an almost new design, without the overhead of Intel’s past, like the very complex
protected mode system (four rings and more) etc.
)

Christian_Schmitz · November 29, 2018, 4:25pm

I bet soon we get the first Intel CPUs with 64-bit only, so they can kill all the 32-bit legacy.

J_H_Timmerman · November 29, 2018, 4:48pm

@Christian Schmitz

That would be nice, but the overhead of the complexity of the layered protection-model
would still be there.

Also, the Intel CPU has always been CISC, not RISC-like as the Motorola/IBM/Apple PowerPC.

It is not just 32 bit, but even 16 bit legacy …

Greg_O_Lone · November 30, 2018, 7:17am

Kem,

Youre comparing Xojo to PHP, right? PHP does not have scoped variables. You might want to try declaring all of the variables up front for that first loop.

Kem_Tekinay · November 30, 2018, 7:18am

I’ll do that and report back, thanks.