I will make sure it still runs in 32 bit, but that’s it. It might be slow as heck.
May be remove the Ubound calls and do you need to reallocate new memoryblocks each time (streamkey) ?
Yes, but that that l takes around a microsecond.
Change
a = ( xl \\ kShift3 )
b = ( xl \\ kShift2 ) and kMask3
c = ( xl \\ kShift1 ) and kMask3
d = xl and kMask3
to
mb.UInt32Value(0)= xl // earlier dim mb as new MemoryBlock(4)
a= mb.byte(0)
b= mb.byte(1)
c= mb.byte(2)
d= mb.byte(3)
I thought the same thing and that’s how it was originally, but splitting it this way is faster.
It was broken and still is broken - it was documented as working, and it compiles, but does nothing. They have changed the documentation, but it still compiles and does nothing.
I agree - in the sense that “True/False” is fine - we don’t need another way to express booleans using On/Off. Just adds confusion.
Nice job on the optimizations. Between these three options (accessing bytes from a memoryblock, dividing by 2**N bits, or shifting N bits), one would think that the bitShift would be the fastest, but this is likely not the case.
It must be a compiler difference, then. Using a memoryblock was twice as fast, but I wasn’t using the optimizing compiler.
I’ll go back and try it again, but I also have to account for big-endian vs. little-endian when using Ptrs so “a” might be the first byte or it might be the last.
So far switching everything to UInteger has not shown a difference. I’m rather disappointed.
Just to complete that loop, I did a substitution like this and the time jumped from around 270 ms to over 300 ms.
mb4Ptr.UInt32( 0 ) = xr
a = mb4Ptr.Byte( aIndex )
b = mb4Ptr.Byte( bIndex )
c = mb4Ptr.Byte( cIndex )
d = mb4Ptr.Byte( dIndex )
'a = ( xr \\ kShift3 )
'b = ( xr \\ kShift2 ) and kMask3
'c = ( xr \\ kShift1 ) and kMask3
'd = xr and kMask3
The mb4Ptr points to a MemoryBlock that is created before the loops and reused.
A couple of points I missed:
@Philippe Schmid About Ubound: In this case, there will only be one or two keys, so I judged the use of the array an acceptable compromise. But in general, I’ve found Ubound to be harmless to performance. Many years ago I did a test and found that, yes, storing the Ubound value in a variable and using that in the loop over Ubound does make a difference but one that amounts to microseconds over millions of iterations.
@Michael Diehr I think you were referring to the Bitwise module and it’s functions, but those are function calls with their associated overhead. Doing the math directly by multiplying or dividing by powers of 2 is somewhat less readable, but does the same thing without the cost of calling a function.
@Tim Hare Your example used a MemoryBlock function, which is convenient since it will honor the LittleEndian property, but comes with the overhead of any function call.
BTW, I didn’t know that shifting left or right was a matter of multiplying or dividing by a power of two until @Norman Palardy mentioned it here some time ago. This forum is a wonderful resource and I’m always grateful for the tidbits I pick up from its members.
To clarify, if you want to shift bits x places to the left or right, you would multiply or (integer) divide by 2^x.
shifted = value * ( 2 ^ x ) // Shift left by x bits
shifted = value \\ ( 2 ^ x ) // Shift right by x bits
For those playing at home, this is the latest iteration. It has better named variables and uses byte indexes directly in the loops, but doesn’t really move the performance needle. I’m afraid this might be the best I can do unless someone who can read the compiled code can tell me where I might eliminate unneeded instructions.
Private Sub Expand0State(repetitions As Integer, ParamArray keys() As Xojo.Core.MutableMemoryBlock)
#if not DebugBuild
#pragma BackgroundTasks False
#pragma BoundsChecking False
#pragma NilObjectChecking False
#pragma StackOverflowChecking False
#endif
const kZero as UInt32 = 0
const kShift3 as UInt32 = 256 ^ 3
const kShift2 as UInt32 = 256 ^ 2
const kShift1 as UInt32 = 256 ^ 1
const kMask0 as UInt32 = &hFF000000
const kMask1 as UInt32 = &h00FF0000
const kMask2 as UInt32 = &h0000FF00
const kMask3 as UInt32 = &h000000FF
const kSLastByte as integer = 4095
const kPLastByte as integer = ( 18 * 4 ) - 1
const kPLastInnerByte as integer = kPLastByte - 7
dim myPPtr as ptr = PPtr
dim mySPtr as ptr = SPtr
dim isLittleEndian as boolean = self.IsLittleEndian
//
// Create the stream keys
//
const kStreamWordSize as integer = 8
#if kDebug then
dim startMs, elapsedMs as Double
dim logPrefix as string = CurrentMethodName + ": "
#endif
#if kDebug then
startMs = Microseconds
#endif
for keyIndex as integer = 0 to keys.Ubound
dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
dim keySize as integer = key.Size
if keySize = 0 then
RaiseErrorIf( true, kErrorKeyCannotBeEmpty )
end if
WasKeySet = true
dim streamKey as Xojo.Core.MutableMemoryBlock
dim streamKeySize as integer = keySize
if ( keySize mod kStreamWordSize ) = 0 then
streamKey = new Xojo.Core.MutableMemoryBlock( keySize )
streamKey.Left( keySize ) = key
else
streamKeySize = streamKeySize * kStreamWordSize
streamKey = new Xojo.Core.MutableMemoryBlock( streamKeySize )
dim lastByte as integer = streamKeySize - 1
for thisByte as integer = 0 to lastByte step keySize
streamKey.Mid( thisByte, keySize ) = key
next
end if
if isLittleEndian then
//
// Swap the bytes
//
dim streamKeyPtr as ptr = streamKey.Data
dim swapIndex as integer
while swapIndex < streamKeySize
dim temp as UInt32 = streamKeyPtr.UInt32( swapIndex )
streamKeyPtr.UInt32( swapIndex ) = _
( temp \\ kShift3 ) or _
( ( temp and kMask1 ) \\ kShift1 ) or _
( ( temp and kMask2 ) * kShift1 ) or _
( temp * kShift3 )
swapIndex = swapIndex + 4
wend
end if
keys( keyIndex ) = streamKey
next
#if kDebug then
elapsedMs = Microseconds - startMs
System.DebugLog logPrefix + "Stream keys took " + format( elapsedMs, "#,0.0##" ) + " µs"
#endif
//
// Encoding starts here
//
dim a, b, c, d as integer // Used as indexes
dim xl as UInt32
dim xr as UInt32
dim j1 as UInt32
dim d0, d1 as UInt32
dim barrier as integer
dim pptrEncoderIndex as integer
dim key as Xojo.Core.MutableMemoryBlock
dim keyPtr as ptr
dim keySize as integer
dim streamIndex as integer
dim sByteIndex, pByteIndex as integer
dim keyIndex as integer
for rep as integer = 1 to repetitions
for keyIndex = 0 to keys.Ubound
key = keys( keyIndex )
keyPtr = key.Data
keySize = key.Size
barrier = keySize - kStreamWordSize
streamIndex = 0
for pByteIndex = 0 to kPLastByte step kStreamWordSize // Two words at a time
'temp = Stream2Word( key, streamIndex, streamBuffer, streamBufferPtr )
myPPtr.UInt64( pByteIndex ) = myPPtr.UInt64( pByteIndex ) xor keyPtr.UInt64( streamIndex )
if streamIndex = barrier then
streamIndex = 0
else
streamIndex = streamIndex + kStreamWordSize
end if
next pByteIndex
d0 = kZero
d1 = kZero
for pByteIndex = 0 to kPLastByte step 8
'self.Encipher( d0, d1 )
xl = d0 xor myPPtr.UInt32( 0 )
xr = d1
for pptrEncoderIndex = 4 to kPLastInnerByte step 8
a = ( xl \\ kShift3 )
b = ( xl \\ kShift2 ) and kMask3
c = ( xl \\ kShift1 ) and kMask3
d = xl and kMask3
j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
+ mySPtr.UInt32( ( 768 + d ) * 4 )
xr = xr xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
a = ( xr \\ kShift3 )
b = ( xr \\ kShift2 ) and kMask3
c = ( xr \\ kShift1 ) and kMask3
d = xr and kMask3
j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
+ mySPtr.UInt32( ( 768 + d ) * 4 )
xl = xl xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex + 4 ) )
next pptrEncoderIndex
xr = xr xor myPPtr.UInt32( kPLastByte - 3 )
d0 = xr
d1 = xl
myPPtr.UInt32( pByteIndex ) = d0
myPPtr.UInt32( pByteIndex + 4 ) = d1
next pByteIndex
dim firstPPtr as UInt32 = myPPtr.UInt32( 0 )
for sByteIndex = 0 to kSLastByte step 8
'self.Encipher( d0, d1 )
xl = d0 xor firstPPtr
xr = d1
for pptrEncoderIndex = 4 to kPLastInnerByte step 8
a = ( xl \\ kShift3 )
b = ( xl \\ kShift2 ) and kMask3
c = ( xl \\ kShift1 ) and kMask3
d = xl and kMask3
j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
+ mySPtr.UInt32( ( 768 + d ) * 4 )
xr = xr xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex ) )
a = ( xr \\ kShift3 )
b = ( xr \\ kShift2 ) and kMask3
c = ( xr \\ kShift1 ) and kMask3
d = xr and kMask3
j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
+ mySPtr.UInt32( ( 768 + d ) * 4 )
xl = xl xor ( j1 xor myPPtr.UInt32( pptrEncoderIndex + 4 ) )
next pptrEncoderIndex
xr = xr xor myPPtr.UInt32( kPLastByte - 3 )
d0 = xr
d1 = xl
mySPtr.UInt32( sByteIndex ) = d0
mySPtr.UInt32( sByteIndex + 4 ) = d1
next sByteIndex
next keyIndex
next rep
End Sub
I think I’ve hit a wall here. I’m no assembly guru but when I disassemble my app, it looks like it’s using instructions to right shift instead of division, but I’m sure there is an opportunity to optimize at that level. Hopefully one of the engineers is inspired to look and it leads to something, but I can’t see anything else at the language level.
If it’s right shifting, that’s about as optimized as you can get. I’m impressed that it would do that.
@Kem T
Division is, from a hardware perspective, the operation with the worst performance,
as it needs to be implemented by a rather slow algorithm, not much unlike the one
used in education.
In some cases, like division by positive integer powers of 2, a clever compiler might be,
as seems to be in your case, able to replace the divide instruction by a shift-right, which
is much faster (“direct in hardware or microprogrammed”).
(
Although the later implementations of the x86 processors still have to implement all
the overhead of instructions of eons ago, limiting the effectiveness of many often
used instructions, shift-instructions are still usefull there.
Unfortunately, Apple decided to stop with the PowerPC series of processors, that had
an almost new design, without the overhead of Intel’s past, like the very complex
protected mode system (four rings and more) etc.
)
I bet soon we get the first Intel CPUs with 64-bit only, so they can kill all the 32-bit legacy.
@Christian Schmitz
That would be nice, but the overhead of the complexity of the layered protection-model
would still be there.
Also, the Intel CPU has always been CISC, not RISC-like as the Motorola/IBM/Apple PowerPC.
It is not just 32 bit, but even 16 bit legacy …
Kem,
Youre comparing Xojo to PHP, right? PHP does not have scoped variables. You might want to try declaring all of the variables up front for that first loop.
I’ll do that and report back, thanks.