I am optimizing my Bcrypt code trying to get it closer in performance to, say. PHP, and have discovered something surprising. In the main time-sucking method, if I comment out the lines that write back to a MemoryBlock via Ptr.UInt32( index ), my test code runs (and fails) in about 18 ms. Once I put those lines back, it takes about 250 ms. Mind you, the MemoryBlock is 4,096 bytes and I have to cycle through every one of them several times, but the code reads from that MemoryBlock too (several times on each pass, in fact), and reading doesn’t suffer the same performance issues as writing.
But I am using a ptr, which was the point of the original question.
Someone mentioned that perhaps commenting those lines is skewing my results if the compiler realizes the calculations aren’t needed and eliminates the loop entirely. I’ll test that in a bit.
Beware that #pragma BackgroundTasks OFF is broken: <https://xojo.com/issue/53396> - this confused me when doing some performance testing. #Pragma BackgroundTasks False works.
Well, this is embarrassing. It turns out that @Jeremy Cowgar was right, the compiler optimized away my loop when I commented out the part where the values are written back to the MemoryBlock, so it’s not that the writes are particularly slow at all!
This is the function if anyone wants to make suggestions. Note that this has been inlined intentionally to avoid the overhead of function calls.
Private Sub Expand0State(repetitions As Integer, ParamArray keys() As Xojo.Core.MutableMemoryBlock)
#if not DebugBuild
#pragma BackgroundTasks False
#pragma BoundsChecking False
#pragma NilObjectChecking False
#pragma StackOverflowChecking False
#endif
const kLastIndex as integer = BLF_N + 1 // = 17
const kShift3 as UInt32 = 256 ^ 3
const kShift2 as UInt32 = 256 ^ 2
const kShift1 as UInt32 = 256 ^ 1
const kMask0 as UInt32 = &hFF000000
const kMask1 as UInt32 = &h00FF0000
const kMask2 as UInt32 = &h0000FF00
const kMask3 as UInt32 = &h000000FF
dim myPPtr as ptr = PPtr
dim mySPtr as ptr = SPtr
//
// Create the stream keys
//
for keyIndex as integer = 0 to keys.Ubound
dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
dim keySize as integer = key.Size
if keySize = 0 then
RaiseErrorIf( true, kErrorKeyCannotBeEmpty )
end if
WasKeySet = true
dim streamKey as Xojo.Core.MutableMemoryBlock
dim streamKeySize as integer = keySize
if ( keySize mod 4 ) = 0 then
streamKey = new Xojo.Core.MutableMemoryBlock( keySize )
streamKey.Left( keySize ) = key
else
streamKeySize = streamKeySize * 4
streamKey = new Xojo.Core.MutableMemoryBlock( streamKeySize )
streamKey.Left( keySize ) = key
streamKey.Mid( keySize, keySize ) = key
streamKey.Mid( keySize + keySize, keySize ) = key
streamKey.Right( keySize ) = key
end if
if IsLittleEndian then
//
// Swap the bytes
//
dim streamKeyPtr as ptr = streamKey.Data
dim swapIndex as integer
while swapIndex < streamKeySize
dim temp as UInt32 = streamKeyPtr.UInt32( swapIndex )
streamKeyPtr.UInt32( swapIndex ) = _
( temp \\ kShift3 ) or _
( ( temp and kMask1 ) \\ kShift1 ) or _
( ( temp and kMask2 ) * kShift1 ) or _
( temp * kShift3 )
swapIndex = swapIndex + 4
wend
end if
keys( keyIndex ) = streamKey
next
for rep as integer = 1 to repetitions
for keyIndex as integer = 0 to keys.Ubound
dim key as Xojo.Core.MutableMemoryBlock = keys( keyIndex )
dim keyPtr as ptr = key.Data
dim keySize as integer = key.Size
dim j as integer
dim i, k as integer, arrIndex as integer
dim temp as UInt32
dim d0, d1 as UInt32
for i = 0 to kLastIndex
'temp = Stream2Word( key, j, streamBuffer, streamBufferPtr )
if j = keySize then
j = 0
end if
temp = keyPtr.UInt32( j )
j = j + 4
arrIndex = i * 4
myPPtr.UInt32( arrIndex ) = myPPtr.UInt32( arrIndex ) xor temp
next i
dim a, b, c, d as integer // Used as indexes
dim inner as integer
dim xl as UInt32
dim xr as UInt32
dim j1 as UInt32
j = 0
arrIndex = 0
for i = 0 to kLastIndex step 2
'self.Encipher( d0, d1 )
xl = d0
xr = d1
xl = xl xor myPPtr.UInt32( 0 )
for inner = 1 to 16 step 2
j1 = xl
a = ( j1 \\ kShift3 )
b = ( j1 \\ kShift2 ) and kMask3
c = ( j1 \\ kShift1 ) and kMask3
d = j1 and kMask3
j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
+ mySPtr.UInt32( ( 768 + d ) * 4 )
xr = xr xor ( j1 xor myPPtr.UInt32( inner * 4 ) )
j1 = xr
a = ( j1 \\ kShift3 )
b = ( j1 \\ kShift2 ) and kMask3
c = ( j1 \\ kShift1 ) and kMask3
d = j1 and kMask3
j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
+ mySPtr.UInt32( ( 768 + d ) * 4 )
xl = xl xor ( j1 xor myPPtr.UInt32( ( inner + 1 ) * 4 ) )
next inner
xr = xr xor myPPtr.UInt32( 17 * 4 )
d0 = xr
d1 = xl
myPPtr.UInt32( arrIndex ) = d0
arrIndex = arrIndex + 4
myPPtr.UInt32( arrIndex ) = d1
arrIndex = arrIndex + 4
next i
dim firstPPtr as UInt32 = myPPtr.UInt32( 0 )
arrIndex = 0
for i = 0 to 3
for k = 0 to 255 step 2
'self.Encipher( d0, d1 )
xl = d0
xr = d1
xl = xl xor firstPPtr
for inner = 1 to 16 step 2
j1 = xl
a = ( j1 \\ kShift3 )
b = ( j1 \\ kShift2 ) and kMask3
c = ( j1 \\ kShift1 ) and kMask3
d = j1 and kMask3
j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
+ mySPtr.UInt32( ( 768 + d ) * 4 )
xr = xr xor ( j1 xor myPPtr.UInt32( inner * 4 ) )
j1 = xr
a = ( j1 \\ kShift3 )
b = ( j1 \\ kShift2 ) and kMask3
c = ( j1 \\ kShift1 ) and kMask3
d = j1 and kMask3
j1 = ( ( mySPtr.UInt32( a * 4 ) + mySPtr.UInt32( ( 256 + b ) * 4 ) ) _
xor mySPtr.UInt32( ( 512 + c ) * 4 ) ) _
+ mySPtr.UInt32( ( 768 + d ) * 4 )
xl = xl xor ( j1 xor myPPtr.UInt32( ( inner + 1 ) * 4 ) )
next inner
xr = xr xor myPPtr.UInt32( 17 * 4 )
d0 = xr
d1 = xl
mySPtr.UInt32( arrIndex ) = d0
arrIndex = arrIndex + 4
mySPtr.UInt32( arrIndex ) = d1
arrIndex = arrIndex + 4
next k
next i
next keyIndex
next rep
End Sub
With the same settings and aggressive compile, my Bcrypt code takes about 250 ms, whereas Bcrypt through PHP takes about 210 ms. I’m trying to find that other 40 ms.
Do you use Zend? Does PHP stay in memory all the time? When you checked the speed, had php cached or precompiled the script? Is PHP set up to use multiple cores?
What Im getting at is that youre probably not comparing apples to apples here.
I’m running it through the command line with a simple php -r '<script>', but you bring up an interesting point about caching issues as I’m just looping over the same password 20 times and taking the average. I’ll think about how to update that so it will have to calculate each time.
But even if there is speed increase on a 64 bit OS, is there a chance the code will need to run on a 32 bit OS? (maybe on Windows?) If so, do you care about that?