Large memoryblocks under 64 bit

Change it to

mb.uint8Value( j ) = 127

It still takes forever, but it doesn’t crash and I can see the application actually showing as using 5 GB of RAM.

Update: with the below changes I can get it to fill 5GB of memory in 22 seconds (a 2012 rMBP with 16GB of RAM). Still far off the 10 seconds for the C plugin.

[code]#pragma BackgroundTasks Off
#pragma BoundsChecking Off
#pragma BreakOnExceptions Off
#pragma NilObjectChecking Off
#pragma StackOverflowChecking Off

Dim mb as new Xojo.Core.MutableMemoryBlock( 5e9 )
Dim tSize as Uint64 = mb.size
dim p as Ptr = mb.Data

// fastFillPtr8(p, 127, tSize)
// b = checkPtr8values(p, 127, tSize)

Dim methodTimer as double = microseconds
system.DebugLog currentMethodName + " started, using " + formatBytes( tSize )

Dim n as Uint64 = tSize -1

Dim tp as new memoryBlock( 8 )
For l as integer = 0 to 7
tp.Uint8Value( l ) = 127
Next

for j as Uint64 = 0 to n step 16
mb.Uint64Value( j ) = tp.Uint64Value( 0 )
mb.Uint64Value( j + 8 ) = tp.Uint64Value( 0 )
next

system.DebugLog currentMethodName + " completed in " + format( ( microseconds - methodtimer ) / 1000000, "###,###,###.000" ) + " seconds."[/code]

How about 5GB in 4.2 seconds?

[code]#pragma BackgroundTasks Off
#pragma BoundsChecking Off
#pragma BreakOnExceptions Off
#pragma NilObjectChecking Off
#pragma StackOverflowChecking Off

'Dim mb as new Xojo.Core.MutableMemoryBlock( ( 1000 * 1000 ) * 100 )
Dim mb as new Xojo.Core.MutableMemoryBlock( 5e9 )
Dim tSize as Uint64 = mb.size
dim p as Ptr = mb.Data

// fastFillPtr8(p, 127, tSize)
// b = checkPtr8values(p, 127, tSize)

Dim methodTimer as double = microseconds
system.DebugLog currentMethodName + " started, using " + formatBytes( tSize )

Dim n as Uint64 = tSize -1

Dim tp as new xojo.core.MutableMemoryBlock( 128 )
For l as integer = 0 to 127
tp.Uint8Value( l ) = 127
Next

for j as Uint64 = 0 to n step 256
mb.mid( j, 128 ) = tp
mb.mid( j + 128, 128 ) = tp
next

system.DebugLog currentMethodName + " completed in " + format( ( microseconds - methodtimer ) / 1000000, "###,###,###.000" ) + " seconds."[/code]

errr, that was < 0.1 sec for the C plugin :wink:

Regardless, very clever, I wonder how much faster my plugin would be if I too moved 64 bits at a time… (plus in real use the values won’t all be the same as I’m sure you surmise). The C code follows the same syntax as the xojo loop, so I wonder where such significant inefficiencies creep in.

No matter. The point is, I always use Ptrs to memoryBlocks because access is much faster, and you have shown that addressing the MutableMemoryBlock directly with Uint8Value does not crash, so something is amiss when a Ptr is returned.

I trust Greg will take note as part of this case # and Xojo will get this fixed.

Cheers,
P.

Pretty cool. Your CPU must be faster than mine - your code runs in 5.9 seconds on my machine.

Here’s a version which runs in 4.2 seconds (about 30% faster) - basically it’s the same as yours but with a larger chunk size:

#pragma BackgroundTasks Off
#pragma BoundsChecking Off
#pragma BreakOnExceptions Off
#pragma NilObjectChecking Off
#pragma StackOverflowChecking Off

Dim mb as new Xojo.Core.MutableMemoryBlock( 5e9 )
Dim tSize as Uint64 = mb.size
dim p as Ptr = mb.Data

Dim methodTimer as double = microseconds
'system.DebugLog currentMethodName + " started, using " + formatBytes( tSize )

Dim n as Uint64 = tSize -1
const kChunkSize = 128*1024
Dim tmp as new Xojo.Core.MutableMemoryBlock( kChunkSize )
For l as integer = 0 to kChunkSize - 1
  tmp.Uint8Value( l ) = 127
Next

for j as Uint64 = 0 to n step kChunkSize
  dim L as uint64 = min(kChunkSize, n-j)
  mb.mid(j, L) = tmp
next

msgbox currentMethodName + " completed in " + format( ( microseconds - methodtimer ) / 1000000, "###,###,###.000" ) + " seconds."

I think the theoretical max memory bandwidth of a core i7 is around 20-40GB/second, so filling 2GB in 0.1 seconds is theoretically possible - is your C plugin highly optimized?

If it is, it’s optimized by the compiler [-Ofast flag], not me, my code is pretty simple:

[code]static void fastFillPtr8 (Ptr destPtr, long value, long nUInt8s)
{ UInt8 *destPtr8;
long n, limit;
UInt8 localValue;

if (destPtr == NULL)	return;
if ( (value<0) || (value>255) )	return;
destPtr8 = (UInt8*) destPtr;
localValue = (UInt8) value;
limit = nUInt8s;

for (n=0; n<limit; n++)
{	destPtr8[n] = localValue;
}

return;

}
REALmethodDefinition fastFillPtr8Defn = {(REALproc) fastFillPtr8,REALnoImplementation,"fastFillPtr8(destPtr as Ptr, value as integer, nUInt8s as integer)"};[/code]

I copy the fn arguments to local variables and assume they are then placed in registers by the compiler, plus maybe some loop unrolling or whatever other optimizations are possible under LLVM 9.0. But isn't Xojo also LLVM nowadays?

That’s a good point, I forgot that I was compiling with optimization set to “Default”. But changing it to Aggressive and re-building did not make a difference. This is not surprising, since I would bet that 99% of the time is spent not in Xojo code but in the memoryBlock.mid() function, which is probably macOS library code.

I concur, and I would like to thank you for illustrating this to me. In a quick test I did yesterday, I managed to get a routine down from 2.7 seconds to 0.7 seconds; with some loop unrolling and the aggressive compiler setting, it came down to 0.2 seconds (the routine was processing a 4K image).

It’s still 10× slower than using Core Image, however it will vastly improve some of my pixel analysis routines and also speed up building of cluts, which hopefully means I can create the full clut in a slider ValueChanged event rather than a cut-down version.

p.s. Core Image is fantastic for altering pixels, but terrible for pixel analysis.

So thanks for sharing that tidbit Peter.

for j as Uint64 = 0 to n step kChunkSize
  mb.mid(j, min(kChunkSize, n-j) ) = tmp
next

By doing the math directly inline, in theory you should save a write to memory and then a read from memory. Ideally if you want to unroll the function, you’re going to have to use fixed lengths.

Apple’s vImage will only work with images whose sizes are multiples of 4. I assume this is so that it can use fixed sizes.

I tried this, and it seems to make no noticeable difference - I suspect a good optimizing compiler would know that the variable “L” was only used for the function call and optimize it to being stored in a CPU register.

Would it not be faster to use UInt64 and 0x7f7f7f7f7f (or whatever the correct value would be) instead?

The technique we are using first fills a medium-sized memoryBlock with 0x7F and then repeatedly copies this memoryBlock into the big memoryBlock. Yes, it should be faster to fill the medium-sized memoryBlock with UInt64s rather than UInt8s.

But, I think most of the time is actually spent in the second step, so optimizing the first step doesn’t make a big difference.

Some additional ideas:

• would it be faster to fill the big memory block from the end to the beginning? My logic: if you start at the end, then the memory manager definitely has to create the full 5GB size on the first loop, and should never have to resize the memoryBlock after that point. If you start at the beginning however, it could be that the memory manager is constantly resizing the memoryBlock which could be very slow. I have no evidence of this however.

• would it be faster to fill the big memory block by copying from itself, rather than the small memory block? Perhaps the “most recently copied bytes” would still be in a cache and faster to access?

• I wonder whether the .mid() function is optimized, or whether it performs a full memory copy? If so, this should be faster if we could rewrite it using Ptrs (but the point of this bug report is that Ptrs are broken)

[quote=399779:@Michael Diehr]
• would it be faster to fill the big memory block from the end to the beginning? My logic: if you start at the end, then the memory manager definitely has to create the full 5GB size on the first loop, and should never have to resize the memoryBlock after that point. If you start at the beginning however, it could be that the memory manager is constantly resizing the memoryBlock which could be very slow. I have no evidence of this however.[/quote]

I just tried this, and it makes no difference.

Also, sampling the code while running shows:

97.5% Xojo.Core.MutableMemoryBlock.=Mid%%o<Xojo.Core.MutableMemoryBlock>u8u8o<Xojo.Core.MemoryBlock>  (in memfill-aggressive) + 45  [0x1023269ed]
97.4%  _platform_memmove$VARIANT$Ivybridge  (in libsystem_platform.dylib) + 49,52  [0x7fffe45b6fd1,0x7fffe45b6fd4]

which suggests that almost the entirety of what’s happening is inside the _platform_memmove() function

I wonder if LLVM optimizations in Xcode would mean automatic usage of the simd library for parallel filling of the memoryblock?
(On the other hand, I am pretty sure using it via declares would boost the Xojo code considerably.)

I don’t think so. When you dim a memoryblock for a certain size, exactly that chunk of memory is reserved (if the system can do so). Moving a several GB sized MB in memory for resizing purposes would be incredibly time consuming, very much like adding something to a long string repeatedly instead of using Join.