Optimizing a bitwise loop

Hi Everyone,

I wonder if anyone can provide suggestions on a bit of code optimization? I have a bunch of memoryblocks totaling around 15MB in size. What I’m trying to do is build up a matrix of the co-occurrence of 12-bit numbers within about ~80bytes of each other in this set of memoryblocks.

To extract the 12-bit numbers from the memoryblock I’m basically pulling out a UInt64 then shifting right then left with bitwise.shift to
remove the bits either side of the set of 12 bits I want to grab. As bitshifting can be done very quickly on modern CPUs I’d hoped
this whole process would take just a few seconds. However it’s taking about 2 minutes (single threaded) at a rate of about 1 million loops/s or ~2400 CPU cycles per loop. This doesn’t sound like very much time, but I have about 100,000 of these 15MB sets to work through and unless I can speed up I’m going to have to maybe look to do this in C.

Any advice on speeding things up in Xojo? I’d considered moving everything inside the loop into a C plugin but wouldn’t want to go that way unless absolutely necessary as mixed code is messy. The rough code for the loop is below (note that this checks two sets of memoryblocks - one acts as a mask for regions not to include in the matrix)…

dim t1,t2,zero as uint64
dim realC(4095,4095,39) as UInt16

for x=1 to z
mtemp=new MemoryBlock(j)
mmask=new MemoryBlock(j)
j=j-10 'masked buffer at the end of the memoryblock to stop UInt64 overflow
k=j8-24 'bit positions
[b]for i=0 to k step 2
if Bitwise.ShiftRight(Bitwise.ShiftLeft(mmask.UInt64Value(l),i-l),52)=zero then
for o=m to n step 2
if Bitwise.ShiftRight(Bitwise.ShiftLeft(mmask.UInt64Value(l2),o-l2),52)=zero then
end if
end if

Many thanks for any time-saving ideas!


#pragma disableBackgrondTasks
will help.

I bet I can make it 100 times faster in C, if you’d order a plugin function to do it.
(email me if you are interested)

Thanks Christian - yup, I have that pragma and disable bounds checking set. I was really hoping to stay in Xojo here, but I guess bitwise has a function call overhead that’s going to be very hard to optimize even though the CPU’s bit operations are inherently very fast. I’ll email you offlist if I have trouble and need a plugin written (now where was that add5 example plugin code again :slight_smile:

Are you in Las Vegas? We could even write it together…

The Bitwise calls are functions and involve the normal method overhead. Unfortunately, there are no equivalent operators as there are for And, Or, and Xor.

What might help a little is to use Ptr to the MemoryBlock since MemoryBlock functions involve method overhead too.

Always be careful with Ptr! If you’re code is faulty, you will overwrite memory.

You code would change a little like this:

mtemp=new MemoryBlock(j)
mmask=new MemoryBlock(j)

dim ptemp as Ptr = mtemp
dim pmask as Ptr = mmask
if Bitwise.ShiftRight(Bitwise.ShiftLeft(pmask.UInt64Value(l),i-l),52)=zero then

Given your code, I wouldn’t expect this to be a dramatic difference.

@Kem Tekinay … many thanks for the tip Kem. I’ll take a look at the speed effects of the ptr-based function.

@Christian Schmitz … sadly I’m not going to be out of cold wet Montreal anytime soon.

You should precalculate the mask and use AND to isolate the bits you want, instead of using Shift left/right.

Hi Tim, thanks for your input! I can certainly create an array of masks for each 12-bit offset within the 64bit word. But then I think I still need to shift to get the 12-bits I’ve AND-masked down into the lowest 12 bits of the word? Unless you have a quicker way?

What I was just wondering about was using simple multiplication and division to slide the bits rather than explicitly bitshifting ie multiplying by pow(2,shift) or dividing by pow(2,shift). It would probably trigger some form of overflow or underflow exception but might still be quicker than the function overhead…

Christian has a good point though. Any REAL/Xojo programmer that wants MORE out of REAL/Xojo should either have a utility plugin or DLL/dylib written in C++ to provide library functions like Adrian wants; something like this will just fly in C/C++.

So, every time you want something to go faster or if you want some type of functionality that just doesn’t sit right in REAL/Xojo, do it in C/C++ in your plugin/DLL/dylib and you got it.

Kem and I worked out something very similar in another thread about finding dna matches. That project was finding 950 “15-mers” in 100Mb of dna. But the dna was 1 nucleotide per byte which makes 12.5M nucleotides, so about equivalent to your 15MB. Anyways, it got down to 11.5 seconds to do the whole scan.


The basic idea is to window the mask into a dictionary. Scan a byte at a time using the dictionary to find matches.

For the record, when Will includes me, he is being kind. That code was all him, and remarkably good work.

Thanks Will, Kem - yes, you’re spot on with what I’m looking for here!

What I’m trying to determine is an adjacency matrix of all possible hexamers in various conserved genomic elements relative to thousands of randomly permuted backgrounds. I started with string hunting but since I want all possible hexamers, it turns out to be easier I think to just catalog the relative positions of all 4096x4096 elements relative to each other in 40nt windows. I was trying to keep this in integer math for speed and it looks like we’ve converged on some similar solutions. I’m going to go take a look at your Masking approach as this looks like a good way forward! Thanks!