Optimizing the speed of a pixel processing algorithm

Alwyn_Bester · February 25, 2014, 7:13am

Can anyone spot a way to increase the processing speed of the following algorithm? I’m tapped out of ideas on how to do so.

Is there perhaps a way to get a Ptr or MemoryBlock of a Picture’s data? If not, I think a feature to do so would make a big difference when building speed-critical image processing applications with Xojo.

Function ScalePicture(p as Picture, newWidth as integer, newHeight as Integer) As Picture
  // based on code from Dr. Gerard Hammond
  // with performance/functional improvements by Tomis Erwin
  // support for alpha channel and minor performance improvements by Alwyn Bester
  
  #if DebugBuild=False then
    #pragma BackgroundTasks false
    #pragma BoundsChecking false
    #pragma NilObjectChecking false
    #pragma StackOverflowChecking false
  #endif
  
  Dim pOut as Picture
  Dim s, sm As RGBSurface
  Dim o, om As RGBSurface
  Dim x,y,xMax, yMax As Integer
  Dim xx() as Double
  Dim c1, c2, c3, c4 As Color
  Dim xMult, yMult, a, b, xSub, ySub, xAdd, yAdd as Double
  Dim aPlusXAdd As Integer
  Dim bPlusYAdd As Integer
  Dim alpha As Integer
  
  s = p.RGBSurface
  sm = p.Mask.RGBSurface
  
  pOut= New Picture( newWidth, newHeight, 32 )
  
  o = pOut.RGBSurface
  om = pOut.Mask.RGBSurface
  
  xMax = pOut.Width - 1
  yMax = pOut.Height - 1
  
  yMult=p.Height / newHeight
  xMult=p.Width / newWidth
  
  a=newWidth/p.Width
  if a>.5 then
    xSub=.45
    xAdd=.5
  Elseif a<.5 then
    xSub=.75
    xAdd=2
  else
    xSub=0
    xAdd=1
  end
  
  a=newHeight/p.Height
  if a>.5 then
    ySub=.45
    yAdd=.5
  Elseif a<.5 then
    ySub=.75
    yAdd=2
  else
    ySub=0
    yAdd=1
  end
  
  Redim xx(xMax)
  
  for x = 0 to xMax
    xx(x)=(x * xMult) - xSub
  next x
  
  For y = 0 To yMax
    
    b = (y * yMult) - ySub
    bPlusYAdd = b + yAdd
    
    For x = 0 To xMax 
      a = xx(x)
      
      aPlusXAdd = a + xAdd
      
      c1 = s.Pixel(a       , b  )
      c2 = s.Pixel(aPlusXAdd , b )
      c3 = s.Pixel(a       , bPlusYAdd)
      c4 = s.Pixel(aPlusXAdd , bPlusYAdd)
      
      o.Pixel(x, y) = RGB( _
      (c1.Red + c2.Red + c3.Red + c4.Red) \ 4, _
      (c1.Green + c2.Green + c3.Green + c4.Green) \ 4, _
      (c1.Blue + c2.Blue + c3.Blue + c4.Blue) \ 4 _
      )
      
      c1 = sm.Pixel(a       , b  )
      c2 = sm.Pixel(aPlusXAdd, b )
      c3 = sm.Pixel(a       , bPlusYAdd)
      c4 = sm.Pixel(aPlusXAdd, bPlusYAdd)
      
      alpha = (c1.Red + c2.Red + c3.Red + c4.Red) \ 4
      
      om.Pixel(x, y) = RGB( alpha, alpha, alpha )
      
    Next x
    
  Next y
  
  Return pOut
End Function

Ulrich_Bogun · February 25, 2014, 12:27pm

You mean memoryblock = picture.getdata (format, quality)?

You can surely speed things up by dividing the picture into as many slices as there are CPU cores and setting up that many multitasked sub-apps. Errr btw: Can Xojo use GPU features? All those doubles are handled much more easily by a GPU.

Alwyn_Bester · February 25, 2014, 12:47pm

Yes, so that one can loop through the pixel data as bytes, avoiding a call to RGBSurface.Pixel(x, y) that returns an object for each pixel.

[quote=67613:@Ulrich Bogun]You mean memoryblock = picture.getdata (format, quality)?

You can surely speed things up by dividing the picture into as many slices as there are CPU cores and setting up that many multitasked sub-apps. Errr btw: Can Xojo use GPU features? All those doubles are handled much more easily by a GPU.[/quote]

Haven’t used threads before… guess this is a great opportunity for me to see how I could potentially use threads to speed up the processing. Will have a look into this.

Not sure about using GPU features. Probably doable with declares of some sort?

Ulrich_Bogun · February 25, 2014, 1:15pm

GPU: Quite certainly done by declares and therefore beyond my scope.
Regarding the threads: Better check first whether normal threads would do it (as far as I know, the real parallel processing of Xojo threads is a bit limited), or whether it would be better to set up separate windowless apps like in the multiprocessing example.

Alwyn_Bester · February 25, 2014, 1:34pm

The problem with the current picture.GetData() is that it returns the data in a format such as JPG, PNG, BMP etc., and not in raw RGB bytes.

But perhaps I should have a look at the picture.getdata method again. Used together with Picture.FromData() it could be a solution.

I just wish there was a way to get a MemoryBlock of the Picture.RGBSurface object, so that one could manipulate the pixel byte data directly. This would really speed up things a lot.

Will_Shank · February 25, 2014, 1:42pm

Got your code set up to time a test image in a built app; it runs at around 97-99 thousand microseconds.

Changing the setting of alpha from RGB() to a lookup in a Color array runs at around 95-97.

[code]//before the xy loop
static greys(-1) As Color
if greys.Ubound < 255 then
redim greys(255)
for x = 0 to 255
greys(x) = RGB(x, x, x)
next
end

//in the loop switch this line
//om.Pixel(x, y) = RGB( alpha, alpha, alpha )
om.Pixel(x, y) = greys(alpha)[/code]

And I noticed a, b, aPlusXAdd and bPlusYAdd are doubles. Copying those values to ints and using those vars where ints are expected runs at around 87-90.

[code]dim ai, bi, aip, bip As integer
//…

For y = 0 To yMax

b = (y * yMult) - ySub
bPlusYAdd = b + yAdd
bi = b                  //copy to ints
bip = bPlusYAdd

For x = 0 To xMax
  a = xx(x)
  aPlusXAdd = a + xAdd
  ai = a                 //copy to ints
  aip = aPlusXAdd
  
  c1 = s.Pixel(ai       , bi  )  //use the ints
  c2 = s.Pixel(aip , bi )
  c3 = s.Pixel(ai       , bip)
  c4 = s.Pixel(aip , bip)
  //...
  c1 = sm.Pixel(ai       , bi  )
  c2 = sm.Pixel(aip, bi )
  c3 = sm.Pixel(ai       , bip)
  c4 = sm.Pixel(aip, bip)[/code]

Alwyn_Bester · February 25, 2014, 1:47pm

Excellent, thanks Will.

Julen_I · February 25, 2014, 1:53pm

Not declares, but a language that can use the framework that exposes the GPU. OpenCL is one such language. Maybe you can compile OpenCL code to a DLL (or the corresponding macOS and Linux library format) that can be used in Xojo via declares, I don’t know. But you can’t simply use the GPU via system declares.

[quote=67619:@Ulrich Bogun]Regarding the threads: Better check first whether normal threads would do it (as far as I know, the real parallel processing of Xojo threads is a bit limited), or whether it would be better to set up separate windowless apps like in the multiprocessing example.[/quote]
Correct, if you want to use more than one core in Xojo you need to launch several applications and make them work in parallel. If you use the standard Xojo threads you will only be using one core, so no speed gain.

A few months back there was a Xojo blog post on this topic: http://www.xojo.com/blog/en/2013/07/take-advantage-of-your-multi-core-processor.php
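As a rough illustration of the "one slice per helper app" idea discussed above, here is a minimal sketch of how the output rows could be divided evenly among several workers. `SliceBounds` is a made-up helper name, not part of Xojo, and how the helper apps are launched and how they receive their slice is left out entirely.

```xojo
// Hypothetical helper (not a Xojo framework method): computes the first and
// last output row that worker number sliceIndex (0-based) should process
// when a picture of the given height is split into sliceCount slices.
Sub SliceBounds(height As Integer, sliceCount As Integer, sliceIndex As Integer, ByRef firstRow As Integer, ByRef lastRow As Integer)
  // Integer division (\) spreads any remainder rows across the slices,
  // so no row is skipped and no row is handed out twice.
  firstRow = (sliceIndex * height) \ sliceCount
  lastRow = ((sliceIndex + 1) * height) \ sliceCount - 1
End Sub
```

Each worker would then run the `For y` loop of ScalePicture from `firstRow` to `lastRow` instead of from 0 to `yMax`. Whether the cost of starting the helper apps and passing picture data back and forth outweighs the gain would have to be measured.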

Julen

Ulrich_Bogun · February 25, 2014, 2:05pm

I guess you can extract the image data if you remove the unnecessary tags. Have a look at TIFF: its image data is presented in byte form, which should be what you are looking for, or am I wrong?

Alwyn_Bester · February 25, 2014, 2:09pm

I’ll first have to test how much overhead the Picture => TIFF (do stuff) TIFF => Picture causes. If the TIFF to picture and back conversions are fast enough, then that might be a possible way to increase the processing speed.

Ulrich_Bogun · February 25, 2014, 2:17pm

Sure. If it turns out to be helpful, you could also check other uncompressed image formats. BMP is much simpler; it could very well be that the conversion saves a few msecs. And it delivers the image information in rows after a declared offset. If you skip that (can one easily move the lower border of a MemoryBlock to cut away the offset? Pushing the Ptr offset bytes further?), conversion should be quite fast (if general BMP conversion is fast, of course).

Ulrich_Bogun · February 25, 2014, 2:24pm

[quote=67634:@Alwyn Bester]I still think that

MemoryBlock = Picture.RGBSurface.GetMemoryBlock()
would be first prize. I’m sure the RGBSurface is already stored internally as an array of bytes, so if one could just somehow get access to those bytes directly, it would be easy to design speedy algorithms for picture objects.[/quote]
Seems very possible, especially when you read the definition of an RGBSurface. If only I were savvy with handling pointers, but I am sure someone else here is.

DaveS · February 25, 2014, 3:30pm

I have not analyzed the code in depth, but first thought… perhaps you can cache the c1-c4 values so as not to execute the Pixel function so many times?

Kem_Tekinay · February 25, 2014, 4:36pm

So I don’t have to analyze the code, can someone tell me what that function does that’s different than the built-in scaling offered by Graphics.DrawPicture?

Ulrich_Bogun · February 25, 2014, 4:39pm

Have a look here, Ken: scale-quality-of-canvas-control

Ulrich_Bogun · February 25, 2014, 4:40pm

Should have been Kem. Sorry!

Kem_Tekinay · February 25, 2014, 4:58pm

Darn autocorrect.

Thanks for the link, that clears it up.

Alwyn_Bester · February 25, 2014, 6:51pm

Will give it a shot, and post the results once I’ve tested it.

Richard_Herd · February 25, 2014, 9:54pm

I think your easiest solution and what’ll probably give you the fastest routine will be processing your image as an OpenGL texture. There’s nearest neighbour and bilinear sampling buried in there. Also trilinear if you want to process a series of images.

It’d be handy if the GetData routines would give you an appropriate memory block. I haven’t looked at the tiff option, but it may be fairly easy. In any case you can make your own easily enough using MemoryBlock.ColorValue(offset,32) = rgbsurface.Pixel(x,y) and increment offset by 4 for each pixel you add in. I find if I need to access each pixel more than 2 or 3 times then it’s quicker putting the picture into a memoryblock, process that, and put it back into a picture at the end.
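A minimal sketch of the copy-in, copy-out approach described above, assuming a 32-bit picture and 4 bytes per pixel. `PictureToBlock` and `BlockToPicture` are hypothetical helper names; the only framework calls used are `RGBSurface.Pixel` and `MemoryBlock.ColorValue`. The mask is not copied here, and the byte order of the channels inside each 4-byte slot is not something this sketch relies on.

```xojo
// Hypothetical helper: copies every pixel of p into a MemoryBlock,
// row by row, 4 bytes per pixel.
Function PictureToBlock(p As Picture) As MemoryBlock
  Dim s As RGBSurface = p.RGBSurface
  Dim w As Integer = p.Width
  Dim h As Integer = p.Height
  Dim mb As New MemoryBlock(w * h * 4)
  Dim offset As Integer = 0
  
  For y As Integer = 0 To h - 1
    For x As Integer = 0 To w - 1
      mb.ColorValue(offset, 32) = s.Pixel(x, y)
      offset = offset + 4 // advance to the next pixel slot
    Next
  Next
  
  Return mb
End Function

// Hypothetical helper: the reverse step, writing the processed block
// back into a new picture of the given size.
Function BlockToPicture(mb As MemoryBlock, w As Integer, h As Integer) As Picture
  Dim p As New Picture(w, h, 32)
  Dim s As RGBSurface = p.RGBSurface
  Dim offset As Integer = 0
  
  For y As Integer = 0 To h - 1
    For x As Integer = 0 To w - 1
      s.Pixel(x, y) = mb.ColorValue(offset, 32)
      offset = offset + 4
    Next
  Next
  
  Return p
End Function
```

With this layout the pixel at (x, y) lives at offset `(y * w + x) * 4`. The copy itself still calls `Pixel` once per pixel, so it should only pay off when each pixel is read several times during processing.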

If you don’t want the overhead of the whole picture put into a memory block and know how many lines of the picture you need at a time (say applying a 3x3 convolution kernel), then just read the 3 lines of the picture into 3 separate memoryblocks, process them, transfer line 2 into the old line 3 (using memoryblock.stringvalue), transfer line 1 into line 2, and now read a new line from the image into line 1, process those three lines and so on. Much quicker than repeated calls of rgbsurface.pixel(x,y) to the same pixel.
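The three-line scheme above could look roughly like the following sketch. `ReadRow` and `ProcessThreeRows` are hypothetical names, the actual kernel is left as a placeholder comment, and the row naming (line1 newest, line3 oldest) follows the description above.

```xojo
// Hypothetical helper: copies one row of pixels into a MemoryBlock,
// 4 bytes per pixel.
Sub ReadRow(s As RGBSurface, y As Integer, w As Integer, row As MemoryBlock)
  Dim offset As Integer = 0
  For x As Integer = 0 To w - 1
    row.ColorValue(offset, 32) = s.Pixel(x, y)
    offset = offset + 4
  Next
End Sub

// Hypothetical driver: walks down the picture keeping only three rows
// in memory at any time.
Sub ProcessThreeRows(p As Picture)
  Dim s As RGBSurface = p.RGBSurface
  Dim w As Integer = p.Width
  Dim h As Integer = p.Height
  If h < 3 Then Return
  
  Dim rowBytes As Integer = w * 4
  Dim line1 As New MemoryBlock(rowBytes) // newest row
  Dim line2 As New MemoryBlock(rowBytes)
  Dim line3 As New MemoryBlock(rowBytes) // oldest row
  
  // Initial fill with the first three rows of the picture.
  ReadRow(s, 0, w, line3)
  ReadRow(s, 1, w, line2)
  ReadRow(s, 2, w, line1)
  
  Dim y As Integer = 2
  Do
    // Placeholder: apply the 3x3 kernel to line3, line2 and line1 here,
    // writing the result to a separate output picture.
    
    y = y + 1
    If y > h - 1 Then Exit
    
    // Shift the rows along and read exactly one new row from the picture.
    line3.StringValue(0, rowBytes) = line2.StringValue(0, rowBytes)
    line2.StringValue(0, rowBytes) = line1.StringValue(0, rowBytes)
    ReadRow(s, y, w, line1)
  Loop
End Sub
```

Each source pixel is fetched through `Pixel` only once this way, at the cost of two row copies per step.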

Regards - Richard.

Sam_Rowlands · February 26, 2014, 1:52am

I actually found that RGBSurface is quicker than accessing each color channel of each pixel of a memory block. I can’t explain why, only that this is what I found in the past.