UTF16 to UTF8

Hello,

I have a UTF16 log file that I have to convert to UTF8. I found a routine here that works, but once the log file reaches 700k+ it really bogs down and becomes super slow. Is there a better/faster method than this? Any help would be appreciated. Here is the code:

[code]
// Turn the given string into valid UTF-8.
// Filters out invalid characters, so it might return an empty string.

if src.Encoding = nil then
  src = src.DefineEncoding( Encodings.UTF8 )
elseif src.Encoding <> Encodings.UTF8 then
  src = src.ConvertEncoding( Encodings.UTF8 )
end if

if src = "" or Encodings.UTF8.IsValidData( src ) then
  return src
end if

// If we get here, we have a non-empty string defined as UTF8, but it's not valid.
// We have to remove the invalid bytes.

dim mb as MemoryBlock = src
dim p as Ptr = mb
dim lastIndex as integer = mb.Size - 1
dim writeIndex as integer
dim readIndex as integer
while readIndex <= lastIndex
  dim thisByte as integer = p.Byte( readIndex )
  if thisByte <= &b01111111 then
    p.Byte( writeIndex ) = thisByte
    readIndex = readIndex + 1
    writeIndex = writeIndex + 1

  elseif thisByte >= &b11111110 then // Invalid byte
    readIndex = readIndex + 1

  else // It's a leading byte so figure out how many valid bytes should be in the group and check them
    dim byteCount as integer
    if thisByte >= &b11111100 then
      byteCount = 6
    elseif thisByte >= &b11111000 then
      byteCount = 5
    elseif thisByte >= &b11110000 then
      byteCount = 4
    elseif thisByte >= &b11100000 then
      byteCount = 3
    elseif thisByte >= &b11000000 then
      byteCount = 2
    else // This is an invalid byte so filter it out
      readIndex = readIndex + 1
      continue while // Skip to the next byte immediately
    end if

    // Make sure we have enough bytes to make a complete character. If not, filter this out.
    if ( readIndex + byteCount - 1 ) > lastIndex then
      readIndex = readIndex + 1
      continue while // Skip to the next byte immediately
    end if

    dim chunk as string = mb.StringValue( readIndex, byteCount )
    if Encodings.UTF8.IsValidData( chunk ) then
      mb.StringValue( writeIndex, byteCount ) = chunk
      readIndex = readIndex + byteCount
      writeIndex = writeIndex + byteCount
    else // This can't be a leading byte so let's discard it
      readIndex = readIndex + 1
    end if

  end if
wend

dim r as string
if writeIndex <> 0 then
  r = mb.StringValue( 0, writeIndex )
  r = r.DefineEncoding( Encodings.UTF8 )
end if

return r[/code]

I may be out in left field here, but is this not where you would use ConvertEncoding? That would be easier and possibly faster from what I understand.
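
As a rough sketch of what I mean (assuming src holds the raw bytes read from the UTF-16 log and has not been given an encoding yet), the string would be defined as UTF-16 first and then converted. Defining UTF-16 bytes as UTF-8 would most likely make them look like invalid UTF-8. If the file has no byte order mark, Encodings.UTF16LE or Encodings.UTF16BE may be the better choice; that depends on how the log is written, which I can't see from here.

[code]
// Sketch only: src is assumed to hold the raw bytes read from a UTF-16 log file.
if src.Encoding = nil then
  // Label the bytes as what they actually are; this does not change the bytes.
  src = src.DefineEncoding( Encodings.UTF16 )
end if

// Transcode the correctly labeled string to UTF-8.
dim utf8 as string = src.ConvertEncoding( Encodings.UTF8 )
[/code]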

Hi Louis,

I think that is what this part already does, correct?

[code]if src.Encoding = nil then
  src = src.DefineEncoding( Encodings.UTF8 )
elseif src.Encoding <> Encodings.UTF8 then
  src = src.ConvertEncoding( Encodings.UTF8 )
end if

if src = "" or Encodings.UTF8.IsValidData( src ) then
  return src
end if[/code]

I just double-checked whether it was being converted via ConvertEncoding, and it was, 100% of the time. So I am not sure why it's slowing down, since this line is returning valid UTF8: src = src.ConvertEncoding( Encodings.UTF8 )
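
One way to see where the time actually goes would be to time the conversion call by itself. This is only a minimal sketch; it assumes the classic Microseconds function and System.DebugLog are available in the Xojo version in use, and the variable names startTime and elapsedMs are made up for the example.

[code]
// Sketch only: time the ConvertEncoding call on its own.
dim startTime as double = Microseconds
src = src.ConvertEncoding( Encodings.UTF8 )

// Microseconds counts in microseconds, so divide by 1000 for milliseconds.
dim elapsedMs as double = ( Microseconds - startTime ) / 1000
System.DebugLog( "ConvertEncoding took " + Str( elapsedMs ) + " ms" )
[/code]

If that number stays small as the log grows, the slowdown is probably somewhere other than the conversion.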

I think using .ConvertEncoding is probably the best idea; however, your code can be sped up:

  • You are already using Ptr which is good
  • You should add
#pragma DisableBackgroundTasks
#pragma BoundsChecking False

since this is a very tight loop.

  • My guess is that Encodings.UTF8.IsValidData() is an expensive call - you can probably speed it up a little by caching the Encodings.UTF8 lookup in a local variable outside the loop:
dim enc as TextEncoding = Encodings.UTF8
[...]

while readIndex <= lastIndex
   if enc.IsValidData
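
For illustration only, here is how the pragmas and the cached encoding might sit together in one method. This is a stripped-down sketch, not the full filtering routine: src is assumed to be the string being checked, and invalidCount is a hypothetical counter that exists only to give the loop something to do. Whether the caching makes a measurable difference is a guess and would need timing.

[code]
// Sketch only. The pragmas go at the top of the method, before the loop.
#pragma DisableBackgroundTasks
#pragma BoundsChecking False

// Look up the UTF-8 encoding object once, outside the loop.
dim enc as TextEncoding = Encodings.UTF8

dim mb as MemoryBlock = src // src: the string being checked (assumed)
dim lastIndex as integer = mb.Size - 1
dim readIndex as integer
dim invalidCount as integer // hypothetical, for illustration

while readIndex <= lastIndex
  // Use the cached enc here instead of Encodings.UTF8 on every pass.
  if not enc.IsValidData( mb.StringValue( readIndex, 1 ) ) then
    invalidCount = invalidCount + 1
  end if
  readIndex = readIndex + 1
wend
[/code]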