making split escape character aware.

Brian_O_Brien · September 27, 2019, 9:40pm

I have a message that looks like
ABC|123|DEF|

That message uses the | as a separator.
So this would be:
ABC,123,DEF

However, a special character \ is used as an escape, so that a literal | can appear inside a field.

eg:

ABC|\|23|DEF|
would split into ABC,|23,DEF

So how do I make split aware of this special character?

DaveS · September 27, 2019, 11:47pm

sounds like a job for Superman! … uh… I mean Kem

Brian_O_Brien · September 28, 2019, 1:26am

Kem… That’s almost a regular expression around here!

Robert_Weaver · September 28, 2019, 2:39am

You will, of course, also have to watch out for the situation where the data contains an intentional \ character as data, which means that you’ll also have to escape it with a \ (hence \\). And so opens the can of worms.

Kem_Tekinay · September 28, 2019, 7:00pm

Assuming the “\” will escape anything following it, the fields always end with a bar, even the last one, and EOLs are not a factor…

((?:\\.|[^|])*)\|

The field itself will be in SubExpressionString( 1 ).
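
For anyone wanting to try this, here is a minimal sketch of the loop around that pattern. The function name SplitWithRegEx is made up for illustration, the pattern is written with single backslash escapes, and it assumes every field (including the last) ends with a bar. Note that SubExpressionString( 1 ) still contains the escape characters, so the sketch strips them with a second RegEx; that unescaping step is my assumption about what the OP wants, not part of the pattern above.

```xojo
Function SplitWithRegEx(source As String) As String()
  // Finds one field at a time: group 1 is the field, followed by its closing bar
  Dim rx As New RegEx
  rx.SearchPattern = "((?:\\.|[^|])*)\|"

  // Assumption: an escaped character should lose its leading backslash
  Dim unescape As New RegEx
  unescape.SearchPattern = "\\(.)"
  unescape.ReplacementPattern = "$1"
  unescape.Options.ReplaceAllMatches = True

  Dim fields() As String
  Dim match As RegExMatch = rx.Search(source)
  While match <> Nil
    fields.Append unescape.Replace(match.SubExpressionString(1))
    match = rx.Search // continue from the end of the previous match
  Wend

  Return fields
End Function
```

Any text after the final bar is ignored by this sketch, because the pattern only matches fields that are terminated by a bar.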

Douglas_Handy · September 28, 2019, 8:09pm

Of course, the next problem for the OP is that Split(), while seemingly very efficient, does not support a RegEx as the delimiter. That leaves a self-coded loop, which is presumably less efficient on large strings than Split().

Whether or not that is significant to overall performance is likely to be data dependent.

Possibly more efficient is to still use the original Split() but then check array entries for “\”, such as looping on IndexOf("\"), then adjusting the array contents.

DaveS · September 28, 2019, 8:14pm

s="ABC|\|23|DEF|"
s=replaceall(s,"|",chr(9))  // replace every split char with a TAB
s=replaceall(s,"\"+chr(9),"|") // restore "escaped" split chars that should not have been changed
s=replaceall(s,"\\","\") // unescape escaped backslashes
v=split(s,chr(9)) // split the string

brute force, but it should do the trick

Kem_Tekinay · September 28, 2019, 8:47pm

The problem with that approach is, what if the data contains this?

data\\\\|more data|

Unlikely, but that’s why regular expressions are preferable as they can be crafted to intelligently process the stream.

However, Douglas’ point is valid. If using the native RegEx on a sizable string, the performance will be abysmal. The solution is to use an alternative like the MBS plugin, or create a method around a MemoryBlock, which would be blazing.

DaveS · September 28, 2019, 8:54pm

move the last replace to become the first action… and it should work just fine

but I understand RegEx would be preferred, after all I was the one who suggested it

Robert_Weaver · September 28, 2019, 10:01pm

If the strings are very large then it’s still possible to use the split function to preprocess the data, but it can get a bit messy, and may not be worth the effort. Basically, you use the split function with the escape character as the delimiter so that you can skip large chunks of text that don’t have any escaping. If the resulting array has only one element, then there are no escaped characters, and you’re done. Otherwise, the first element won’t have any escaped characters, and the remaining elements will have to have the escape character concatenated back onto the front and then can be processed. The occurrence of multiple adjacent \ characters will result in empty array elements which then require special processing (for example, ‘\\\\\|’). It’s not quite as bad as it may appear, because all of the processing involves only the first couple of characters of each array element. I’ve used this method in the past when processing very large chunks of text, but wouldn’t necessarily recommend it unless speed really does prove to be a problem.

Brian_O_Brien · September 29, 2019, 2:42am

Well I gave a stab at it and I think I have something that works. It’s weird but it works.
Given this example:
ABC|\|23|DEF|
would split into:
ABC,|23,DEF

However if I understood Kem then this might not work for |ABC\\|XXX|
Which should parse as ABC\,XXX

My thought was to find a character that is not used anywhere in the string and replace all “\|” with “\”+chr(unused) …

dim x as string = "ABC|\\|EF|GHI"

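[code]Public Function split(extends s as string, sep as string, esc as string) as string()
dim tmp as string
dim lst(-1) as string

// Pick a placeholder character that does not occur in s
tmp = s.findUnused(sep)
// Swap each escaped separator (esc+sep) for the placeholder
s = s.replaceAll( esc+sep, tmp)

lst = s.Split(sep)

dim i as integer
dim ts as string

// Turn the placeholder back into a literal separator in each field
for i=0 to lst.Ubound
ts = lst(i)
lst(i) = ts.replaceAll(tmp, sep)
next

return lst

End Function
[/code]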

//This method tries to find an unused character to replace an instance of Separator with.

[code]Public Function findUnused(extends src as string, Separator as String) as string
dim anArray(255) as Boolean // 0 to 255 = 256 = 2^8
dim c as uint8
dim idx as integer
dim s as string

//Start with every byte value [0,1,2,3…255] marked as unused
for idx=0 to 255
anArray(idx) = false
next

//Mark every byte value that occurs in the source string
for idx=1 to LenB(src)
s = MidB(src, idx, 1)
c = AscB(s)
anArray(c) = true
next

// The Separator itself must count.
anArray(asc(Separator)) = true

//Find 1st false; test the bound first so idx never runs past the array
idx = 0
while idx <= 255 and anArray(idx)
idx = idx + 1
wend
if idx <> 256 then
return(chr(idx))
end if
return "" //Inescapable.

End Function
[/code]

Robert_Weaver · September 29, 2019, 10:50am

The problem with the findUnused function is that it steps through your entire input text character by character. If you are concerned about speed with large input strings, then this will be a bottleneck. Rather than stepping through the input text, it would be faster to step through the characters starting at one and seeing if it’s in the input text using the instr function like so:

for i = 1 to 127
  if instr(src,chr(i))=0 then
     'Found an unused character, so exit
     return chr(i)
  end if
next
return ""

Notice that the code doesn’t attempt to check non ASCII codepoints 128…255 because these could cause problems with encoding, and are best avoided.

However, you still have the problem that your main routine will fail with Kem’s |ABC\\|XXX| example string.

I’ve got some code that uses the split function, first on the escape character and then on the delimiter character to sort out this kind of escaping. It avoids the problem of changing one special character into another (ie., changing one problem into another). I’ll dig it out and post it later.

Russ_Lunn · September 29, 2019, 1:57pm

Are the ascii 28-31 not perfect for this kind of thing?
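
For reference, ASCII 28 to 31 are the FS, GS, RS and US control codes (file, group, record and unit separator). A rough sketch of how two of them could serve as placeholders follows. The function name SplitWithControlChars is hypothetical, and the whole thing rests on the assumption that Chr(30) and Chr(31) never occur in the data. Replacing escaped escape characters before escaped separators is what lets it cope with text like |ABC\\|XXX|.

```xojo
Function SplitWithControlChars(s As String, sep As String, esc As String) As String()
  // Assumption: neither placeholder appears anywhere in the input
  Dim escMark As String = Chr(30) // stands in for an escaped escape character
  Dim sepMark As String = Chr(31) // stands in for an escaped separator

  // Order matters: pair up escape characters first, so that "\\|" is read
  // as an escaped "\" followed by a real separator
  s = ReplaceAll(s, esc + esc, escMark)
  s = ReplaceAll(s, esc + sep, sepMark)

  Dim fields() As String = Split(s, sep)

  // Put the literal characters back into each field
  For i As Integer = 0 To fields.Ubound
    fields(i) = ReplaceAll(fields(i), sepMark, sep)
    fields(i) = ReplaceAll(fields(i), escMark, esc)
  Next

  Return fields
End Function
```

Like Split(), this returns an empty last element when the text ends with a separator, and an escape character in front of any other character is left in place.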

Robert_Weaver · September 30, 2019, 12:20am

Those should work.
I’ve used the replacement character method in the past, but I try to avoid it if possible, because it always seems to come back and bite me when I make a program change.

I found that using the split function to break the text at escape character locations allows the program to skip over large chunks of text, and avoid slow character by character processing. The escaped characters can be handled and then the resulting text can be split with the true delimiter to finish the processing. Replacement of characters is not required, because the text is processed directly. The following is adapted from one of my projects. Hopefully, I didn’t break anything when I edited it.

Public Function SplitEsc(rawTxt As String, delimiter As String, escape As String) as String()
  'Step 1: Use the split function to locate escape characters, if any.
  dim txtChunk() As String = split(rawTxt,escape)
  'fix null string case where Split() creates an empty array
  if txtChunk.Ubound<0 then txtChunk.Append("")
  if txtChunk.Ubound=0 then
    'No escape characters in the text, so we are done
    return Split(txtChunk(0),delimiter)
  end if
  
  'Process the first chunk of text which ends just before the first escape character
  dim txtOut() As string = Split(txtChunk(0),delimiter)
  if txtOut.Ubound<0 then txtOut.Append("")
  dim parity As Integer = 0 'This keeps track of multiple consecutive escape characters
  
  'Process each subsequent escape character
  for i as Integer = 1 to txtChunk.Ubound
    'Handle escaped character
    if txtChunk(i)="" then
      'This is an escaped escape character
      parity = if(parity=0,1,-parity)
      if parity=1 and i<txtChunk.Ubound then
        'Append the escape char unless this is the last element in the array
        txtOut(txtOut.Ubound)=txtOut(txtOut.Ubound)+escape
      end if
    Else
      dim bIndex As Integer = 0
      If parity<1 and Left(txtChunk(i),1)=delimiter then
        'This is an escaped delimiter character and must be appended to the output
        txtOut(txtOut.Ubound)=txtOut(txtOut.Ubound)+delimiter
        bIndex = 1
        'If any other special escaped characters need to be handled,
        'their code should be placed here in ElseIf sections.
      end If
      parity = 0
      
      'Step 2: Now split the delimited text and append to the output array
      dim dataField() As String = Split(txtChunk(i),delimiter)
      if dataField.Ubound<0 then dataField.Append("")
      for j as Integer = bIndex to dataField.Ubound-1
        txtOut(txtOut.Ubound) = txtOut(txtOut.Ubound)+dataField(j)
        txtOut.Append("")
      next
      txtOut(txtOut.Ubound) = txtOut(txtOut.Ubound)+dataField(dataField.Ubound)
    end if
  next
  
  return txtOut
End Function

This correctly handles the |ABC\\|XXX| problem text.

Kem_Tekinay · September 30, 2019, 2:37pm

Since I mentioned MemoryBlocks, and I’ll be giving a talk on those at Xojo.Connect 2020, here is some code that uses MemoryBlocks with a pre-dimmed array. For 20k fields, this takes about 15 ms here.

Public Function SplitByDelimiter(s As String, delimiter As String, escapeChar As String = "\") as String()
  s = s.ConvertEncoding( Encodings.UTF8 )
  delimiter = delimiter.ConvertEncoding( Encodings.UTF8 )
  escapeChar = escapeChar.ConvertEncoding( Encodings.UTF8 )
  
  if delimiter.LenB <> 1 or escapeChar.LenB > 1 then
    dim err as new RuntimeException
    err.Message = "Improper delimiter or escape character"
    raise err
  end if
  
  dim mbIn as MemoryBlock = s
  dim pIn as Ptr = mbIn
  
  dim mbOut as new MemoryBlock( mbIn.Size )
  dim pOut as Ptr = mbOut
  
  dim delimCode as integer = delimiter.Asc
  dim hasEscape as boolean = escapeChar <> ""
  dim escapeCode as integer
  if hasEscape then
    escapeCode = escapeChar.Asc
  end if
  
  dim fieldLen as integer
  dim byteIndex as integer
  dim addThisChar as boolean
  dim fields( 1000 ) as string
  dim fieldUb as integer = -1
  
  while byteIndex < mbIn.Size
    dim thisByte as integer = pIn.Byte( byteIndex )
    
    if addThisChar then
      pOut.Byte( fieldLen ) = thisByte
      fieldLen = fieldLen + 1
      addThisChar = false
      
    elseif hasEscape and thisByte = escapeCode then
      addThisChar = true
      
    elseif thisByte = delimCode then
      fieldUb = fieldUb + 1
      if fields.Ubound < fieldUb then
        redim fields( fieldUb * 2 )
      end if
      
      if fieldLen <> 0 then
        fields( fieldUb ) = mbOut.StringValue( 0, fieldLen ).DefineEncoding( Encodings.UTF8 )
        fieldLen = 0
      else
        fields( fieldUb ) = ""
      end if
      
    else
      pOut.Byte( fieldLen ) = thisByte
      fieldLen = fieldLen + 1
      
    end if
    
    byteIndex = byteIndex + 1
  wend
  
  redim fields( fieldUb )
  
  if fieldLen <> 0 then
    fields.Append mbOut.StringValue( 0, fieldLen ).DefineEncoding( Encodings.UTF8 )
  end if
  
  return fields
End Function
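
As a quick illustration of how the function above might be called, here are the two sample strings from earlier in the thread, relying on the default "\" escape character. The results in the comments come from tracing the code by hand, not from a test run.

```xojo
Dim fields() As String = SplitByDelimiter("ABC|\|23|DEF|", "|")
// fields holds three entries: ABC, |23 and DEF

fields = SplitByDelimiter("|ABC\\|XXX|", "|")
// fields holds three entries: an empty string, ABC\ and XXX
```

Unlike Split(), a trailing delimiter does not produce an extra empty element here, because a field is only appended after the loop when leftover bytes remain.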

Brian_O_Brien · September 30, 2019, 7:45pm

Thanks a million as always… always more help than I know what to do with!!!