Slash encoding, safest way to handle it?

Sam_Rowlands · August 5, 2015, 2:48am

So I’ve got this data, right, and it’s slash encoded, a bit like a shellPath, except it’s not a shellPath. What’s the safest way of cleaning it up? Obviously I don’t want to accidentally remove the slashes that were put there by the user.

I was thinking to replaceAll double slashes with something else and then simply replaceAll slashes with “” and then replace the something else with a single slash. Ideas anyone?

Kem_Tekinay · August 5, 2015, 2:57am

Regular expression?

Kem_Tekinay · August 5, 2015, 3:16am

Something like this?

dim rx as new RegEx
rx.SearchPattern = "(?s)\\(.)"
rx.ReplacementPattern = "$1"
rx.Options.ReplaceAllMatches = true

s = rx.Replace( s )

Sam_Rowlands · August 5, 2015, 8:15am

Thanks Kem, I’ll give it a try a little later on… Too many other things to do today

Thomas_Tempelmann · August 5, 2015, 9:06am

Sam, I’ve used your trick often. I would find a character that’s surely not used in the text, e.g. Chr(1), and replace all “//” to Chr(1), then do the Replace for single slashes and in the end Chr(1) to “/”, just like you suggest.
Kem’s solution is possibly still not safe as it may not handle double slashes correctly (apart from the fact that he’s using backslashes).

Kem_Tekinay · August 5, 2015, 2:27pm

In what case might it fail?

Joe_Ranieri · August 5, 2015, 2:35pm

I’d just iterate over the text and create a new string from the output. Conceptually it’d be this:

Function Unescape(myStr As String) As String
  Dim output As String
  For i As Integer = 1 To myStr.Len
    Dim ch As String = myStr.Mid(i, 1)
    If ch = "\" Then
      Dim nextCh As String = myStr.Mid(i + 1, 1)
      output = output + nextCh
      i = i + 1
    Else
      output = output + ch
    End If
  Next
  Return output
End Function

It can be optimized as needed.

Kem_Tekinay · August 5, 2015, 4:14pm

I just tested a bunch of techniques for performance. If you only want pure Xojo code (no plug-ins), the ReplaceAll technique is fastest, but the fastest overall is a regular expression using RegExMBS.

Here is the code for each:

  // Find a character that does not occur in s to use as a placeholder
  dim replacementCharCode as integer = 1
  while s.InStr( chr( replacementCharCode ) ) <> 0
    replacementCharCode = replacementCharCode + 1
  wend
  
  dim replacementChar as string = chr( replacementCharCode )
  
  // Protect escaped backslashes, strip the escape characters, then restore
  s = s.ReplaceAll( "\\", replacementChar )
  s = s.ReplaceAll( "\", "" )
  s = s.ReplaceAll( replacementChar, "\" )

  dim rx as new RegExMBS
  
  rx.CompileOptionCaseLess = True
  rx.CompileOptionDotAll = False
  rx.CompileOptionUngreedy = False
  rx.CompileOptionNewLineAnyCRLF = True
  rx.ExecuteOptionNotEmpty = False
  rx.CompileOptionMultiline = true
  rx.CompileOptionNoUTF8Check = true
  rx.CompileOptionUTF8 = true
  
  call rx.Compile( "(?s)\\(.)" )
  s = rx.ReplaceAll( s, "\1" )

On a string that started with 34,000 characters peppered with slashes, the fastest, RegExMBS, took about 800 microsecs. The slowest, using two arrays and cycling through them, took about 16,000 microsecs. The replace code posted took about 2,500 microsecs.
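
For anyone who wants to repeat this kind of measurement, here is a rough sketch of a timing wrapper. It is only an assumption about how such numbers could be collected, not the harness used for the figures above; the sample string and the variable names are made up, and the single ReplaceAll call is just a stand-in for whichever technique is being timed.

```xojo
// Made-up sample input; substitute the real data here
Dim s As String = "a\\b\c"

// Microseconds returns the time since system start as a Double
Dim startTime As Double = Microseconds

// Stand-in for the technique being timed
s = s.ReplaceAll( "\", "" )

Dim elapsed As Double = Microseconds - startTime
System.DebugLog( "Took " + Format( elapsed, "#,##0" ) + " microsecs" )
```

A single run on a short string will be dominated by noise, so looping over the call many times and averaging should give steadier numbers.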

Sam_Rowlands · August 6, 2015, 2:27am

Excellent… You guys are awesome. I’ve done some more digging into the document data format and figured out that it’s the same as C, so therefore I’ve found a conversion table.

The ironic thing is that the data is in UTF-16, so I don’t fully understand why they need to use a second encoding schema, maybe for backwards compatibility I guess.
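
If the format really does follow C's escape rules, then simply dropping the backslash is not enough: a sequence such as \n or \t stands for a control character, not for the letter after the slash. Below is a minimal sketch of a table-driven version of the loop approach. UnescapeC is a hypothetical name, and only a handful of escapes are covered; octal, hex and \u forms are left out and would need the full conversion table.

```xojo
Function UnescapeC(src As String) As String
  // Hypothetical helper: decodes a small subset of C escape sequences
  Dim output As String
  Dim i As Integer = 1
  While i <= src.Len
    Dim ch As String = src.Mid(i, 1)
    If ch = "\" And i < src.Len Then
      // Look at the character after the backslash
      i = i + 1
      Dim code As String = src.Mid(i, 1)
      // Compare by character code, since = on strings ignores case
      Select Case Asc(code)
      Case 110 // n: line feed
        output = output + Chr(10)
      Case 116 // t: tab
        output = output + Chr(9)
      Case 114 // r: carriage return
        output = output + Chr(13)
      Case Else
        // \\, \", \' and anything unrecognised: keep the character itself
        output = output + code
      End Select
    Else
      // Ordinary character (or a lone trailing backslash): copy as-is
      output = output + ch
    End If
    i = i + 1
  Wend
  Return output
End Function
```

With this sketch, the four characters a\tb come out as a, a tab, and b, while a doubled backslash collapses to a single one.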