So I’ve got this data, right, and it’s slash-encoded, a bit like a shellPath, except it’s not a shellPath. What’s the safest way of cleaning it up? Obviously I don’t want to accidentally remove the slashes that were put there by the user.
I was thinking to replaceAll double slashes with something else and then simply replaceAll slashes with “” and then replace the something else with a single slash. Ideas anyone?
Something like this?
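dim rx as new RegEx
// One backslash (escaped for the regex engine) followed by any character, newlines included.
// Xojo string literals do not treat "\" as an escape, so "\\" here is regex syntax only.
rx.SearchPattern = "(?s)\\(.)"
rx.ReplacementPattern = "$1" // keep only the character that followed the backslash
rx.Options.ReplaceAllMatches = true
s = rx.Replace( s )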
Thanks Kem, I’ll give it a try a little later on… Too many other things to do today
Sam, I’ve used your trick often. I would find a character that’s surely not used in the text, e.g. Chr(1), and replace all “//” to Chr(1), then do the Replace for single slashes and in the end Chr(1) to “/”, just like you suggest.
Kem’s solution is possibly still not safe as it may not handle double slashes correctly (apart from the fact that he’s using backslashes).
In what case might it fail?
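One way to check is to run the pattern over a string that contains both an escaped backslash and an escaped slash. This is only a sketch: the sample string is made up, and it assumes the pattern is meant to match a single backslash followed by any character, written as \\ in regex syntax (Xojo string literals themselves don't treat the backslash as an escape).

```xojo
// Hypothetical sample: a, escaped backslash, b, escaped slash, c
Dim s As String = "a\\b\/c"

Dim rx As New RegEx
rx.SearchPattern = "(?s)\\(.)"   // one backslash, then any character
rx.ReplacementPattern = "$1"     // keep only the character after the backslash
rx.Options.ReplaceAllMatches = True

Dim result As String = rx.Replace(s)
// result is now a\b/c : the doubled backslash collapsed to one,
// and the backslash in front of the slash was dropped.
```

Because the match consumes the backslash together with the character after it, the second backslash of a pair is never treated as the start of a new escape.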
I’d just iterate over the text and create a new string from the output. Conceptually it’d be this:
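Function Unescape(myStr As String) As String
  Dim output As String
  For i As Integer = 1 To myStr.Len
    Dim ch As String = myStr.Mid(i, 1)
    If ch = "\" Then
      // Escape character: copy the character after it verbatim
      // (an empty string if the backslash is the last character).
      Dim nextCh As String = myStr.Mid(i + 1, 1)
      output = output + nextCh
      // Skip the escaped character so it is not processed again.
      i = i + 1
    Else
      output = output + ch
    End If
  Next
  Return output
End Function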
It can be optimized as needed.
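As one possible optimization (just a sketch, not benchmarked here), the function could jump from backslash to backslash with InStr and collect the pieces in an array, joining them once at the end instead of growing the string a character at a time. The name UnescapeFast is made up for illustration.

```xojo
Function UnescapeFast(myStr As String) As String
  // Collect the pieces and join once at the end.
  Dim parts() As String
  Dim pos As Integer = 1
  Dim total As Integer = myStr.Len

  While pos <= total
    // Position of the next backslash at or after pos (0 if none).
    Dim hit As Integer = myStr.InStr(pos, "\")
    If hit = 0 Then
      // No more backslashes: keep the rest as-is and stop.
      parts.Append(myStr.Mid(pos))
      pos = total + 1
    Else
      // Text before the backslash, then the escaped character itself.
      parts.Append(myStr.Mid(pos, hit - pos))
      parts.Append(myStr.Mid(hit + 1, 1))
      pos = hit + 2
    End If
  Wend

  Return Join(parts, "")
End Function
```

As with the loop version, a lone backslash at the very end of the string is simply dropped.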
I just tested a bunch of techniques for performance. If you only want pure Xojo code (no plug-ins), the ReplaceAll technique is fastest, but the fastest overall is a regular expression using RegExMBS.
Here is the code for each:
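// Pure Xojo: ReplaceAll with a temporary placeholder character
// Find a character that does not occur anywhere in s.
dim replacementCharCode as integer = 1
while s.InStr( chr( replacementCharCode ) ) <> 0
  replacementCharCode = replacementCharCode + 1
wend
dim replacementChar as string = chr( replacementCharCode )
// Protect escaped backslashes, drop the remaining escape characters, then restore.
s = s.ReplaceAll( "\\", replacementChar )
s = s.ReplaceAll( "\", "" )
s = s.ReplaceAll( replacementChar, "\" )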
// Plug-in version: regular expression with RegExMBS
dim rx as new RegExMBS
rx.CompileOptionCaseLess = True
rx.CompileOptionDotAll = False
rx.CompileOptionUngreedy = False
rx.CompileOptionNewLineAnyCRLF = True
rx.ExecuteOptionNotEmpty = False
rx.CompileOptionMultiline = true
rx.CompileOptionNoUTF8Check = true
rx.CompileOptionUTF8 = true
call rx.Compile( "(?s)\\(.)" )
s = rx.ReplaceAll( s, "\\1" )
On a string that started with 34,000 characters peppered with slashes, the fastest, RegExMBS, took about 800 microsecs. The slowest, using two arrays and cycling through them, took about 16,000 microsecs. The replace code posted took about 2,500 microsecs.
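For anyone wanting to repeat this kind of measurement, a rough timing harness might look like the sketch below. The test string is made up (it is not the data used for the numbers above), and it times the Unescape function posted earlier in the thread; swap in whichever technique you want to compare.

```xojo
// Hypothetical test data: repeat a short escaped fragment to get a long string.
Dim pieces() As String
For n As Integer = 1 To 5000
  pieces.Append("ab\/cd\\")
Next
Dim s As String = Join(pieces, "")

// Time a single run using the Microseconds counter.
Dim startTime As Double = Microseconds
Dim result As String = Unescape(s)
Dim elapsed As Double = Microseconds - startTime

System.DebugLog("Unescape took " + Str(elapsed) + " microsecs")
```

A single run is noisy, so averaging several runs would give a fairer comparison.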
Excellent… You guys are awesome. I’ve done some more digging into the document data format and figured out that it’s the same as C, so I’ve found a conversion table.
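A table-driven version for C-style escapes could be sketched like this. It assumes only the standard single-character C escapes (\n, \t and friends); octal, hex and Unicode escapes are left out, and the actual conversion table mentioned above isn't shown in the thread, so treat the mapping as an assumption. The function name UnescapeC is made up.

```xojo
Function UnescapeC(src As String) As String
  Dim parts() As String
  Dim i As Integer = 1
  Dim total As Integer = src.Len

  While i <= total
    Dim ch As String = src.Mid(i, 1)
    If ch = "\" And i < total Then
      Dim code As String = src.Mid(i + 1, 1)
      // Compare character codes: Xojo string comparison ignores case,
      // so "n" would also match "N".
      Select Case Asc(code)
      Case 97  // a: alert (bell)
        parts.Append(Chr(7))
      Case 98  // b: backspace
        parts.Append(Chr(8))
      Case 102 // f: form feed
        parts.Append(Chr(12))
      Case 110 // n: line feed
        parts.Append(Chr(10))
      Case 114 // r: carriage return
        parts.Append(Chr(13))
      Case 116 // t: horizontal tab
        parts.Append(Chr(9))
      Case 118 // v: vertical tab
        parts.Append(Chr(11))
      Else
        // Backslash, quotes, slash, etc.: keep the character itself.
        parts.Append(code)
      End Select
      i = i + 2
    Else
      // Ordinary character, or a lone backslash at the very end.
      parts.Append(ch)
      i = i + 1
    End If
  Wend

  Return Join(parts, "")
End Function
```

Unlike the simpler versions earlier in the thread, this one keeps a lone trailing backslash rather than dropping it.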
The ironic thing is that the data is in UTF-16, so I don’t fully understand why they need to use a second encoding schema, maybe for backwards compatibility I guess.