An odd shell result / regex failure

I’m reading a shell result and running a regex over it to extract values. This has worked on hundreds of files, but for this one file the regex can’t find anything in the result. I copied the result out of the debugger, pasted it into BBEdit, and ran the same regex patterns there; BBEdit finds everything. I’ve tried converting the encoding to UTF-8 and replacing line endings, but still nothing. I finally tried something as simple as

rg.SearchPattern = "(a)"

and it still doesn’t match anything. It seems like the shell result for this one particular file is causing the regex to not match anything. Any ideas what could cause that?

Not without seeing the data from the shell.

Capture the output directly to a file and post it here?


The regex appears to be failing because there’s a character with code 65533 in the string. Interestingly, BBEdit ignores it. Now I’m trying to figure out how to get rid of it. This didn’t work:

ret = ret.ReplaceAll(Chr(65533), "")

I split the string and added the individual Asc values to an array, and didn’t end up with a 65533. But I did get a 1835008, which I’m guessing is maybe two 65533 chars together? But this also doesn’t work:

ret = ret.ReplaceAll(Chr(1835008), "")

So I worked around all that by just going through an array of the characters and removing anything with an Asc value greater than 60000. That limit is just a guess; hopefully there’s nothing usable above 60000.

arr = ret.Split("") // split into individual characters
li = arr.LastIndex
For i = li DownTo 0 // walk backwards so RemoveAt doesn't shift unvisited indexes
  If Asc(arr(i)) > 60000 Then
    arr.RemoveAt(i)
  End If
Next
ret = String.FromArray(arr, "")

This works, at least for this problematic file, but I’d be curious about the best way to clean a string.
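For comparison: in environments where the string is properly decoded, the usual fix is to target U+FFFD itself rather than guessing a numeric cutoff. A quick Python sketch, illustrative only since the project here is Xojo, with a sample string modeled on the shell output:

```python
# Strip the Unicode replacement character (U+FFFD) from a decoded string,
# rather than removing everything above an arbitrary code point.
s = "Language 'en' not found, using '\ufffd\ufffd' instead"
cleaned = s.replace("\ufffd", "")
print(cleaned)  # → Language 'en' not found, using '' instead
```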

Decimal 65533 is hex FFFD, which is the Unicode replacement character: a decoder inserts it when it encounters bytes that aren’t valid in the expected encoding. Your string back from the shell clearly contains some rubbish. Perhaps you need to define the encoding of the string returned by the shell. I don’t know whether that would be ASCII or UTF-8; others may know, or perhaps the docs for the shell command will tell you.

ASCII, BTW, goes from 0 to 127 decimal.
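To illustrate where those 65533s come from, here’s a sketch in Python (the sample bytes are hypothetical): when a decoder hits bytes that aren’t valid in the expected encoding, it substitutes U+FFFD for each bad byte, and 0xFF can never appear in well-formed UTF-8.

```python
# Invalid UTF-8 bytes become U+FFFD when decoded with replacement.
raw = b"Languages available: \xff\xff"  # hypothetical shell output bytes
decoded = raw.decode("utf-8", errors="replace")
# each invalid byte was replaced with U+FFFD
print(decoded.count("\ufffd"))  # → 2
```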

I had not tried defining the encoding as ASCII. I still get the bad characters, but these don’t break the regex. In this case, the characters represent some non-English comments added by the shell command in what appear to be relatively rare circumstances. These comments aren’t needed in my case, so I can remove or ignore them. The command documentation only covers input.

Should Xojo’s regex break with these UTF-8 replacement characters? BBEdit’s regex doesn’t seem to care.

I’d bet the file is set to UTF-8 in BBEdit, though. If you don’t set it in Xojo, the encoding will be nil and RegEx will have problems.


The regex is searching shell output. To test in BBEdit, I’m copying the string out of the debugger and pasting it into BBEdit. Both BBEdit and the debugger say the string is UTF-8. But with these characters in the string, Xojo’s regex stops working while BBEdit’s regex keeps going.

Because your OS is assuming its encoding. :wink:

@Scott_Griffitts have you tried it with something like

  Var tis As TextInputStream = TextInputStream.Open(file) // file is a FolderItem
  tis.Encoding = Encodings.UTF8
  Var s As String = tis.ReadAll

or

Var s As String = tis.ReadAll(Encodings.UTF8)

There is no TextInputStream. This is a shell command result.

sh.Execute(cmd)
ret = sh.Result.DefineEncoding(Encodings.UTF8)

This will give me a string which includes characters like this (I don’t know whether they’ll survive being pasted into the forum):

Language ‘en’ not found, using ‘��’ instead
Languages available: ��

Because of these characters, Xojo’s regex will not match anything. Something as simple as this pattern on this string will not work:

rg.SearchPattern = "not"

If I copy the string out of the debugger and paste it into BBEdit, its regex will happily search and match.

These characters generated by the shell are not something I need. The rest of the result contains valid information. So I work around them with:

arr = ret.Split("") // split into individual characters
li = arr.LastIndex
For i = li DownTo 0 // walk backwards so RemoveAt doesn't shift unvisited indexes
  If Asc(arr(i)) > 60000 Then
    arr.RemoveAt(i)
  End If
Next
ret = String.FromArray(arr, "")

The 60000 number is arbitrary and I could probably get away with just looking for 1835008.

The main issue at this point, for me at least, is that Xojo’s regex breaks on a string that contains characters it apparently thinks are invalid. Is this a bug or by design? As BBEdit has shown, it is possible to search this string.
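For what it’s worth, a string that genuinely contains U+FFFD is valid Unicode text, and other regex engines search it without complaint. A quick Python check (illustrative only; my code is Xojo), using a string modeled on the shell output:

```python
import re

# U+FFFD is an ordinary, valid character as far as regex is concerned.
s = "Language 'en' not found, using '\ufffd\ufffd' instead"
m = re.search("not", s)
print(m.group(0))  # → not
```

Which makes me suspect the string Xojo is choking on contains raw invalid bytes that are merely labeled UTF-8, rather than true U+FFFD characters.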

Ok, so those characters with the ? inside them are an indication that the encoding you chose is incorrect. Put a breakpoint in and check the encoding of that string before you call DefineEncoding and see what it is.

Just out of curiosity, what platform are you running your app on?

sh.Execute(cmd)
ret = sh.Result
break

Debugger says Encoding: UTF-8

This is on a Mac.

Are you able to share the command you’re running so someone could reproduce your results? It’s possible there’s an explanation for why you’re getting these characters.

It would be hard to replicate, but it’s basically the HandBrake CLI, which is in turn referencing libdvdnav (there might be some mishandled encoding between the two), and then you’d need to give that an .iso file that just happens to trigger the complaints about missing languages. The result has all the necessary info; I just need to jump through some hoops to get the regex to parse it.

It is highly likely that the text is not UTF8. Probably MacRoman. The debugger is making an assumption here and is guessing wrong.


Defining the encoding as MacRoman returns results like the ones I got with ASCII. It doesn’t break the regex, at least with the .iso I’m testing, but it also just returns gibberish characters. Out of curiosity, I took this string:

Languages available: ��

and defined the encoding for all the encodings and then searched it with this pattern:

rg.SearchPattern = "Languages available: (.+)"

If the RegExMatch was Nil, I returned “NO MATCH.” Some, like the symbol encodings, should probably be expected not to match. Results:

ASCII: ��
DOSArabic: ??
DOSBalticRim:
DOSCanadianFrench:
DOSChineseSimplif: ??
DOSChineseTrad: ??
DOSCyrillic:
DOSGreek:
DOSGreek1:
DOSGreek2:
DOSHebrew:
DOSIcelandic:
DOSJapanese: ??
DOSKorean: ??
DOSLatin1:
DOSLatin2:
DOSLatinUS:
DOSNordic:
DOSPortuguese:
DOSRussian:
DOSThai: 
DOSTurkish:
ISOLatin1: ÿÿ
ISOLatin2: ˙˙
ISOLatin3: ˙˙
ISOLatin4: ˙˙
ISOLatin5: ÿÿ
ISOLatin6: ĸĸ
ISOLatin7: ’’
ISOLatin8: NO MATCH
ISOLatin9: ÿÿ
ISOLatinArabic: ??
ISOLatinCyrillic: џџ
ISOLatinGreek: ??
ISOLatinHebrew: ??
KOI8_R: ЪЪ
MacArabic: NO MATCH
MacArmenian: NO MATCH
MacBengali: NO MATCH
MacBurmese: NO MATCH
MacCeltic: ẃẃ
MacCentralEurRoman: ˇˇ
MacChineseSimp: ……
MacChineseTrad: ……
MacCroatian: ˇˇ
MacCyrillic: €€
MacDevanagari: ??
MacDingbats: NO MATCH
MacEthiopic: NO MATCH
MacExtArabic: NO MATCH
MacGaelic: ẃẃ
MacGeorgian: NO MATCH
MacGreek: ­­
MacGujarati: ??
MacGurmukhi: ??
MacHebrew: NO MATCH
MacIcelandic: ˇˇ
MacJapanese: ……
MacKannada: NO MATCH
MacKhmer: NO MATCH
MacKorean: ……
MacLaotian: NO MATCH
MacMalayalam: NO MATCH
MacMongolian: NO MATCH
MacOriya: NO MATCH
MacRoman: ˇˇ
MacRomanian: ˇˇ
MacRomanLatin1: ÿÿ
MacSinhalese: NO MATCH
MacSymbol: NO MATCH
MacTamil: NO MATCH
MacTelugu: NO MATCH
MacThai: ??
MacTibetan: ??
MacTurkish: ˇˇ
MacVietnamese: NO MATCH
ShiftJIS: ??
SystemDefault: ˇˇ
UTF16: NO MATCH
UTF16BE: NO MATCH
UTF16LE: NO MATCH
UTF32: NO MATCH
UTF32BE: NO MATCH
UTF32LE: NO MATCH
UTF8: NO MATCH
WindowsANSI: ÿÿ
WindowsArabic: ےے
WindowsBalticRim: ˙˙
WindowsCyrillic: яя
WindowsGreek: ??
WindowsHebrew: ??
WindowsKoreanJohab: ??
WindowsLatin1: ÿÿ
WindowsLatin2: ˙˙
WindowsLatin5: ÿÿ
WindowsVietnamese: ÿÿ
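Those rows line up with how raw bytes decode under each character set. Assuming the two problem bytes are 0xFF (an assumption on my part, but consistent with the ÿÿ and ˇˇ results above), the same experiment can be sketched in Python:

```python
# The same two trailing bytes decode differently (or fail) per encoding.
raw = b"Languages available: \xff\xff"  # assumed raw shell bytes
for enc in ("latin-1", "mac_roman", "utf-8"):
    try:
        print(enc, "->", raw.decode(enc))
    except UnicodeDecodeError:
        print(enc, "-> decode error")
```

A strict UTF decoder rejects the bytes outright, which matches the NO MATCH rows for the UTF encodings, while the 8-bit character sets each map 0xFF to some printable character.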

As a test, you might try piping (or redirecting) stdout of your shell to an actual file, then run this from a terminal:

file -I filename

That should give you an idea of the file encoding and character set.

I tried:

cmd = cmd + " > /users/scott/test/test.txt"
sh.Execute(cmd)

Interestingly, the resulting file doesn’t include everything Shell.Result does (part of the output probably goes to stderr, which > doesn’t capture; 2>&1 would), but it does have the part with the problematic characters. Running the file command on it reports text/plain; charset=iso-8859-1. That’s Latin-1, a superset of ASCII, and file likely picks it because every byte value is valid Latin-1.

Posting that test.txt file somewhere we can get it will go a long way toward resolving this.