Bug in String.ReplaceAll?

Aurelian_N · January 13, 2023, 2:06pm

Greetings,

I have the following code that takes the name of a file and tries to remove all kinds of shady characters that users put in, so the names are normalised and can be processed.

So far I have this in the code:

fName = fName.Trim
fName = fName.ReplaceAll(":", "")
fName = fName.ReplaceAll("..",".")
fName = fName.ReplaceAll(" .", ".")
fName = fName.ReplaceAll(" ;", "_")
fName = fName.ReplaceAll(";", "_")
fName = fName.ReplaceAll("/", "_")
fName = fName.ReplaceAll("\", "_")
fName = fName.ReplaceAll("`", "")

But apparently these calls are hit or miss rather than doing the job. I have files with ; and \ and / in their names, especially ones that come from Windows, and : as well, but so far this does not replace the characters at all.

Any idea how to do this effectively?

I have to be careful with the “.” so that it does not mess up the extension part.

Thanks.

Kem_Tekinay · January 13, 2023, 2:16pm

I’d use regular expressions for this, but that’s another story.

I don’t see a problem with your code so an example of where it isn’t working would be helpful.
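For reference, a regular-expression version of the same clean-up might look like the sketch below. It uses Xojo's RegEx class; the helper name CleanFileName and the exact patterns are assumptions made for illustration, not code from this thread, and it only covers the ASCII characters listed in the original post.

```xojo
// Hypothetical helper (name and patterns are assumptions, not from the thread).
Function CleanFileName(fName As String) As String
  Var rx As New RegEx
  rx.Options.ReplaceAllMatches = True

  // Drop ":" and "`" entirely.
  rx.SearchPattern = "[:`]"
  rx.ReplacementPattern = ""
  fName = rx.Replace(fName.Trim)

  // Turn ";", "/" and "\" (plus one optional space before them) into "_".
  // Xojo string literals do not treat "\" as an escape, so "\\" reaches
  // the regex engine as an escaped literal backslash.
  rx.SearchPattern = " ?[;/\\]"
  rx.ReplacementPattern = "_"
  fName = rx.Replace(fName)

  // Collapse a run of dots, and any spaces right before it, into one dot.
  rx.SearchPattern = " *\.+"
  rx.ReplacementPattern = "."
  fName = rx.Replace(fName)

  Return fName
End Function
```

Because the last pattern matches a whole run of dots at once, it does not need to be applied repeatedly the way a plain ".." to "." replacement does.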

Aurelian_N · January 13, 2023, 2:20pm

Especially “;”, “\”, or “/”. For example, I have the name below. After passing through that filter the name stays the same: the / after 2018 does not get replaced, which crashes another part of the app that identifies the UTI using file -i in a shell on Linux: “2018/_mnt_data1_eOR_snapshot_CAP007_CASE034_I037.JPG”

Kem_Tekinay · January 13, 2023, 2:23pm

I wrote a quick test app using that string and copy-and-pasting your code from above:

var fName as string = "2018/_mnt_data1_eOR_snapshot_CAP007_CASE034_I037.JPG"

AddToResult "Original: " + fName

fName = fName.Trim
fName = fName.ReplaceAll(":", "")
fName = fName.ReplaceAll("..",".")
fName = fName.ReplaceAll(" .", ".")
fName = fName.ReplaceAll(" ;", "_")
fName = fName.ReplaceAll(";", "_")
fName = fName.ReplaceAll("/", "_")
fName = fName.ReplaceAll("\", "_")
fName = fName.ReplaceAll("`", "")

AddToResult "Modified: " + fName

This is the output:

Original: 2018/_mnt_data1_eOR_snapshot_CAP007_CASE034_I037.JPG
Modified: 2018__mnt_data1_eOR_snapshot_CAP007_CASE034_I037.JPG

Aurelian_N · January 13, 2023, 2:52pm

As I said, it is not always working. I have cases where it works and cases where it does not, and if I run the code again it might fire. I have around 180,000 files to process. That is the weird part.

Rick_Araujo · January 13, 2023, 3:05pm

Get one of those, acquire the file name string by reading it directly from the system, and show us two things: 1. the string as shown, and 2. its EncodeHex(mystring, True).

Let’s inspect those bytes to understand the problem.
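A minimal sketch of that inspection step, assuming f is a FolderItem that already points at one of the problem files (the variable names are illustrative):

```xojo
// f is assumed to be a FolderItem for one of the failing files.
Var rawName As String = f.Name

// 1. The string as shown.
System.DebugLog("Name: " + rawName)

// 2. Its bytes in hex, with a space between each byte.
System.DebugLog("Hex:  " + EncodeHex(rawName, True))

// The encoding Xojo believes the string has; Nil means none is defined.
If rawName.Encoding Is Nil Then
  System.DebugLog("Encoding: none defined")
Else
  System.DebugLog("Encoding: " + rawName.Encoding.InternetName)
End If
```

In the hex dump a plain ASCII “/” is 2F, “\” is 5C, “:” is 3A and “;” is 3B. If the character on screen looks like one of those but the bytes differ, ReplaceAll with the ASCII character will not match it.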

Jeff_Tullin · January 13, 2023, 3:16pm

You say ‘code that gets the name of the file’ … does the file exist before you start looking at the name?

(Usually when stripping reserved characters from a proposed name, I do that on a name entered manually by the user, not on a pre-existing file.

So if they type “AB:CD/EF;”
I strip it back to AB_CD_EF before creating a FolderItem as
f = someparent.child(fname))

Julia_Truchsess · January 14, 2023, 1:10am

To answer the question in the topic title, no, I think it’s pretty unlikely there’s a bug in String.ReplaceAll. This function has been in the language since forever and I’ve never heard of any issues with it.

Aurelian_N · January 16, 2023, 1:11pm

Hello guys,

Sorry for the late reply. Apparently it is a total mess here: the files are stored on a Linux NAS, used on macOS, and most come from Windows, so the perfect combination.
Now the crazy part is that they contain all the possible forbidden characters, none of the OSes complained, and people keep on using them this way. To give you an example, I have “/Photos 02:18/12:30/mnt/data1/eOR\snapshot\CAP008\CASE039\I038.JPG” and many more that are a complete headache for sorting and processing.

So I tried the cleaning process on macOS and on Linux, with the same outcome: some fail, some pass. In this case the result was “12_30_mnt_data1_eOR_snapshot_CAP008_CASE039_I038.JPG”. I move the file to the new name, and then I run this in a shell:

sh.Execute("file --mime-type " + f.ShellPath)

For some reason I keep getting

18/_mnt_data1_eOR_snapshot_CAP008_CASE039_I038.JPG as the file extension instead of the proper image/jpeg. I guess it is because the : kicks in and the name is split, and something before that breaks the path.

I will try to see what happens with NativePath.

The idea is that I need to process all those files based on where they are located and upload them to an API, but I get around 300 failures because of those. In the end, if I still cannot get them to work, I will have to process them manually and be done.

Emile_Schwarz · January 16, 2023, 1:29pm

“:” is illegal on macOS, so you cannot have it in a file that came from macOS.

Advice:
Create a new project that deals exclusively with file name changes. When it works fine, you can incorporate it into the main project.
Doing so will avoid potential troubles with other parts of the current project (this can arise, sometimes) and will speed up the debug process…

Rick_Araujo · January 16, 2023, 1:44pm

Can you show us a small sample of the code with this happening? Maybe there’s some interference from the rest of the process, such as JSON strings being used improperly at some point and backslashes being interpreted as escape codes while decoding, or something like that.

Greg_O · January 16, 2023, 2:00pm

another suggestion…

Try looking at the file names in Terminal on macOS. That might give you a hint as to how macOS translates the file name.

Ian_Kennedy · January 16, 2023, 3:17pm

It is also illegal on Windows, except, for historical reasons, in drive names such as “C:” for the hard drive. On the Mac, : used to be the folder separator, but with Mac OS X the Unix / took over.

Eric_Wilson · January 16, 2023, 4:01pm

Different filesystems support different encodings for different countries, with different reserved words and characters. It’s a dog’s breakfast. The worst are spaces, which can be rendered as chr(10) or non-printing characters or the web code %20, I think. And when files are transferred from one OS to another and then to another, there can be subtle changes in a tiny percentage.

So maybe you should stick to what’s real by using ReplaceAllB / ReplaceAllBytes. Then you know if something isn’t looking right it can be fixed because you aren’t relying on Xojo to guess where the users are coming from.

Rick_Araujo · January 16, 2023, 4:11pm

Nope. It may silently break invisible UTF-8 composition sequences, resulting in a worse scenario.

I still want to see ONE complete example of a problem, as requested earlier.

From there we can evolve.

Eric_Wilson · January 16, 2023, 4:23pm

You would scan for the double byte offenders first I guess. There will always be some ambiguity with any name standardisation scheme.

Rick_Araujo · January 16, 2023, 4:54pm

What is necessary is to normalize all acquisition of those names to a common encoding; for current Xojo (and macOS and Linux), that is UTF-8. You shouldn’t go changing random bytes in UTF-8 compositions, because, for example, something like, I don’t know, “/” or “:” can be part of an internal 4-byte sequence shown to the user as just one char.
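One way to sketch that normalization step in Xojo is shown below. The helper name NormalizeToUTF8 is hypothetical, and treating a string with no defined encoding as UTF-8 is an assumption that may not hold for every source of these names.

```xojo
// Hypothetical helper: returns the name as UTF-8, or "" when the bytes
// are not valid UTF-8 and the name needs manual attention.
Function NormalizeToUTF8(rawName As String) As String
  If rawName.Encoding Is Nil Then
    // No encoding attached. ASSUMPTION: the bytes are already UTF-8.
    rawName = rawName.DefineEncoding(Encodings.UTF8)
  Else
    // A known encoding is attached, so convert the text to UTF-8.
    rawName = rawName.ConvertEncoding(Encodings.UTF8)
  End If

  // Reject byte sequences that are not well-formed UTF-8.
  If Not Encodings.UTF8.IsValidData(rawName) Then
    Return ""
  End If

  Return rawName
End Function
```

Running the character replacements only on names that pass this check keeps the clean-up working on whole characters rather than on raw bytes.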

Aurelian_N · January 16, 2023, 6:40pm

Well, how would you handle such files when you have no idea where they come from? Is there any recommended flow? We were supposed to have designated folders for each part, and we discovered that people use those folders to throw all kinds of data in there. The worst part is that they write poems in the file names, with a lot of characters that are forbidden on one platform or another, so I have no idea how macOS handles it under the hood. In Finder nothing breaks, but once I use code on my side it messes things up.

I do have a backup of the original folder structure, but it’s around 200 GB, so I don’t want to go crazy over each case. So far, out of 164942 files I have around 200 files with issues, so I could move the whole lot to a separate sorting folder and let them handle that manually on their side to avoid this headache.

Pawel_Soltysinski · January 16, 2023, 7:17pm

I am late to the topic, but I had this problem with semicolons, backslashes, etc. when trying to get paths of files from various systems. I cannot find the project right now, but the idea was that there may be more than one code for each of these special characters. Just open Show Emoji & Symbols from the keyboard menu in macOS and type “semicolon” in the search field and you will see… There is simply MORE than one code for semicolon, more than one code for backslash, and so on… It all depends on operating systems, locales and language versions.
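A short sketch of how such look-alike characters could be spotted and replaced in Xojo. It assumes fName already holds one problem name as a UTF-8 string; the two code points handled here (U+FF0F fullwidth solidus and U+FF1B fullwidth semicolon) are only examples, and the names in question may contain different ones.

```xojo
// fName is assumed to hold one problem file name, defined as UTF-8.
// List every character outside plain ASCII with its Unicode code point.
For Each ch As String In fName.Characters
  Var cp As Integer = ch.Asc
  If cp > 127 Then
    System.DebugLog("Non-ASCII character " + ch + " = U+" + cp.ToHex)
  End If
Next

// Example replacements for two look-alikes (illustrative, not exhaustive).
fName = fName.ReplaceAll(&uFF0F, "_") // fullwidth solidus, looks like "/"
fName = fName.ReplaceAll(&uFF1B, "_") // fullwidth semicolon, looks like ";"
```

The log output shows which code points actually occur, so the replacement list can be built from the real data instead of guessed.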

Side note: when replacing two of the same character with one, keep in mind that you may need to perform the replacement twice or more, since:

ReplaceAll("AxB","xx","x") gives you "AxB"
ReplaceAll("AxxxB","xx","x") gives you "AxxB"
ReplaceAll("AxxxxB","xx","x") gives you "AxxB"
ReplaceAll("AxxxxxB","xx","x") gives you "AxxxB"
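A minimal way to handle this, sketched here for the ".." case from the original post, is to repeat the replacement until no double dot remains:

```xojo
// Collapse any run of dots in fName down to a single dot.
// Each pass shortens the string, so the loop always terminates.
While fName.IndexOf("..") >= 0
  fName = fName.ReplaceAll("..", ".")
Wend
```

String.IndexOf returns -1 when the search string is not found, which ends the loop.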

Eric_Williams · January 16, 2023, 8:17pm

Check the text encoding of the filenames. It’s likely that they are inaccurate, giving you these oddball search and replace results.