Bug in String.ReplaceAll?

Rick_Araujo · January 13, 2023, 3:05pm

Take one of those files, read its name string directly from the system, and show us two things: 1. the string as shown, and 2. the result of EncodeHex(mystring, True).

Let’s inspect those bytes to understand the problem.
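A minimal sketch of how that dump could be produced, assuming API 2 and a hypothetical folder named "problem-files" on the Desktop (point it at wherever the failing files actually live):

```xojo
// Hypothetical location; replace with the folder that holds a failing file.
Var folder As FolderItem = SpecialFolder.Desktop.Child("problem-files")

If folder <> Nil And folder.Exists Then
  For Each f As FolderItem In folder.Children
    Var fileName As String = f.Name
    // 1. The string as shown
    System.DebugLog(fileName)
    // 2. Its bytes in hex, with a space between each byte
    System.DebugLog(EncodeHex(fileName, True))
  Next
End If
```

Comparing the two lines for a name that fails against one that passes should show whether the "same" character is really the same bytes.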

Jeff_Tullin · January 13, 2023, 3:16pm

You say “code that gets the name of the file”… does the file exist before you start looking at the name?

Usually, when stripping reserved characters from a proposed name, I do that on a name entered manually by the user, not on a pre-existing file.

So if they type “AB:CD/EF;”
I strip it back to AB_CD_EF before creating a folderitem as
f = someparent.child(fname)
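Roughly, that stripping step could look like the sketch below. CleanFileName and its list of reserved characters are my own illustration, not a built-in; adjust the list to the platforms you need to support.

```xojo
// Hypothetical helper: replace reserved characters in a user-entered name.
Function CleanFileName(proposed As String) As String
  // Assumed list of characters to strip; """" is a single double-quote character.
  Var reserved() As String = Array(":", "/", "\", ";", "*", "?", """", "<", ">", "|")

  Var cleaned As String = proposed
  For Each ch As String In reserved
    cleaned = cleaned.ReplaceAll(ch, "_")
  Next

  // Drop trailing underscores left behind by a reserved character at the end.
  While cleaned.Right(1) = "_"
    cleaned = cleaned.Left(cleaned.Length - 1)
  Wend

  Return cleaned
End Function
```

With this, CleanFileName("AB:CD/EF;") yields AB_CD_EF, which can then be passed to someparent.Child.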

Julia_Truchsess · January 14, 2023, 1:10am

To answer the question in the topic title, no, I think it’s pretty unlikely there’s a bug in String.ReplaceAll. This function has been in the language since forever and I’ve never heard of any issues with it.

Aurelian_N · January 16, 2023, 1:11pm

Hello guys,

Sorry for the late reply. Apparently it is a total mess here: the files are stored on a Linux NAS, used on macOS, and most come from Windows, so the perfect combination.
Now the crazy part is that they contain all the possible forbidden characters, none of the OSes complained, and they keep on using them this way. To give you an example, I have “/Photos 02:18/12:30/mnt/data1/eOR\snapshot\CAP008\CASE039\I038.JPG” and many more that are a complete headache for sorting and processing.

So I tried the cleaning process on macOS and on Linux, the same way; some fail, some pass. In this case the result was “12_30_mnt_data1_eOR_snapshot_CAP008_CASE039_I038.JPG”. I move the file to the new name, and when I run this in the shell:

sh.Execute("file --mime-type " + f.ShellPath)

For some reason I keep getting

18/_mnt_data1_eOR_snapshot_CAP008_CASE039_I038.JPG as the file type instead of the proper image/jpeg. I guess it is because the : kicks in and the output gets split there, and something before that breaks the path.
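One thing that might sidestep the splitting problem: file(1) has a -b (brief) option that leaves the file name out of its output, so there is nothing to split on. A sketch, assuming f is the FolderItem for the renamed file and API 2's Shell.ExitCode:

```xojo
// f is assumed to be the FolderItem of the file to inspect.
Var sh As New Shell

// -b prints only the MIME type, without the "name: " prefix.
sh.Execute("file -b --mime-type " + f.ShellPath)

If sh.ExitCode = 0 Then
  Var mimeType As String = sh.Result.Trim
  System.DebugLog(mimeType)
Else
  System.DebugLog("file failed: " + sh.Result)
End If
```

This does not explain why the path itself breaks, but it removes the need to parse a colon-separated line when the name may contain colons.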

I will try NativePath and see what happens.

The idea is that I need to process all those files based on where they are located and upload them to an API, but I get around 300 failures because of those names. In the end, if I still cannot get them through, I will have to process them manually and be done.

Emile_Schwarz · January 16, 2023, 1:29pm

“:” is illegal on macOS, so you cannot have it in a file that came from macOS.

Advice:
Create a new project that deals exclusively with changing file names. Once it works fine, you can incorporate it into the main project.
Doing so avoids potential trouble with other parts of the current project (this can happen, sometimes) and speeds up the debugging process…

Rick_Araujo · January 16, 2023, 1:44pm

Can you show us a small sample of the code where this happens? Maybe there’s some interference from the rest of the process, such as JSON strings being used improperly at some point and backslashes being interpreted as escape codes while decoding, or something like that.

This also may be important to fully understand:

Greg_O · January 16, 2023, 2:00pm

another suggestion…

Try looking at the file names in Terminal on macOS. That might give you a hint as to how macOS translates the file name.

Ian_Kennedy · January 16, 2023, 3:17pm

It is also illegal on Windows, at least for historical reasons: “C:” denotes the hard drive. On the Mac, “:” used to be the folder separator until Mac OS X, when the Unix “/” took over.

Eric_Wilson · January 16, 2023, 4:01pm

Different filesystems support different encodings for different countries, with different reserved words and characters. It’s a dog’s breakfast. The worst are spaces, which can show up as Chr(32), as non-printing characters, or as the web code %20, I think. And when files are transferred from one OS to another and then to another, there can be subtle changes in a tiny percentage of them.

So maybe you should stick to what’s real by using ReplaceAllB / ReplaceAllBytes. Then you know if something isn’t looking right it can be fixed because you aren’t relying on Xojo to guess where the users are coming from.

Rick_Araujo · January 16, 2023, 4:11pm

Nope. It may silently break invisible UTF-8 composition sequences, resulting in a worse scenario.

I still want to see ONE complete example of a problem as requested here:

From there we can evolve.

Eric_Wilson · January 16, 2023, 4:23pm

You would scan for the double byte offenders first I guess. There will always be some ambiguity with any name standardisation scheme.

Rick_Araujo · January 16, 2023, 4:54pm

What is necessary is to normalize all acquisition of those names to a common encoding; for current Xojo (and macOS and Linux), that is UTF-8. You shouldn’t play around with changing random bytes in UTF-8 compositions, because, for example, something such as, I don’t know, “/” or “:”, can be part of an internal 4-byte sequence shown to the user as just one char.
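A minimal sketch of that normalization step. NormalizeToUTF8 is a hypothetical helper name, and treating a string with no encoding as UTF-8 is an assumption you would need to verify against your real data:

```xojo
// Hypothetical helper: make sure a name is UTF-8 before any search and replace.
Function NormalizeToUTF8(raw As String) As String
  If raw.Encoding Is Nil Then
    // No encoding attached: ASSUME the bytes are already UTF-8.
    // DefineEncoding only labels the bytes; it does not change them.
    Return raw.DefineEncoding(Encodings.UTF8)
  End If

  // A known encoding: transcode the bytes to UTF-8.
  Return raw.ConvertEncoding(Encodings.UTF8)
End Function
```

After this, the character-based ReplaceAll works on whole characters rather than on raw bytes.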

Aurelian_N · January 16, 2023, 6:40pm

Well, how would you handle such files when you have no idea where they come from? Is there any recommended flow? So far we were supposed to have designated folders for each part, and we discovered that they use those folders to throw all kinds of data in there. The worst part is that they write poems in the file names, with a lot of characters that are forbidden on one platform or another. I have no idea how macOS handles it under the hood, as nothing breaks in Finder, but once I use code on my side it messes things up.

I do have a backup of the original folders structure but it’s around 200 GB so i don’t want to go like crazy for each case . So far from 164942 files i have around 200 files with issues so i could try to move the whole folder to a separate sorting folder and let them handle that manually on their side to avoid this headache

Pawel_Soltysinski · January 16, 2023, 7:17pm

I am late to the topic, but I had this problem with semicolons, backslashes, etc. when trying to get paths of files from various systems. I cannot find the project right now, but the idea was that there can be more than one code for each of these special characters. Just open Show Emoji & Symbols from the keyboard menu in macOS and type “semicolon” in the search field and you will see… There is simply MORE than one code for semicolon, more than one code for backslash, and so on… it all depends on operating systems, locales and language versions.

Side note: when replacing two identical characters with one, keep in mind that you need to perform the replacement twice or more, since:

ReplaceAll("AxB","xx","x") gives you "AxB"
ReplaceAll("AxxxB","xx","x") gives you "AxxB"
ReplaceAll("AxxxxB","xx","x") gives you "AxxB"
ReplaceAll("AxxxxxB","xx","x") gives you "AxxxB"
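One way to handle this is to repeat the replacement until no doubled character is left. A sketch, with CollapseRuns as a hypothetical helper name and API 2's zero-based IndexOf (which returns -1 when nothing is found):

```xojo
// Hypothetical helper: collapse any run of ch down to a single ch.
Function CollapseRuns(source As String, ch As String) As String
  // Nothing sensible to collapse for an empty search string.
  If ch = "" Then Return source

  Var doubled As String = ch + ch
  Var result As String = source

  // Each pass shortens the runs; loop until no doubled ch remains.
  While result.IndexOf(doubled) >= 0
    result = result.ReplaceAll(doubled, ch)
  Wend

  Return result
End Function
```

CollapseRuns("AxxxxxB", "x") then gives "AxB" instead of the "AxxxB" that a single ReplaceAll pass leaves behind.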

Eric_Williams · January 16, 2023, 8:17pm

Check the text encoding of the filenames. It’s likely that they are inaccurate, giving you these oddball search and replace results.
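A quick diagnostic along these lines, assuming f is a FolderItem for one of the problem files:

```xojo
// f is assumed to be the FolderItem of a file whose name misbehaves.
Var fileName As String = f.Name
Var enc As TextEncoding = fileName.Encoding

If enc Is Nil Then
  System.DebugLog("No encoding defined for this name")
Else
  System.DebugLog("Encoding: " + enc.InternetName)
End If

// Check whether the raw bytes are well-formed UTF-8 at all.
If Not Encodings.UTF8.IsValidData(fileName) Then
  System.DebugLog("Not valid UTF-8: " + EncodeHex(fileName, True))
End If
```

If a name reports no encoding, or fails the validity check, the search and replace results are not trustworthy until that is sorted out.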

Eric_Williams · January 16, 2023, 8:18pm

No - do not do this. Work within the text encoding scheme, or suffer a lifetime of misery in your code and support tickets.

Thomas_Tempelmann · January 17, 2023, 12:25am

As I explained before here in the forum just a week or so ago, “:” is NOT illegal on MacOS. It depends on the API that you use. If you use an API / function that uses POSIX paths (with “/”) as Separators, then you can use “:” in the file or directory name. OTOH, if you deal with functions based on the old HFS naming scheme, as the Finder does, and also FolderItem.Name and FolderItem.Child, then you use instead a “/”, which is then legal.
“/” shows where “:” does not, and vice versa.
So, there are effectively only two illegal chars:
The zero byte (even that used to be legal but is now causing problems) and either “:” or “/” depending on the API - but they’re interchangeable.
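To see the swap on your own files, you could log the same item through the different properties. This is only a sketch, with f assumed to be a FolderItem for a file whose name shows a “/” in the Finder:

```xojo
// f is assumed to be a FolderItem for one of the problem files.
System.DebugLog("Name:       " + f.Name)
System.DebugLog("NativePath: " + f.NativePath)
System.DebugLog("ShellPath:  " + f.ShellPath)
```

Going by the description above, the separator character that appears in Name on macOS should show up as the other one in the POSIX-style paths; the snippet just lets you confirm that for a concrete file.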

Thomas_Tempelmann · January 17, 2023, 12:35am

First of all, replacing “…” into “.” has to be done repeatedly, until there are no “…” left, or you’ll get a “…” wrong.
And, as you already wrote in a previous post, I believe, you need to check if a name ends with a period, and then remove that, too.

It appears you do not retrieve the names correctly, or do not rename them in the correct way.

Basically, you’d get the name from FolderItem.Name, put it into a String, then do your replacements on the string, and finally assign that cleaned string to the Name property of the file to rename it, or create a new FolderItem with the Child() function, passing that cleaned string.
Also, you may want to first separate the extension from the name, by finding the last period, using the InStr function, like this (out of my head):

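dim fname as String = "test.txt"
dim start as Integer   // search position; Instr is 1-based and returns 0 when nothing is found
dim extpos as Integer  // position of the last period, 0 if there is none
do
  start = fname.Instr(start+1, ".")
  if start = 0 then exit  // leave the loop ("break" would only stop in the debugger)
  extpos = start
loop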

Now extpos should have the position of the extension, which you can extract with Mid(extpos), and then shorten the fname with Left(extpos-1). But don’t do this if extpos = 0, because then there was no dot.

Which of this is a file name and what is a directory path? A file name can’t have both the “:” and “/” in its name, as neither macOS nor Linux nor Windows would allow this. What you’re showing is clearly a path, not a file name. That confusion alone suggests that you’re not understanding the difference between a pure file name and a path. Get this right, first of all.
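Put together, the whole flow (get the name, split off the extension, clean, rename) might look like the sketch below. RenameWithCleanName and the three replacement rules are illustrative assumptions, not a complete list of what needs replacing:

```xojo
// Hypothetical helper: clean the base name, keep the extension, then rename.
Sub RenameWithCleanName(f As FolderItem)
  Var fname As String = f.Name
  Var ext As String

  // Find the last period (1-based position, 0 if there is none).
  Var start, extpos As Integer
  Do
    start = fname.InStr(start + 1, ".")
    If start = 0 Then Exit
    extpos = start
  Loop

  // Only split when a dot was actually found.
  If extpos > 0 Then
    ext = fname.Mid(extpos)         // includes the leading period
    fname = fname.Left(extpos - 1)
  End If

  // Assumed replacement rules; adjust to your own list of characters.
  fname = fname.ReplaceAll(":", "_").ReplaceAll("/", "_").ReplaceAll("\", "_")

  // Assigning to Name renames the file on disk.
  Var cleaned As String = fname + ext
  If cleaned <> f.Name Then f.Name = cleaned
End Sub
```

Working on FolderItem.Name only, never on the full path, keeps the separators of the parent folders out of the replacement.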

Also, perhaps you should change the title of this topic, because it’s rather “How do I convert file names for Windows?”

Eric_Wilson · January 18, 2023, 8:14am

What text encoding scheme if a filename has been moved around systems? The encoding scheme cannot be assumed can it? If it can be assumed, why isn’t ReplaceAll working in all cases?

Eric_Wilson · January 18, 2023, 8:39am

“Linux” will allow a colon “:” in the filename. Only a slash and NULL are forbidden. However, as soon as a USB stick is plugged into a Linux desktop, we are probably talking VFAT. A lot of systems have adopted a version of UTF-8. However, the basic 7-bit ASCII characters are all single bytes anyway, so if there are issues with ReplaceAll they will probably be encoding-related. Linux file systems store filenames byte-for-byte, unencoded, for the reasons we are talking about here. In IT it’s best if there’s only one source of truth where possible.