I am trying to compare two names and figure out if they are the same, different, same but case different…
This code just doesn’t want to work… (Maybe it’s late…)
Test vs TEST … says they are just different.
if StrComp(self.Name, rhs.Name, REALbasic.StrCompLexical) = 0 then
  System.DebugLog("Same Name but case may different")
  if StrComp(self.Name, rhs.Name, REALbasic.StrCompCaseSensitive) <> 0 then
    System.DebugLog("Same Name :" + self.Name + ": vs :" + rhs.Name + ": case different")
    return 1
  end if
else
  System.DebugLog("Different Name :" + self.Name + ": vs :" + rhs.Name + ":")
  return -1
end if
You are using REALbasic.StrCompCaseSensitive, which means that the comparison is case sensitive
Therefore Test is indeed different in comparison with TEST.
Try REALbasic.StrCompLexical instead.
The Lexical comparison option only affects how text gets ordered (i.e. it gets ordered differently than if you used “<” and “>” to compare strings).
If you want to actually ignore case, then convert both strings to lowercase first.
good idea, but…
…there are some other things to consider when doing that.
First off: do both strings being compared have the same Encoding?
And if so: which Encoding?
WindowsANSI: Lowercase is kind of broken… especially with special chars (Umlaut), such as “Ü”. See <https://xojo.com/issue/54926>
UTF8: Converting to UTF8 is a workaround for the WindowsANSI issue. You might have UTF8 strings anyway. But… where do they come from? Are they both composed the same way (precomposed or decomposed)? If not, “ü = ü” might be false, since there are different binary representations for the very same “visual” character. So keep that in mind when doing binary comparisons of UTF8 strings (e.g. with StrComp).
I haven’t found much in Feedback regarding UTF8 and pre/decomposed Strings, and how to normalize them in Xojo… An old one talking about issues is <https://xojo.com/issue/19163>. Shouldn’t there be one requesting a way to normalize UTF8 Strings?
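To illustrate the precomposed/decomposed point, here is a minimal sketch. It is written in Python rather than Xojo, purely as an assumption for illustration: `unicodedata.normalize` is Python's standard library, not a Xojo API, and `nfc_equal` is a hypothetical helper name. It shows that the two forms of “ü” differ at the byte level and only compare equal after both are normalized to the same form (NFC here).

```python
import unicodedata

precomposed = "\u00fc"   # "ü" as a single code point (U+00FC)
decomposed = "u\u0308"   # "u" followed by COMBINING DIAERESIS (U+0308)

# Same visible character, but different code points and different UTF-8 bytes.
raw_equal = precomposed == decomposed
utf8_precomposed = precomposed.encode("utf-8")   # two bytes: C3 BC
utf8_decomposed = decomposed.encode("utf-8")     # three bytes: 75 CC 88


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(raw_equal)                            # False
print(nfc_equal(precomposed, decomposed))   # True
```

Any binary comparison has the same blind spot, so normalization has to happen before the comparison, whatever tool performs it.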
I want to know that the strings are
a) Test <> Test (Different character set encoding)
b) Test = TEST (Lexically the same)
c) Test <> TEST. (Different because of case)
i.e. I want to KNOW that they differ only by case.
Maybe that’s clearer.
if self.Name.Encoding <> rhs.Name.Encoding then
  System.DebugLog("Character encodings are different.")
  return -1
end if
if self.Name <> rhs.Name then
  System.DebugLog("The names are lexically different")
  return -1
end if
if StrComp(self.Name, rhs.Name, REALbasic.StrCompCaseSensitive) = 0 then
  System.DebugLog("The names are identical")
  return 0
end if
return -1
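The three outcomes being asked for (identical, same apart from case, different) can be sketched as a single classification function. This is Python, not Xojo, and rests on stated assumptions: `classify` is a hypothetical name, “differs only by case” is taken to mean “equal after Unicode case folding”, and both strings are normalized to NFC first. Python strings carry no encoding, so the “different encoding” case from the question is not modeled here.

```python
import unicodedata


def classify(a: str, b: str) -> str:
    """Return 'identical', 'case-only', or 'different' (hypothetical helper)."""
    # Normalize first so precomposed/decomposed forms do not count as a difference.
    a_norm = unicodedata.normalize("NFC", a)
    b_norm = unicodedata.normalize("NFC", b)
    if a_norm == b_norm:
        return "identical"
    # Assumption: "same except for case" means equal after case folding.
    if a_norm.casefold() == b_norm.casefold():
        return "case-only"
    return "different"


for left, right in [("Test", "Test"), ("Test", "TEST"), ("Test", "Tent")]:
    print(left, right, classify(left, right))
```

The exact check comes first and the case-folded check second, so every pair ends up in exactly one bucket that can be logged.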
[quote=429605:@Thomas Tempelmann]
If you want to actually ignore case, then convert both strings to lowercase first.[/quote]
There are languages where this literally can't work, oddly enough.
I think Turkish is one, where some uppercase letters have no lowercase equivalent (maybe I have that backwards).
By lowercasing you actually change the “words”.
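A small sketch of the Turkish dotted/dotless “i” problem, again in Python rather than Xojo. It shows Python's locale-independent Unicode case mapping; whether Xojo's `Lowercase` behaves the same way is not something this example establishes.

```python
dotted_capital = "\u0130"   # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"    # ı  LATIN SMALL LETTER DOTLESS I

# Lowercasing İ yields two code points: "i" plus COMBINING DOT ABOVE (U+0307).
lowered = dotted_capital.lower()

# Uppercasing ı gives plain "I", and lowercasing that gives dotted "i",
# so a round trip through case conversion changes the letter.
round_trip = dotless_small.upper().lower()

print(len(lowered))                  # 2
print(round_trip == dotless_small)   # False
```

Lowercasing before comparing can therefore merge letters that a Turkish reader treats as distinct, which is the “changing the words” risk mentioned above.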
I’m not trying to ignore the case; I’m trying to log why the comparison of two strings has failed.
Either the encodings don’t match, or the case doesn’t match, or it’s a match. But I have to log why…
We have NSStringCompareMBS in the plugin, which uses Apple’s framework to compare and you can specify options like case insensitive, diacritics insensitive and width differences insensitive.
Between UTF8 and ASCII, an encoding difference will only show up for characters above 0x7F, including those that are multi-byte; all characters at or below 0x7F will compare identically in either encoding.
Yes, they compare fine… but I thought it wasn’t a sure thing.
I am learning that some encodings are partially compatible with each other.
Anything 7-bit in ASCII will compare equal with UTF8, but not values of 128 and above.
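The partial compatibility can be checked at the byte level. This is a Python sketch for illustration only (not Xojo code); the sample strings are arbitrary, and `latin-1` is Python's name for ISO Latin-1.

```python
plain = "Test"          # only 7-bit characters
accented = "T\u00ebst"  # "Tëst": contains U+00EB, which is above 0x7F

# 7-bit text produces the same bytes in ASCII, UTF-8 and ISO Latin-1.
plain_same = plain.encode("ascii") == plain.encode("utf-8") == plain.encode("latin-1")

# Above 0x7F the encodings diverge: one byte in Latin-1, two bytes in UTF-8.
latin1_bytes = accented.encode("latin-1")   # 54 EB 73 74
utf8_bytes = accented.encode("utf-8")       # 54 C3 AB 73 74

print(plain_same)                   # True
print(latin1_bytes == utf8_bytes)   # False
```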
I’m still trying to figure out if there is a way to promote all strings to UTF-16 or ISO Latin and then compare in that encoding…
If you “promote to UTF-16” then the encoding will be UTF-16… at that point, what its previous encoding was no longer matters.
So if you take a string encoded as UTF-8 and a string encoded as ISO-Latin, and promote each to UTF-16, you now have TWO UTF-16 strings… and those ASCII characters will simply become 00xx, where xx was the UTF-8/ASCII character.
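As a rough illustration of “promoting” both strings to UTF-16, here is a Python sketch (an assumption for illustration, not Xojo code). It decodes the same text from UTF-8 bytes and from ISO Latin-1 bytes, then re-encodes both as big-endian UTF-16 so the byte layout is easy to read.

```python
# The same text "Tëst" as it would arrive in two different encodings.
from_utf8 = b"T\xc3\xabst".decode("utf-8")
from_latin1 = b"T\xebst".decode("latin-1")

# After converting both to UTF-16 (big-endian, no byte order mark),
# the original encoding can no longer be told apart.
utf16_a = from_utf8.encode("utf-16-be")
utf16_b = from_latin1.encode("utf-16-be")

print(utf16_a == utf16_b)   # True
print(utf16_a.hex(" "))     # 00 54 00 eb 00 73 00 74
```

Each ASCII character becomes 00xx, as described above. The flip side is that once both strings are converted, the “different encoding” case can no longer be detected, so that check has to happen before the conversion.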