FolderItem.name encoding subtly different between 2019R1.1 and 2019R3

  2. Tim S

    Jan 13 Pre-Release Testers Canterbury, UK

    @JonathanAshwell Krämer 2013.pdf

    In R1.1, the bytes are

    4B72 C3A4 6D65 7220 3230 3133 2E70 6466

    In R3 they are

    4B72 61CC 886D 6572 2032 3031 332E 7064 66.

    R1.1

    U+00E4 ä c3 a4 LATIN SMALL LETTER A WITH DIAERESIS

    R3

    61 is 'a', then followed by:

    U+0308 ̈ cc 88 COMBINING DIAERESIS

    (UTF-8 info from https://www.utf8-chartable.de/unicode-utf8-table.pl )
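The two byte sequences can be reproduced outside Xojo. The following is a minimal sketch in Python, not Xojo, chosen only because its standard library exposes Unicode normalization through `unicodedata.normalize`. It assumes the R1.1 name is in composed (NFC) form and the R3 name in decomposed (NFD) form, which is what the bytes above suggest; `hex_bytes` is a hypothetical helper.

```python
import unicodedata

name = "Kr\u00e4mer 2013.pdf"  # written with the precomposed ä (U+00E4)

nfc = unicodedata.normalize("NFC", name)  # composed form, matching the R1.1 bytes
nfd = unicodedata.normalize("NFD", name)  # decomposed form, matching the R3 bytes


def hex_bytes(s):
    """Return the UTF-8 bytes of s as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in s.encode("utf-8"))


print(hex_bytes(nfc))  # 4B 72 C3 A4 6D 65 72 20 32 30 31 33 2E 70 64 66
print(hex_bytes(nfd))  # 4B 72 61 CC 88 6D 65 72 20 32 30 31 33 2E 70 64 66
```

The composed name is 16 bytes and the decomposed one 17, because U+00E4 takes two bytes in UTF-8 while 'a' plus U+0308 takes three.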

  3. Jonathan A

    Jan 13 Pre-Release Testers Maryland, USA

    You may not be surprised but I certainly didn't expect this change in behavior (or I wouldn't have spent hours tracking down the problem). I understand why it happened and didn't post here because I thought it was a bug. I posted so that others who encounter this might figure it out more quickly than I did.

  4. Tim S

    Jan 13 Pre-Release Testers Canterbury, UK

    @JonathanAshwell You may not be surprised but I certainly didn't expect this change in behavior (or I wouldn't have spent hours tracking down the problem). I understand why it happened and didn't post here because I thought it was a bug. I posted so that others who encounter this might figure it out more quickly than I did.

    While that is good, better would be to request a note in the documentation.

  5. Tim P

    Jan 13 Pre-Release Testers, Xojo Pro Rochester, NY

    @Tim S While that is good, better would be to request a note in the documentation.

    http://docs.xojo.com/Resources:2019r2_Release_Notes
    "FolderItem updated to use latest OS APIs on macOS."

    Knowing what the side effects of that may be is just too much to ask. FolderItem behaves differently, and it is documented why.

  6. Michael D

    Jan 13 Pre-Release Testers, Xojo Pro

    I also find this issue shows up when moving files between Mac and Windows filesystems (prior to 2019).

    See https://forum.xojo.com/20173-bug-in-replaceall/p1#p169277 for an MBS-based way to compose or decompose.

  7. Tim S

    Jan 13 Pre-Release Testers Canterbury, UK

    @Tim P http://docs.xojo.com/Resources:2019r2_Release_Notes
    "FolderItem updated to use latest OS APIs on macOS."

    Knowing what the side effects of that may be is just too much to ask. FolderItem behaves differently, and it is documented why.

    Release Notes are interesting, but shouldn't be a source of documentation, any more than threads here should be. I think it's entirely reasonable for the change in OS APIs on macOS to be mentioned on the doc page for FolderItem, along with any side-effects that people can report as and when they find them.

    In fact, more generally, one might copy what the PHP pages do, where each PHP documentation page includes, at the end, user-contributed notes which would be ideal for this sort of thing.

  8. Michael H

    Jan 13 Pre-Release Testers, Xojo Pro Europe (Hamburg, Germany)

    In Unicode, different (i.e. decomposed and precomposed) sequences of code points – and thus bytes – may represent the same character and are to be treated as canonically equivalent. Apple appears to prefer fully decomposed file names but thankfully normalizes names received from foreign file systems so on macOS there cannot be two files with visually identical names that are yet considered distinct.

    The upshot is that one cannot rely on some text to always be represented by the same sequence of bytes, even when the encoding is the same.
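To make the point concrete, here is a small sketch in Python (not Xojo) of a comparison that respects canonical equivalence by normalizing both sides first. `canonically_equal` is a hypothetical helper name, and the choice of NFC over NFD is arbitrary; either works as long as both sides use the same form.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# "Krämer" built from the two byte sequences discussed in this thread.
composed = b"\x4B\x72\xC3\xA4\x6D\x65\x72".decode("utf-8")        # precomposed ä
decomposed = b"\x4B\x72\x61\xCC\x88\x6D\x65\x72".decode("utf-8")  # 'a' + U+0308

print(composed == decomposed)                   # False: the code points differ
print(canonically_equal(composed, decomposed))  # True: same text after normalization
```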

  9. Christian S

    Jan 14 Pre-Release Testers, Xojo Pro, XDC Speakers, Third Party Store Germany

    If you use MBS Xojo Plugins, you can use ConvertUnicodeToCharacterCompositionMBS and ConvertUnicodeToCharacterDecompositionMBS to convert.

  10. Norman P

    Jan 14 Pre-Release Testers, Xojo Pro outside

    @Michael H In Unicode, different (i.e. decomposed and precomposed) sequences of code points – and thus bytes – may represent the same character and are to be treated as canonically equivalent. Apple appears to prefer fully decomposed file names but thankfully normalizes names received from foreign file systems so on macOS there cannot be two files with visually identical names that are yet considered distinct.

    The upshot is that one cannot rely on some text to always be represented by the same sequence of bytes, even when the encoding is the same.

    Except Xojo does NOT behave this way, although maybe it should?

    // "Krämer 2013.pdf" with a precomposed ä (C3 A4), as reported for R1.1
    Dim mb1 As New MemoryBlock(16)
    mb1.UInt8Value(0) = &h4B
    mb1.UInt8Value(1) = &h72 
    mb1.UInt8Value(2) = &hC3
    mb1.UInt8Value(3) = &hA4 
    mb1.UInt8Value(4) = &h6D
    mb1.UInt8Value(5) = &h65 
    mb1.UInt8Value(6) = &h72
    mb1.UInt8Value(7) = &h20 
    mb1.UInt8Value(8) = &h32
    mb1.UInt8Value(9) = &h30 
    mb1.UInt8Value(10) = &h31 
    mb1.UInt8Value(11) = &h33
    mb1.UInt8Value(12) = &h2E 
    mb1.UInt8Value(13) = &h70
    mb1.UInt8Value(14) = &h64 
    mb1.UInt8Value(15) = &h66
    
    // The same name with a decomposed ä ('a' followed by CC 88), as reported for R3
    Dim mb2 As New MemoryBlock(17)
    mb2.UInt8Value(0) = &h4B
    mb2.UInt8Value(1) = &h72 
    mb2.UInt8Value(2) = &h61
    mb2.UInt8Value(3) = &hCC 
    mb2.UInt8Value(4) = &h88
    mb2.UInt8Value(5) = &h6D 
    mb2.UInt8Value(6) = &h65
    mb2.UInt8Value(7) = &h72 
    mb2.UInt8Value(8) = &h20
    mb2.UInt8Value(9) = &h32 
    mb2.UInt8Value(10) = &h30
    mb2.UInt8Value(11) = &h31 
    mb2.UInt8Value(12) = &h33
    mb2.UInt8Value(13) = &h2E 
    mb2.UInt8Value(14) = &h70
    mb2.UInt8Value(15) = &h64 
    mb2.UInt8Value(16) = &h66
    
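    // Interpret both byte sequences as UTF-8 text
    Dim str1 As String = DefineEncoding(mb1, Encodings.UTF8)
    Dim str2 As String = DefineEncoding(mb2, Encodings.UTF8)
    
    // MemoryBlock comparison: the two blocks differ, so this Break is not reached
    If mb1 = mb2 Then
      Break
    End If
    
    // String comparison with =: not reached either, although the text is canonically equivalent
    If str1 = str2 Then
      Break
    End If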
    

    You will not hit either breakpoint.

  11. Norman P

    Jan 14 Pre-Release Testers, Xojo Pro outside

    Xojo does not correctly implement canonical equivalence from the Unicode standard;
    see https://unicode.org/reports/tr15/
    Feedback Case #58838

  12. Michael H

    Jan 14 Pre-Release Testers, Xojo Pro Europe (Hamburg, Germany)

    @Norman P Except Xojo does NOT behave this way, although maybe it should?

    Yeah, I suppose the operator = should respect canonical equivalence. It is probably a good idea to use String.Compare instead – which already does.

  13. Norman P

    Jan 14 Pre-Release Testers, Xojo Pro outside

    It would not surprise me to find that most people would expect = to do the same

  14. Tim J

    Jan 15 Pre-Release Testers N. Phoenix, AZ
    Edited 2 weeks ago

    I have to thank Jonathan for starting this thread. I ran into this and just chalked it up to one more reason for me to completely ignore r2.1+ for real work.

  15. John M

    Jan 15 Pre-Release Testers, Xojo Pro New York / New Jersey

    @Michael H Yeah, I suppose the operator = should respect canonical equivalence. It is probably a good idea to use String.Compare instead – which already does.

    This seems like something that would be easy for Xojo to implement, so we don't have to wade through miles of code looking for places where strings are compared using =. Replacing = with a function everywhere would be tedious in the extreme, to say nothing of prone to mistakes.

    Or is there some existing easy way to implement this?

  16. Norman P

    Jan 15 Pre-Release Testers, Xojo Pro outside

    Xojo already has it implemented, just not when using the = operator between two strings :(
    Making that work should be VERY doable.

  17. Michael H

    Jan 15 Pre-Release Testers, Xojo Pro Europe (Hamburg, Germany)

    @John M This seems like something that would be easy for Xojo to implement, so we don't have to wade through miles of code looking for places where strings are compared using =. Replacing = with a function everywhere would be tedious in the extreme, to say nothing of prone to mistakes.

    Testing for equality with = is much faster than using String.Compare and I suppose we wouldn't want = to become less efficient when comparing strings. I have no idea what fixing = would entail with regard to speed. It might be preferable to normalize strings internally so any code could rely on canonically equivalent text to be represented by the same sequence of code points and bytes (always precomposed or always decomposed, whatever).

    Until canonical equivalence is properly implemented throughout Xojo, using String.Compare for any text received from the outside world may yet be your best bet.

    @John M Or is there some existing easy way to implement this?

    I have tried converting the encoding from UTF8 to some other variant of Unicode and back, hoping the text would be normalized in the process, but to no avail.
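That result is what one would expect from transcoding in general: converting between Unicode encodings changes how each code point is stored, not which code points are present. A small sketch in Python (not Xojo, and making no claim about how Xojo's own conversion is implemented) shows a decomposed string surviving a UTF-16 round trip unchanged, while explicit normalization does compose it:

```python
import unicodedata

decomposed = "Kra\u0308mer"  # 'a' followed by U+0308 COMBINING DIAERESIS

# Round-trip through UTF-16: the code points come back exactly as they went in.
round_tripped = decomposed.encode("utf-16-le").decode("utf-16-le")

# Explicit normalization is what actually merges 'a' + U+0308 into U+00E4.
composed = unicodedata.normalize("NFC", decomposed)

print(round_tripped == decomposed)  # True: transcoding did not normalize anything
print(len(round_tripped))           # 7 code points
print(len(composed))                # 6 code points
```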

  18. Kevin G

    Jan 17 Pre-Release Testers, Xojo Pro Gateshead, England

    @Michael H Testing for equality with = is much faster than using String.Compare and I suppose we wouldn't want = to become less efficient when comparing strings. I have no idea what fixing = would entail with regard to speed. It might be preferable to normalize strings internally so any code could rely on canonically equivalent text to be represented by the same sequence of code points and bytes (always precomposed or always decomposed, whatever).

    We wouldn't want strings to be normalized automatically or the performance of = to be reduced.
    Maybe the strings should have a Normalize method for people who need it or just use the String.Compare method that is available today.
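One way to read the "Normalize for people who need it" suggestion is to normalize only at the boundary, when a name arrives from the file system, and leave every later comparison and lookup untouched. The sketch below illustrates that idea in Python rather than Xojo; `normalize_name` is a hypothetical helper, and the choice of NFC is an assumption.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Hypothetical helper: bring a file name to NFC before it is stored or compared."""
    return unicodedata.normalize("NFC", name)


# The same file name as it might arrive from two different sources.
from_old_build = "Kr\u00e4mer 2013.pdf"   # precomposed ä
from_new_build = "Kra\u0308mer 2013.pdf"  # decomposed ä

# Keys are normalized once, when they enter the program.
index = {normalize_name(from_old_build): "record 1"}

print(from_new_build in index)                  # False: raw lookup misses
print(normalize_name(from_new_build) in index)  # True: normalized lookup hits
```

With this approach ordinary equality stays as fast as before, and the normalization cost is paid once per incoming name.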

  19. Norman P

    Jan 17 Pre-Release Testers, Xojo Pro outside
    Edited 2 weeks ago

    That = doesn't treat ü and ü as "equal" is really a problem.
    I would expect that most users would be VERY surprised that it doesn't, and it would be the unusual case where you would NOT want it to do that.

  20. Tim S

    Jan 17 Pre-Release Testers Canterbury, UK

    @Norman P that = doesn't treat ü and ü as "equal" is really a problem

    Yes. Although the root of the problem seems to be Unicode allowing both decomposed and precomposed characters in the first place. But given that situation, treatment should be transparent.

  21. Robert W

    Jan 17 Western Canada

    Just out of curiosity, does FolderItem.PathTypeURL give both precomposed and decomposed characters, or does it convert them to a single type?
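What Xojo returns here is not established in this thread. As background only, percent-encoding by itself does not unify the two forms, since it works on the UTF-8 bytes. A sketch using Python's `urllib.parse.quote` (not Xojo's URL path handling) shows the two spellings producing different URLs unless the name is normalized first:

```python
from urllib.parse import quote

# Percent-encode the same visible name in its two Unicode forms.
precomposed_url = quote("Kr\u00e4mer 2013.pdf")  # ä as U+00E4
decomposed_url = quote("Kra\u0308mer 2013.pdf")  # 'a' followed by U+0308

print(precomposed_url)  # Kr%C3%A4mer%202013.pdf
print(decomposed_url)   # Kra%CC%88mer%202013.pdf
```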
