I have a big problem. I have a global array that contains all the first-level domains (.com, .eu, .it, .de…), and a method that reads a string, extracts every .something substring, and checks whether that .something is in the first-level domains array.
In some cases this method does not work. For example: I have a .doc file, I open it in OpenOffice, select all the text, copy and paste it into my app, then launch my scan method… it works! The other case: I open the same .doc file in my app, read it as a BinaryStream, and pass the string I read to the scan method. In debug mode I can see that the .something string is being processed, but the IndexOf function returns -1!
I’ve read in the Xojo documentation that “IndexOf is not case-sensitive, but it is encoding-sensitive”, so I think that could be my problem.
The line of my scan method that gives the two different results is:
if FirstLevelDomains.IndexOf(FoundString)>-1 then
FoundString is converted with DefineEncoding(string, Encodings.UTF8).
Does anyone have suggestions? Many thanks
Could it be that FoundString contains a character you can’t see in debug mode (unless you switch from text to binary view)? For example, it could be a null-terminated UTF-8 string.
Without seeing your code this is really hard to debug. If you have plain vanilla top-level domains, encoding shouldn’t matter, at least among ASCII-compatible encodings. The same goes for pre-composed or de-composed characters. Have you tried InStr instead of an equality comparison?
Not sure whether it has any bearing on the issue, but DefineEncoding does not convert the string; it only declares its encoding to be UTF-8 (whatever the bytes actually are). Did you make sure that it really is UTF-8?
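To make the difference concrete, here is a small sketch. The sample bytes are an assumption (".com" as it would look if the text were stored as UTF-16LE), not something taken from the original file:

```xojo
// Hypothetical sample: ".com" stored as UTF-16LE, one zero byte after each character
Dim raw As String = DecodeHex("2E0063006F006D00")

// DefineEncoding only relabels the string; the bytes stay as they were
Dim relabeled As String = DefineEncoding(raw, Encodings.UTF8)
System.DebugLog("relabeled bytes: " + Str(LenB(relabeled)))   // still 8 bytes

// Declaring the real encoding first and then converting rewrites the bytes
Dim converted As String = ConvertEncoding(DefineEncoding(raw, Encodings.UTF16LE), Encodings.UTF8)
System.DebugLog("converted bytes: " + Str(LenB(converted)))   // 4 bytes
```

Only the converted string has the same bytes as a ".com" literal typed in the IDE, so only that one would be expected to match in IndexOf.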
You have to check the string’s encoding before you set it.
If it’s Nil and YOU KNOW that it is UTF-8, you can define it as UTF-8.
If it’s not Nil, then you have to use ConvertEncoding.
If it’s Nil and you DON’T KNOW its real encoding, you have to find it out before defining it (and maybe converting afterwards).
Usually, when reading from a binary stream or a socket, you get a string with a Nil encoding. But you have to be sure that it’s UTF-8 before defining it as UTF-8.
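The rules above could be sketched roughly like this. ToUTF8 is a hypothetical helper name, and returning an empty string for the "don't know" case is just one possible choice:

```xojo
// Hypothetical helper: returns s as UTF-8, or "" when the real encoding is unknown
Function ToUTF8(s As String) As String
  If s.Encoding <> Nil Then
    // Encoding is already known: convert the bytes
    Return ConvertEncoding(s, Encodings.UTF8)
  End If
  
  // Encoding is Nil: only define UTF-8 if the bytes are at least valid UTF-8
  If Encodings.UTF8.IsValidData(s) Then
    Return DefineEncoding(s, Encodings.UTF8)
  End If
  
  // Not valid UTF-8: the caller has to find the real encoding first
  Return ""
End Function
```

Keep in mind that IsValidData is a necessary check, not a proof. Plain ASCII text stored as UTF-16 is full of zero bytes, and those are still technically valid UTF-8, so such data passes the check even though it is not really UTF-8.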
Ok, but, is there a way to get the encoding used in the document?
That’s for you to find out. Sometimes you just know, because it is a document you created yourself. Or you happen to know that a certain app always uses a certain encoding. Or you look for a BOM and check that. There may also be an explicit declaration of the encoding used, as in HTML or XML files.
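For the BOM case, a minimal sketch might look like the following. EncodingFromBOM is a hypothetical helper name; it only covers the UTF-8 and UTF-16 marks and assumes the data is plain text (a binary container format will not start with a BOM):

```xojo
// Hypothetical helper: guesses the encoding from a byte order mark, or returns Nil
Function EncodingFromBOM(data As String) As TextEncoding
  // Hex of the first three bytes; Xojo string comparison ignores case
  Dim head As String = EncodeHex(LeftB(data, 3))
  
  If head = "EFBBBF" Then Return Encodings.UTF8
  If Left(head, 4) = "FFFE" Then Return Encodings.UTF16LE
  If Left(head, 4) = "FEFF" Then Return Encodings.UTF16BE
  
  // No recognizable BOM: the encoding has to be determined some other way
  Return Nil
End Function
```

A UTF-32LE file also starts with FF FE, so this sketch would report it as UTF-16LE; add a four-byte check first if that case matters to you.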
Or I have to check all the encodings, right?
I think the problem is the encoding… I found that the hexadecimal values for the same string are different when I open the document than when I paste the string…
Thanks to all for now!
What are the string and the hex values? You can sometimes recognize the encoding based on that.
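If it helps, something like this would log both sides of the comparison as hex. FoundString and FirstLevelDomains are the names from the first post; comparing against element 0 is just an example:

```xojo
// Log the bytes of the scanned string and of one array entry, with spaces between bytes
System.DebugLog("found: " + EncodeHex(FoundString, True))
System.DebugLog("list:  " + EncodeHex(FirstLevelDomains(0), True))
```

For ".com", bytes 2E 63 6F 6D mean ASCII or UTF-8, while 2E 00 63 00 6F 00 6D 00 would point to UTF-16LE.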