Could you please help me with the following problem?
I receive a file in UTF-16 format from an external source. The file is a language file (English.lng). I am using Xojo 2015 r2.3 on Windows.
Please take a look at the following example code:
Using Xojo.Core
Using Xojo.IO
Dim f_Value As Xojo.IO.FolderItem
Dim f_AppData As Xojo.IO.FolderItem
Dim tisRead As TextInputStream
Dim LanguageFileType As New FileType
Dim textLanguageList(-1) As Text
Dim intPTR_Array As Integer
// Setup the LanguageFileType typeset
LanguageFileType.Name = "Language"
LanguageFileType.MacType = "LNG"
LanguageFileType.Extensions = "lng;LNG"
// Getting a language file
f_AppData = SpecialFolder.Documents.Child("IObit_Babylon_UserData")
f_Value = f_AppData.Child("English.lng")
If f_Value.Exists Then
  MsgBox "I do exist"
Else
  MsgBox "I do not exist"
End If
intPTR_Array = -1
// Opening the file as UTF16
tisRead = TextInputStream.Open(f_Value, TextEncoding.UTF16)
While Not tisRead.EOF
  intPTR_Array = intPTR_Array + 1
  ReDim textLanguageList(intPTR_Array)
  textLanguageList(intPTR_Array) = tisRead.ReadLine
  MsgBox textLanguageList(intPTR_Array)
Wend
tisRead.Close
MsgBox Str(UBound(textLanguageList))
When I read this file line by line and store the lines in the array textLanguageList, I get a diamond with a question mark after every line. In addition, between every two lines there is an extra line containing only that same diamond. The original file has no empty lines between the text. To give you an idea, the original file contains about 2175 lines, while the array contains 4354 lines.
I isolated the problem to the statement that reads a line (tisRead.ReadLine), so I think it is an encoding problem.
Do you have any idea what is going wrong? The code above is taken from the application itself. Can you tell me how I can solve this problem?
Thank you very much for the time spent on my problem; it is very much appreciated.
Friendly greetings,
Chris
PS: I had to re-create this thread because the first one was lost and did not show up.
I found out that when I do a ReadAll into a Text variable and then show the content of that variable in a TextArea, the whole file is intact, without those diamonds or empty lines.
I will try the ReplaceLineEndings function and tell you the results.
I just tried ReplaceLineEndings, but the result is still the same. I was hoping it would work because it would have been an easy fix, but it seems not to be that easy.
Set a breakpoint in Xojo and examine the hex values of the text in a property (click the Binary tab). See what hex values those characters represent. Then you can do a replace on those hex characters and eliminate them.
It’s definitely an encoding problem. I started seeing it when I used Chr(13) as a line ending instead of EndOfLine.
In the meantime I copied the text from the text field in Xojo into an empty document in TopStyle 5 (an HTML editor). Then I opened the original English language file.
There is no difference between the text coming from Xojo (with ReadAll) and the text coming from the original file. So ReadAll works fine, while ReadLine messes things up.
But I think we can be sure that the encoding is UTF-16. I will try your suggestion and see what happens.
I already did that with TopStyle, which is similar to Notepad++.
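Since ReadAll returns the file intact, one possible workaround is to read everything in one call and split the lines yourself. This is only a minimal sketch, assuming the new framework's Text.ReplaceAll, Text.Split and Text.FromUnicodeCodepoint behave as documented and that the file uses CR, LF or CRLF line endings. The names textAll, textLines, textCR and textLF are hypothetical and not part of the original code:

```xojo
// Locate the language file (same folder layout as in the example code)
Dim f_Value As Xojo.IO.FolderItem
f_Value = Xojo.IO.SpecialFolder.Documents.Child("MyApp_UserData").Child("English.lng")

// Line ending characters built from their code points
Dim textCR As Text = Text.FromUnicodeCodepoint(13)
Dim textLF As Text = Text.FromUnicodeCodepoint(10)

// Read the whole file in one call, decoded as UTF-16
Dim tisRead As Xojo.IO.TextInputStream
tisRead = Xojo.IO.TextInputStream.Open(f_Value, Xojo.Core.TextEncoding.UTF16)
Dim textAll As Text = tisRead.ReadAll
tisRead.Close

// Normalize CRLF and bare CR to LF, then split on LF
textAll = textAll.ReplaceAll(textCR + textLF, textLF)
textAll = textAll.ReplaceAll(textCR, textLF)
Dim textLines() As Text = textAll.Split(textLF)

// If the file ends with a line ending, the last element is an empty Text
MsgBox Str(textLines.Ubound + 1)
```

This avoids ReadLine entirely, so it sidesteps the problem rather than explaining it. For a file of a few thousand lines, holding the whole content in memory should not be an issue.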
I followed the debugging steps suggested by Jon and found out that the character that causes the diamond symbol has a hexadecimal value of FFFD and is shown in the debugger as an empty rectangle or square.
Now I have a question: how do I replace that character? I tried textRecord.Replace(&hFFFD, "") but that does not seem to work. Does anybody have an idea how I can replace it?
Thank you all again, I really do appreciate your efforts to help me.
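One thing that might be worth trying: &hFFFD is a numeric literal, not Text, so Replace has no character to search for. A sketch, assuming Text.FromUnicodeCodepoint and Text.ReplaceAll from the new framework work as documented; textBad is a hypothetical name and the sample line is made up for illustration:

```xojo
// Build a one-character Text holding U+FFFD (the Unicode replacement character)
Dim textBad As Text = Text.FromUnicodeCodepoint(&hFFFD)

// Sample data for illustration only: a line with the unwanted character at the end
Dim textRecord As Text = "Some line"
textRecord = textRecord + textBad

// Remove every occurrence of U+FFFD from the line
textRecord = textRecord.ReplaceAll(textBad, "")

MsgBox textRecord
```

A line that consisted only of U+FFFD becomes empty after this, so it can then be skipped with a check on textRecord.Empty.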
So now I have two cases:
An empty line containing just the hex bytes EFBFBD (the UTF-8 encoding of U+FFFD), which has to be removed.
A line ending with the hex bytes EFBFBD, where I have to remove the last character, which I can easily do with Split.
I hope I am on the right track, because I am working against a deadline that is coming close.
I found a way to remove those unwanted characters from the end of the lines and also from the empty lines.
I am posting the example code here so everybody in the same situation can use it.
Using Xojo.Core
Using Xojo.IO
Dim f_Value As Xojo.IO.FolderItem
Dim f_AppData As Xojo.IO.FolderItem
Dim tisRead As TextInputStream
Dim LanguageFileType As New FileType
Dim textRecord As Text
Dim textArraySplit(-1) As Text
Dim intPosSplit As Integer
// Setup the LanguageFileType typeset
LanguageFileType.Name = "Language"
LanguageFileType.MacType = "LNG"
LanguageFileType.Extensions = "lng;LNG"
// Getting a language file
f_AppData = SpecialFolder.Documents.Child("MyApp_UserData")
f_Value = f_AppData.Child("English.lng")
If f_Value.Exists Then
  MsgBox "I do exist"
Else
  MsgBox "I do not exist"
End If
// Opening the file as UTF16
tisRead = TextInputStream.Open(f_Value, TextEncoding.UTF16)
While Not tisRead.EOF
  textRecord = tisRead.ReadLine
  textArraySplit = textRecord.Split // Here we have every character inside the array
  // If the last character is U+FFFD (UTF-8 bytes EF BF BD), cut the line off just before it
  If textArraySplit(textArraySplit.Ubound) = DecodeHex("EFBFBD") Then
    intPosSplit = textRecord.IndexOf(DecodeHex("EFBFBD").ToText)
    textRecord = textRecord.Left(intPosSplit)
  End If
  MsgBox textRecord
Wend
tisRead.Close
I hope you will find it useful if you run into the same problem. It also shows some things about the new framework that were not easy to master in the beginning. I am still struggling with the new framework, although it is going much better.
But as an example for learning, it is easier to understand. That was the purpose of the example I made. If you watch Paul Lefebvre's webinars, he often does exactly the same because it makes things much clearer.
Thank you very much for your suggestion. I wish you all the best.
Glad I was able to help point you in the right direction. I have had similar problems with odd characters being read and causing issues. I regularly see things like this when I open a TCP/IP connection to the console login of managed network switches. For some reason the initial set of characters that gets dumped to the socket is garbage: high-value characters that you need to find using the Binary tab in the debugger.