Strange blank lines and diamonds when reading UTF16 file

Chris_Verberne · October 27, 2015, 4:14pm

Hello Everyone,

Can you please help me with the following problem?

I receive a file in UTF16 format from an external source. The file is a language file (English.lng). I am using Xojo 2015 r2.3 on Windows.

Please take a look at the following example code:

  
  Using Xojo.core
  Using Xojo.IO
  
  Dim f_Value As Xojo.IO.FolderItem
  Dim f_AppData As xojo.IO.FolderItem
  Dim tisRead As TextInputStream
  Dim LanguageFileType As New FileType
  Dim textLanguageList(-1) As Text
  Dim intPTR_Array As Integer
  
  // Setup the LanguageFileType typeset
  LanguageFileType.Name = "Language"
  LanguageFileType.MacType = "LNG"
  LanguageFileType.Extensions = "lng;LNG"
  
  // Getting a language file
  f_AppData = SpecialFolder.Documents.Child("IObit_Babylon_UserData")
  f_Value = f_AppData.Child("English.lng")
  
  If f_Value.Exists Then
    MsgBox "I do exists"
  Else
    MsgBox "I do not exists"
  end If
  
  intPTR_Array = -1
  // Opening the file as UTF16
  tisRead = TextInputStream.Open(f_Value, TextEncoding.UTF16)
  While Not tisRead.EOF
    intPTR_Array = intPTR_Array + 1
    ReDim textLanguageList(intPTR_Array)
    textLanguageList(intPTR_Array) = tisRead.ReadLine
    MsgBox textLanguageList(intPTR_Array)
  Wend
  tisRead.Close
  
  MsgBox Str(UBound(textLanguageList))

When I read this file line by line and store the lines in the array “textLanguageList”, I get a diamond with a question mark after every line. Between every two lines there is also an empty line containing the same diamond sign. The original file has no empty lines between the text. To give you an idea, the original file contains about 2175 lines, while the array contains 4354 lines.

I isolated the problem to the statement that reads a line (tisRead.ReadLine), so I think it is an encoding problem.

Do you have any idea what is going wrong? The code above is taken from the application itself. Can you tell me how I can solve this problem?

Thank you very much for the time spent on my problem; it is very much appreciated.

Friendly greetings,

Chris

PS: I had to re-create this thread because the first one was lost and did not show up.

Jon_Ogden · October 27, 2015, 4:24pm

It’s probably a line endings issue. I’ve seen this before.

Try running the ReplaceLineEndings function on your data and see if that cleans it up.

Chris_Verberne · October 27, 2015, 4:29pm

Thank you very much Jon.

I found out that when I do a ReadAll into a text variable and then show the content of that variable in a TextArea, the whole file is intact, without those diamonds or empty lines.

I will try the ReplaceLineEndings function and tell you the results.

Thank you again very much

Chris_Verberne · October 27, 2015, 4:38pm

I just tried out ReplaceLineEndings but the result is still the same. I was hoping it would work because it would have been an easy fix, but it seems it is not that easy.

Thank you again for your suggestion.

Jon_Ogden · October 27, 2015, 4:44pm

So here’s another thing to do…

Set a breakpoint in Xojo and examine the hex values of the text in a property (click the binary tab). See what hex values those characters represent. Then you can do a replace on those hex characters and eliminate them.

It’s definitely an encoding problem. I started seeing it when I would use Chr(13) as a line ending instead of EndOfLine.

Chris_Verberne · October 27, 2015, 4:53pm

Thank you again Jon for your reply.

In the meantime I copied the text from the text field in Xojo to an empty document in TopStyle 5 (an HTML editor). Then I opened the original English language file.

There is no difference between the text coming from Xojo (with ReadAll) and the text coming from the original file. So ReadAll works fine, while ReadLine messes things up.

But I think we can be sure that the encoding is UTF16. I will try your suggestion and see what happens.
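One possible explanation for the difference between ReadAll and ReadLine, sketched here in Python rather than Xojo: if a line reader splits the raw bytes at every single CR (0D) or LF (0A) byte instead of at two-byte UTF-16 code units, each line keeps a stray 00 byte, and another lone 00 byte is left between the CR and the LF. Decoding those leftovers gives the replacement character and roughly doubles the line count. This is only an assumption: the helper `split_on_line_ending_bytes` is hypothetical, the sample text is made up, big-endian byte order is assumed, and nothing here confirms how Xojo's ReadLine actually works.

```python
def split_on_line_ending_bytes(data: bytes) -> list[bytes]:
    """Hypothetical byte-oriented splitter: breaks at every CR or LF byte."""
    chunks, current = [], bytearray()
    for byte in data:
        if byte in (0x0D, 0x0A):
            chunks.append(bytes(current))
            current = bytearray()
        else:
            current.append(byte)
    if current:
        chunks.append(bytes(current))
    return chunks


# Made-up sample: two CRLF-terminated lines, encoded as big-endian UTF-16 (assumed).
raw = "Line1\r\nLine2\r\n".encode("utf-16-be")

# Splitting the bytes first and decoding each piece afterwards leaves odd-length chunks.
broken = [chunk.decode("utf-16-be", errors="replace")
          for chunk in split_on_line_ending_bytes(raw)]

# Decoding the whole buffer first and splitting the text afterwards is clean.
whole = raw.decode("utf-16-be").splitlines()

print(broken)  # every line ends in U+FFFD, with U+FFFD-only lines in between
print(whole)   # ['Line1', 'Line2']
```

In this assumed scenario two input lines turn into four, which is in the same proportion as the 2175 versus 4354 lines reported above.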

Julio_Debroy · October 27, 2015, 5:05pm

No doubt it is an encoding problem. You could use a tool like Notepad++ to determine the encoding and to convert the text.

Chris_Verberne · October 27, 2015, 5:16pm

Thank you Julio for your reply.

I already did that with TopStyle, which is like Notepad++.

I followed the debugging steps suggested by Jon and found out that the character which causes the diamond symbol has a hexadecimal value of FFFD and is shown in the debugger as an empty rectangle or square.

Now I have a question: how do I replace that hexadecimal character? I tried textRecord.Replace(&hFFFD, "") but that does not seem to work. Does anybody have an idea how I can replace that character?

Thank you all again, I really do appreciate your efforts to help me.

Chris_Verberne · October 27, 2015, 5:28pm

I have to correct my previous post.

The value is not FFFD but EFBFBD. I found that out with the following code:

  Dim textRecord As Text
  Dim textValue(-1) As Text
  
  textRecord = tisRead.ReadLine
  textValue = textRecord.Split // one array element per character
  MsgBox EncodeHex(textValue(UBound(textValue))) // hex of the last character

So now I have two cases:
An empty line containing just the hex character EFBFBD, which has to be removed.
A line ending with the hex character EFBFBD, where I have to remove the last character, which I can easily do with the “split”.

I hope I am on the right track because I am working against a deadline that is getting close.
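As a side note on the two hex values: FFFD and EFBFBD describe the same character. U+FFFD is the code point of the Unicode replacement character, and EF BF BD is that code point encoded as UTF-8. The short Python sketch below shows the relationship and one way to handle both cases above in a single step; the sample lines are made up, and whether EncodeHex in Xojo reports UTF-8 bytes here is an assumption based on the value seen.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character, usually drawn as a diamond with "?"

code_point = f"{ord(REPLACEMENT):04X}"             # the value a code point view would show
utf8_hex = REPLACEMENT.encode("utf-8").hex().upper()  # the same character as UTF-8 bytes

print(code_point)  # FFFD
print(utf8_hex)    # EFBFBD

# Made-up sample lines covering both cases: a trailing marker and a marker-only line.
lines = ["First entry\ufffd", "\ufffd", "Second entry\ufffd", "\ufffd"]

# Strip the marker everywhere, then drop lines that end up empty.
cleaned = [line.replace(REPLACEMENT, "") for line in lines]
cleaned = [line for line in cleaned if line]

print(cleaned)  # ['First entry', 'Second entry']
```

Removing the character hides the symptom only; it does not show why the reader produced it in the first place.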

Thank you again everybody!

Chris_Verberne · October 27, 2015, 6:01pm

Hello everybody again,

I found a way to remove those unwanted characters at the end of the lines and also from the empty lines.

I am posting the example code here so that everybody in the same situation can use it.

  Using Xojo.core
  Using Xojo.IO
  
  Dim f_Value As Xojo.IO.FolderItem
  Dim f_AppData As xojo.IO.FolderItem
  Dim tisRead As TextInputStream
  Dim LanguageFileType As New FileType
  Dim textRecord As Text
  Dim textArraySplit(-1) As Text
  Dim intPosSplit As Integer
  
  // Setup the LanguageFileType typeset
  LanguageFileType.Name = "Language"
  LanguageFileType.MacType = "LNG"
  LanguageFileType.Extensions = "lng;LNG"
  
  // Getting a language file
  f_AppData = SpecialFolder.Documents.Child("MyApp_UserData")
  f_Value = f_AppData.Child("English.lng")
  
  If f_Value.Exists Then
    MsgBox "I do exists"
  Else
    MsgBox "I do not exists"
  end If
  
  // Opening the file as UTF16
  tisRead = TextInputStream.Open(f_Value, TextEncoding.UTF16)
  While Not tisRead.EOF
    textRecord = tisRead.ReadLine
    textArraySplit = textRecord.Split // Here we have every character inside the array
    If textArraySplit(textArraySplit.Ubound) = DecodeHex("EFBFBD") Then
      intPosSplit = textRecord.IndexOf(DecodeHex("EFBFBD").ToText)
      textRecord = textRecord.Left(intPosSplit)
    End If
    MsgBox textRecord
  Wend
  tisRead.Close

I hope you will find it useful if you run into the same problem. It also shows some things about the new framework which were not easy to master in the beginning. I am still struggling with the new framework, although it is going much better.

Wish you all the very best.

Friendly greetings,

Chris

Michel_Bujardet · October 27, 2015, 6:39pm

It would make sense to simply read the whole text and do a split to get it into the array, instead of doing all sorts of things line by line.
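The read-everything-then-split idea, sketched in Python as a language-neutral illustration (not Xojo code). The function name `read_utf16_lines` and the sample file content are hypothetical, and the sketch assumes the file starts with a byte order mark so the decoder can pick the right byte order.

```python
import os
import tempfile


def read_utf16_lines(path: str) -> list[str]:
    """Read the whole file, decode it once as UTF-16, then split into lines."""
    with open(path, "rb") as handle:
        data = handle.read()
    # The "utf-16" codec consumes the BOM and uses it to choose the byte order.
    return data.decode("utf-16").splitlines()


# Build a small made-up language file with CRLF line endings for the demonstration.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "English.lng")
    with open(path, "w", encoding="utf-16", newline="") as handle:
        handle.write("Hello=Hello\r\nBye=Goodbye\r\n")
    lines = read_utf16_lines(path)

print(lines)  # ['Hello=Hello', 'Bye=Goodbye']
```

Because the bytes are decoded before any splitting happens, line endings are recognised as characters rather than as individual bytes, so no stray half code units are left over.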

Chris_Verberne · October 27, 2015, 8:08pm

Hello Michel,

Indeed, I can also do that in one line:

textRecord = textRecord.Left(textRecord.IndexOf(DecodeHex("EFBFBD").ToText))

But as an example for learning, the longer version is easier to understand. That was the purpose of the example I made. If you watch Paul Lefebvre's webinars, he often does exactly the same because it makes things much clearer.

Thank you very much for your suggestion. Wish you all the best.

Jon_Ogden · October 28, 2015, 3:03pm

Chris,

Glad I was able to help point you in the right direction. I have had similar problems with odd characters being read and causing issues. I regularly see stuff like this when I open a TCP/IP connection to the console login for managed network switches. For some reason the initial set of characters that get dumped to the socket are garbage: high-value characters that you need to find using the binary tab in the debugger.