Encodings again - Russian

Getting different languages into my program with the same code is probably where I have the most trouble.
Is there anything special I should be doing to import a Russian text file that would be different to the 6 other languages that I import?
The text file looks fine in BBEdit as UTF8 but when I import I get the mystery ? В Направле�

textInput = TextInputStream.Open(f)
textInput.Encoding = Encodings.UTF8

Your file looks like correct UTF-8 with some Russian text.
Maybe it’s your code?

Dim f As New FolderItem("/Users/cs/Downloads/Russian_Short.txt", FolderItem.PathModes.Native)
Dim t As TextInputStream = TextInputStream.Open(f)
Dim s As String = t.ReadLine

Break

shows me the correct Russian text in the debugger.


Thanks Christian for confirming. It did seem weird that the other 5 imports are working but not the Russian one. I am adding to a dictionary and must be messing up there.

I think it is tied to me encrypting and decrypting the file. Does that look right? If I don’t encrypt the file, it loads correctly, as you found above.

// Writing: encrypt each line before saving
var d as string
d = DefineEncoding(s, Encodings.UTF8)
tout.writeline encryptAES(d, "RUSH")

// Reading: decrypt each line after loading
s = decryptAES(rowFromFile, "RUSH") // data, key
rowFromFile = DefineEncoding(s, Encodings.UTF8)
// decryptAES(data, key), as called above
dim temp as string
dim pad as integer

aESpreviousCipherText = ""

temp = decrypt_AES128(data, key)

//padding
// temp holds raw bytes, so use the byte-wise functions here
pad=ascb(rightb(temp,1))
if pad<16 then
  temp=midb(temp,1,lenb(temp)-pad)
end if
//

return temp
// decrypt_AES128(data, key), as called above
dim d as string
dim temp as string
dim toDecrypt as string
dim output as string
dim t1, t2, t as integer
dim s as string
dim i, n,j as integer

d=decodeBase64MBS(data)

n=Ceil(lenB(d)/16)

for i=1 to n
  
  toDecrypt = midb(d, 1+16*(i-1), 16) // byte-wise, to match lenB above
  
  temp=decrypt_AES128_16bytes(toDecrypt, key )
  
  
  //In the cipher-block chaining (CBC) mode, each block of plaintext
  // is XORed with the previous ciphertext block before being encrypted.
  if i>1 then
    
    s=""
    for j=1 to 16
      t1=ascb(midb(temp,j,1))
      t2=ascb(midb(AESpreviousCipherText,j,1))
      T=bitwise.bitXor(t1,t2) //XOR
      s=s+CHRb(t)
    next
    temp=s
    
  end if
  //-----------------------------------------------------------------------------------
  
  
  aESpreviousCipherText = toDecrypt
  output=output+temp
next

return output

After quite a few hours of retesting the other language files, Russian is the only one that doesn’t encrypt and decrypt correctly. Are there special rules for Russian?

They use Cyrillic characters; besides that…

Cyrillic and Cyrillic supplement are part of Unicode and thus can be encoded as UTF-8. So what’s the problem?
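
One thing that does differ for Russian: in UTF-8 every Cyrillic letter takes two bytes, while plain Latin letters take one. A minimal sketch of what that means for the character-based and byte-based string functions follows. The sample strings are made up for illustration, and whether this matters inside encryptAES is only an assumption, since that code is not shown in the thread.

```xojo
// Sketch: character count vs. byte count for UTF-8 text.
// String literals in Xojo source are UTF-8.
Var latin As String = "Hello"
Var russian As String = "Привет"

System.DebugLog(Str(Len(latin)) + " chars, " + Str(LenB(latin)) + " bytes")     // 5 chars, 5 bytes
System.DebugLog(Str(Len(russian)) + " chars, " + Str(LenB(russian)) + " bytes") // 6 chars, 12 bytes

// Mid counts characters, MidB counts bytes.
Var fourChars As String = Mid(russian, 1, 4)  // "Прив", which is 8 bytes
Var fourBytes As String = MidB(russian, 1, 4) // "Пр", which is 4 bytes
```

So any routine that cuts text into 16-byte AES blocks with Mid and Len instead of MidB and LenB would get 16-byte blocks for mostly Latin text but up to 32-byte blocks for Russian.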

I have the text file sample above, Russian_Short.txt, and I can read that into the program fine. But when I convert it to an encrypted file and save it as Russian_Short_Encrypt.txt for distribution with my app, it doesn’t show the text properly when I read that file back in and decrypt it.

Are you using writeline() for binary content?

Yes, I was converting 2 arrays line by line

var d as string 
d = DefineEncoding(s,Encodings.UTF8)
tout.writeline encryptAES(d, "RUSH")

That will break your contents. You should convert binary data to a safe set of characters before writing it “as lines”; the easy way is converting it to Base64. Your encrypted bytes could, for example, contain control bytes like 0x0a in the middle, which would split your lines and cause several other side effects.
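
Here is a rough sketch of that Base64 step using Xojo's built-in EncodeBase64 and DecodeBase64. The bytes in "cipher" are made up and only stand in for whatever encryptAES returns, and the file name is hypothetical. Note that the decrypt routine posted earlier already calls decodeBase64MBS, so encryptAES may already return Base64; in that case the thing to check is whether that Base64 is line-wrapped.

```xojo
// Sketch only: "cipher" stands in for the output of encryptAES.
// It deliberately contains a line feed byte (&h0A).
Var cipher As String = ChrB(&h8F) + ChrB(&h0A) + ChrB(&hD0) + ChrB(&h9F)

// Base64 output uses only A-Z, a-z, 0-9, "+", "/" and "=".
// Passing 0 turns off line wrapping (the default wraps at 76 characters).
Var safeLine As String = EncodeBase64(cipher, 0)

// Hypothetical file name, used only for this round trip.
Var f As FolderItem = SpecialFolder.Temporary.Child("row_test.txt")

Var tout As TextOutputStream = TextOutputStream.Create(f)
tout.WriteLine(safeLine)
tout.Close

Var tin As TextInputStream = TextInputStream.Open(f)
Var restored As String = DecodeBase64(tin.ReadLine)
tin.Close

// restored now holds the same four bytes as cipher,
// so it can be handed to the decrypt routine.
```

The 0 for the line-wrap parameter matters here: wrapped Base64 would itself span several lines for long rows, and ReadLine would then return only part of a row.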


Is there another way to protect my data files?

Stuff them into an encrypted SQLite database.
Then you could have a table with all the texts in fields.
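
A minimal sketch of that idea with Xojo's SQLiteDatabase class. The file name, table layout, key and sample text are all made up for illustration; only the class and its EncryptionKey property come from Xojo itself.

```xojo
// Sketch: store the texts in an encrypted SQLite file.
Var db As New SQLiteDatabase
db.DatabaseFile = SpecialFolder.ApplicationData.Child("texts.sqlite") // hypothetical location
db.EncryptionKey = "replace-with-your-key" // hypothetical key, set before creating or connecting

Try
  // Creates the file, or opens it if it already exists.
  db.CreateDatabase

  db.ExecuteSQL("CREATE TABLE IF NOT EXISTS texts (lang TEXT, id INTEGER, body TEXT)")
  db.ExecuteSQL("INSERT INTO texts (lang, id, body) VALUES (?, ?, ?)", "ru", 1, "Привет")

  // Read every Russian row back.
  Var rows As RowSet = db.SelectSQL("SELECT body FROM texts WHERE lang = ?", "ru")
  While Not rows.AfterLastRow
    System.DebugLog(rows.Column("body").StringValue)
    rows.MoveToNextRow
  Wend
  rows.Close
Catch e As DatabaseException
  System.DebugLog("Database error: " + e.Message)
End Try
```

The text goes in and comes out as an ordinary UTF-8 string, so there is no per-line encrypt, Base64 or padding step to get wrong.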


Thought I might be heading that way. It’s a bit more of a rewrite, but probably worth it. Thanks.

Obfuscate the password used on it too. Any hacker will still find your password, but at least someone who just opens your executable in a text editor won’t.


The Zaz is your friend: https://thezaz.com/code/obfuscate


Thanks David, that solves one of my problems :)

Using that code, just change the name of the function from ObfuscatedString() to something more cryptic, such as “ban()” (for bananas(), not banning, banner, etc.), and add comments to your code. That name (Obfuscate…) is a honeypot for someone looking for sensitive data, and Xojo usually exposes all your symbols.
