Reading data from a file that contains text and binary data on each line?

John_Fatte · December 11, 2023, 6:58pm

I need to read some data from a file that contains mostly string data on each line, but each line includes a social security number which is encrypted into binary data. The problem is, I need to read the data after the encryption and the encrypted data (SSN) takes up a different length for each line of data in the file, so I can’t rely on the data after the SSN being in the same location for each employee.

I have the file layout and all the data before the social security number is in the same location, but after the social security number, the location of the data varies.

Any suggestions on how to approach this issue?

Gilles_Plante · December 11, 2023, 7:25pm

My guess is that there is a way to determine where the encoded SSN ends on the line from the data on the line. If not, no one will be able to read and decrypt the SSN.

For sure you need to read each line as binary data. Do you know where the SSN starts and which encryption was used?

Kem_Tekinay · December 11, 2023, 7:39pm

You’re in for a world of hurt. Any way to change the file format so the encrypted data is encoded either as hex or Base64?

John_Fatte · December 11, 2023, 8:04pm

Kem, you are correct. I am hurting with this one because everything I’ve tried has failed to work. Oddly, I’m converting this from a Visual FoxPro application that DID work.

Jeff_Tullin · December 11, 2023, 8:05pm

Sample of the file?
Do you have the code of the FoxPro app?

John_Fatte · December 11, 2023, 8:06pm

I know where the binary data starts, that’s a constant. But the only way I know to get past the binary data is to read each successive character until I get back to an ASCII character, then continue reading from there.

Rick_Araujo · December 11, 2023, 8:08pm

Are you dealing with fixed-size records? Can you share a sample? You can use a hex/binary editor to cut a sample with a few records and even change any confidential content before sharing, like “Mary Jane” → “Mary XXXX”.

Christian_Wheel · December 11, 2023, 8:49pm

Having had to reverse-engineer a few binary formats for use in Xojo just in the last few months, I can confirm you are in for a world of hurt. A hex editor will be your best friend, and I found the Chilkat BinData and FileAccess plugins to offer a few beneficial shortcuts over using Xojo MemoryBlocks (which are still necessary in some cases).

“But the only way I know to get past the binary data is to read each successive character until I get back to an ASCII character, then continue reading from there.”

The catch with this approach is there’s no way to tell if the ASCII character is part of the binary data or not. The program that wrote these files must have some way of determining the binary length.
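To make the catch concrete, here is a minimal Python sketch (not Xojo, and not based on the real file). The record and its "encrypted" bytes are made up for illustration, and `first_printable_ascii` is a hypothetical helper name. It shows a scan for the first printable ASCII byte stopping inside the binary field, because one of the ciphertext bytes happens to be a printable character.

```python
def first_printable_ascii(data: bytes, start: int) -> int:
    """Return the index of the first printable ASCII byte at or after `start`, or -1."""
    for i in range(start, len(data)):
        if 0x20 <= data[i] <= 0x7E:
            return i
    return -1


# Hypothetical record: 4 text bytes, 6 made-up "encrypted" bytes, then text again.
# The byte 0x4B inside the binary field happens to be the letter "K".
record = b"PNEM" + bytes([0xC2, 0x83, 0x4B, 0x97, 0xB0, 0xB0]) + b"K1 0000"

guess = first_printable_ascii(record, 4)  # where the naive scan thinks the text resumes
true_end = 4 + 6                          # where the binary field really ends
print(guess, true_end)
```

In this made-up case the scan reports the text resuming four bytes too early, so every field after it would be misaligned.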

Often the binary data has a 2- or 4-byte header containing the length of the record. Pull that apart in a hex editor and see if you can find any binary numeric data, checking both big-endian and little-endian interpretations for values that correspond to the length of the binary data as you can see it.
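As a rough sketch of that check in Python (an illustration only; whether this file has a length header at all is an assumption, and `length_candidates` is a hypothetical helper), you can decode the bytes at a suspected header offset as 2- and 4-byte unsigned integers in both byte orders and compare the results with the length you see in the hex editor:

```python
import struct


def length_candidates(record: bytes, offset: int) -> dict:
    """Decode the bytes at `offset` as 2- and 4-byte unsigned integers in both byte orders."""
    formats = (
        ("u16_le", "<H"),  # 2 bytes, little-endian
        ("u16_be", ">H"),  # 2 bytes, big-endian
        ("u32_le", "<I"),  # 4 bytes, little-endian
        ("u32_be", ">I"),  # 4 bytes, big-endian
    )
    out = {}
    for label, fmt in formats:
        # Skip interpretations that would read past the end of the record.
        if offset + struct.calcsize(fmt) <= len(record):
            out[label] = struct.unpack_from(fmt, record, offset)[0]
    return out


# Made-up example: text, a suspected header (00 12 00 00), then 18 filler bytes.
sample = b"ABCD" + bytes([0x00, 0x12, 0x00, 0x00]) + b"x" * 18
candidates = length_candidates(sample, 4)
print(candidates)
```

Here only the 2-byte big-endian reading (18) matches the 18 bytes that follow, which is the kind of match to look for across several records.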

John_Fatte · December 11, 2023, 8:56pm

Data sample:

00014420141108000100000000000000000013000000000000000144201411140001346484PNEM 000096188+000068120+000068120+¬ƒ—’≈”‘””ƒƒƒ∞∞∞∞∞∞∞∞∞K1 00004809+000086079+00000000005635+

Christian_Wheel · December 12, 2023, 1:36am

To be honest, seeing the data in this form is not really helpful, as the forum software changes the formatting.

Can you upload a binary file containing a record or two with the surrounding data? You might need to use a hex editor to change the ASCII data and protect the privacy of the individuals involved.

Rick_Araujo · December 12, 2023, 1:58am

Ideally about 3 records; a single record is useless for analyzing structural consistency.

Tim_Hare · December 12, 2023, 2:24am

Reverse engineering a data structure is so much fun!

An aside: make sure you’re working with the B versions of the string functions, so you’re dealing with bytes and not characters.
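The same distinction shown in Python, purely as an illustration (the characters and bytes below are made up, not taken from the real file): counting characters and counting bytes give different answers as soon as the data is not plain ASCII, and field offsets in a binary layout are byte offsets.

```python
# Three characters as a text editor (or the forum) would display them.
text = "¬ƒ—"
char_count = len(text)                  # number of characters
byte_count = len(text.encode("utf-8"))  # number of bytes once encoded as UTF-8

# Working on raw bytes avoids the ambiguity: slices are byte offsets.
raw = bytes([0x50, 0x4E, 0xAC, 0x83, 0x4B])
field = raw[2:4]                        # exactly bytes 2 and 3, whatever they "mean"
print(char_count, byte_count, field)
```

Reading the file in binary mode (in Python, `open(path, "rb")`) keeps everything as bytes from the start.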

Rick_Araujo · December 12, 2023, 2:59am

No guarantees. But some are really simple, such as fixed-size records in one file and indexes in another file. If the records are variable-length, there must be control fields, and you guys may struggle to work out how to read the records correctly (I won’t have time for that). Some formats have separators or control characters, but a binary field in the middle of the record would be weird, because coincidences could occur and data bytes could be confused with control characters. That’s why I’m curious about these unusual records said to be made of “lines”.
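A quick way to test the fixed-size hypothesis, sketched in Python with made-up data (`split_fixed_records` is a hypothetical helper, and any candidate record length has to come from the file layout): cut the file into chunks of that length and see whether anything is left over.

```python
def split_fixed_records(data: bytes, record_len: int):
    """Split `data` into chunks of `record_len` bytes; also return the leftover byte count."""
    if record_len <= 0:
        raise ValueError("record_len must be positive")
    whole, leftover = divmod(len(data), record_len)
    records = [data[i * record_len:(i + 1) * record_len] for i in range(whole)]
    return records, leftover


# Made-up data: three 4-byte records plus 2 stray bytes.
records, leftover = split_fixed_records(b"AAAABBBBCCCCDD", 4)
print(records, leftover)
```

A nonzero leftover, or records whose known fields drift out of position, suggests the records are not fixed-size at that length.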

Andrew_Lambert · December 12, 2023, 3:10am

Are you sure about that? Length in characters is not the same as size in bytes; it’s not something you can see in a text editor.

The size wouldn’t need to be given if it’s always going to be the same, and SSNs are always the same size. So, unless the encryption is randomly adding padding, I see no reason why it should be variable-length.