Optimal Way to Count a Large Amount of Repeating Data

Jon_Ogden · May 12, 2016, 2:28pm

Hey guys,

I am trying to figure out the best way to count the number of repeats of a fairly lengthy set of data received over an RS-232 connection. The data consists of the following:

| backspace - backspace / backspace \ backspace

It’s a VT-100 character sequence that looks somewhat like a flashing star in a terminal output. I know that I need to count a total of 16,256 of these sequences. So here’s what I have been doing in the DataAvailable event of the serial port object:

1.) Using a RegEx, I check the buffer of the serial port for the pattern.
2.) If the pattern is there, I count the number of occurrences using the CountFields method.

My code is below

    dim rx as new RegEx
    rx.SearchPattern = "(?mi-Us)(\|\x08-\x08/\x08\\\x08)+"
    dim rxOptions as RegExOptions = rx.Options
    rxOptions.LineEndType = 4
    
    dim match as RegExMatch = rx.Search( me.LookAhead(Encodings.ASCII))
    
    If match <> Nil Then
      Dim matches as integer = match.subexpressionstring(0).CountFields("|"+&u08+"-"+&u08+"/"+&u08+"\"+&u08)-1
      UpdateProgressBar(matches)
    End If

The problem I have seen (and only on Windows) is that sometimes CountFields does not seem to update and return the correct number of fields. Either that, or match.subexpressionstring(0) is not returning the full string. The value of the matches variable ends up not changing. But I know the data is coming in, because there’s other code in the DataAvailable event that detects when this sequence is finished and then moves on to the next processing step. I’m just not always getting an accurate count of my fields.

So I’m wondering if there is a better way to do this. The RS-232 buffer gets very long - hundreds of thousands of characters.

And how efficient is CountFields for counting 16,000-some fields?

But it doesn’t happen all the time. I just added some debug log statements to see what I was getting, and it worked just fine. I need to do several more runs to see if I can narrow down what is happening.

In OS X it seems to work fine and I never have an issue.

Kem_Tekinay · May 12, 2016, 2:41pm

You are using Unicode code points to check bytes. &u8 corresponds to byte value 8 only in UTF-8. Instead, you should be checking the bytes directly. Try this:

    dim bs as string = ChrB( 8 )
    Dim matches as integer = match.subexpressionstring( 0 ).CountFieldsB( "|" + bs + "-" + bs + "/" + bs + "\" + bs ) - 1

BTW, I’m not sure why you need the regex at all here. Can’t you use CountFieldsB alone?

Jon_Ogden · May 12, 2016, 8:36pm

[quote=265536:@Kem Tekinay]You are using Unicode code points to check bytes. &u8 corresponds to byte value 8 only in UTF-8. Instead, you should be checking the bytes directly. Try this:

    dim bs as string = ChrB( 8 )
    Dim matches as integer = match.subexpressionstring( 0 ).CountFieldsB( "|" + bs + "-" + bs + "/" + bs + "\" + bs ) - 1

BTW, I’m not sure why you need the regex at all here. Can’t you use CountFieldsB alone?[/quote]

Thanks Kem. I thought about needing the RegEx this morning. I use other RegEx’s at various points throughout the DataAvailable event but yeah, I might not need it here since the pattern detection is being handled by CountFields.

Now that you say it, I get why I would want to check the bytes directly. Good point. In my case I should always be reading ASCII data, so &u08 and ChrB(8) are really the same - no? But would using ChrB(8) as opposed to &u08 solve the issue of why it sometimes appears that CountFields is not returning the correct value? I would expect it then to never work. But whether it works or not seems to be random.

Kem_Tekinay · May 12, 2016, 9:28pm

I can’t say, but no, &u8 is not the same as ChrB( 8 ). The former represents a Unicode code point and expects to be represented in a string according to that string’s encoding. With UTF-8 encoding, it will be represented by a single byte, \x08, but in UTF-16 it could be \x0008 or \x0800.

But you are working with binary data, not text, so you should use the appropriate functions, ChrB and CountFieldsB.
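
A minimal sketch of that difference, assuming that the &u08 literal is stored as UTF-8 text (the default for Xojo string literals) and that the result is written to the debug log. The variable names are only for illustration:

```xojo
// Compare how many bytes each form of "backspace" occupies.
Dim fromLiteral As String = &u08   // code point U+0008 held as text (assumed UTF-8)
Dim asUTF16 As String = ConvertEncoding(fromLiteral, Encodings.UTF16)
Dim rawByte As String = ChrB(8)    // one byte with value 8, no encoding attached

// LenB reports the size in bytes, not characters.
System.DebugLog("&u08 as UTF-8:  " + Str(LenB(fromLiteral)) + " byte(s)")
System.DebugLog("&u08 as UTF-16: " + Str(LenB(asUTF16)) + " byte(s)")
System.DebugLog("ChrB(8):        " + Str(LenB(rawByte)) + " byte(s)")
```

If the byte counts differ between the text forms, a byte-wise search built from one form will not find the other, which is why the byte functions are the safer choice for serial data.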

Norman_P · May 12, 2016, 9:30pm

Or the data is incomplete and the closing bytes for the sequence haven’t arrived yet.

LangueR · May 13, 2016, 12:32am

For a piece of test equipment we use, I created a serial monitor that needed to count many instances (to determine failure rates). We would leave the monitor running over the weekend, and I ended up saving a log file and then running a post-process once the capture completed (all within the same program). The program also monitored the failure rate in real time, and I found that I would miss some of the info only on the real-time monitor; the post-process log was always spot on. So I left the real-time monitor for immediate feedback, but the final count only happened after the post-process completed. This worked well with really long sequences (like 1 very long string per second, continuously, for 2-3 days straight). Just a thought.

Brian_O_Brien · May 13, 2016, 1:21am

Correct me if I’m wrong, but if you are receiving the data in real time, why not just parse it one character at a time?
Isn’t there a data available event for the rs232 stream?

Jon_Ogden · May 13, 2016, 1:43pm

[quote=265695:@Brian O’Brien]Correct me if I’m wrong, but if you are receiving the data in real time, why not just parse it one character at a time?
Isn’t there a data available event for the rs232 stream?[/quote]

Yes, but data doesn’t come in one character at a time. It comes in chunks. And I’m looking for a specific set of characters - in this case the sequence that I defined above. All this is taking place in the DataAvailable event…

Jon_Ogden · May 13, 2016, 1:44pm

But that’s just it - it does, and it eventually finishes correctly. The whole process I am monitoring takes about 5 minutes. If my field counting “freezes”, I can quit my app, view everything in a terminal program, and see the data coming through just fine. That’s why I am scratching my head. I am going to try Kem’s suggestion this morning.

Kem_Tekinay · May 13, 2016, 1:51pm

I think Norman is suggesting that the arriving data may be split in the middle of your sequence. You should buffer the last n-1 characters each time through, then prefix the current data with the buffer before looking for your sequence. (n is the length of your sequence.)
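
A rough sketch of that carry-over idea, written for the DataAvailable event. It assumes two hypothetical properties on the serial object, mCarry As String and mTotal As Integer, and it consumes the data with ReadAll rather than LookAhead, so any other code in the event that relies on LookAhead would need adjusting:

```xojo
// Hypothetical properties (not from the original code):
//   mCarry As String   - last n-1 bytes of the previous chunk
//   mTotal As Integer  - running count of complete sequences

Dim bs As String = ChrB(8)
Dim pattern As String = "|" + bs + "-" + bs + "/" + bs + "\" + bs

// Read the new bytes with no encoding and prefix the saved tail,
// so a sequence split across two events is seen whole.
Dim chunk As String = mCarry + Me.ReadAll

// CountFieldsB returns the number of fields, so occurrences = fields - 1.
mTotal = mTotal + chunk.CountFieldsB(pattern) - 1

// Keep only the last n-1 bytes. A complete sequence is n bytes long,
// so it cannot fit in the tail and is never counted twice.
mCarry = RightB(chunk, LenB(pattern) - 1)

UpdateProgressBar(mTotal)
```

Because each event only scans the new bytes plus a short tail, the work per event stays small even when the total stream runs to hundreds of thousands of bytes.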

Jon_Ogden · May 13, 2016, 1:53pm

I understand. However, I’m not pulling the data from the RS-232 buffer using Read. I’m only using LookAhead. So I’ve got a built-in buffer. If it’s split in the middle, it should then get the additional characters the next time through. Right? Or am I missing something?

Kem_Tekinay · May 13, 2016, 1:59pm

No, I did.

Looking at your code again, don’t specify an encoding. The stream is a series of bytes, not text, right? Using ASCII might change some of the values.
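
Something along these lines, as a sketch only. It assumes LookAhead is called with no encoding argument so the buffered bytes come back unchanged, and it drops the RegEx in favor of CountFieldsB alone:

```xojo
Dim bs As String = ChrB(8)
Dim pattern As String = "|" + bs + "-" + bs + "/" + bs + "\" + bs

// No encoding passed: treat the buffer as raw bytes, not text.
Dim pending As String = Me.LookAhead

// Fields minus one gives the number of complete sequences so far.
Dim matches As Integer = pending.CountFieldsB(pattern) - 1
UpdateProgressBar(matches)
```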

Jon_Ogden · May 13, 2016, 2:06pm

Well, I guess you could always say the stream is a series of bytes because that’s what it is at the fundamental level.

But for the device I am connecting to, it is an ASCII console port. So this particular sequence is a VT-100 type sequence that creates the illusion of a rotating star or progress wheel, etc. It sends the | - / \ characters with a backspace after each.

I did just try your idea of eliminating the RegEx and using ChrB instead. It worked great. I need to run it several more times to see if I can get my percentage count to hang.