The WordCounter Example

Matthew_Gardner · July 29, 2013, 3:47am

Hi all - I know this example project from the Xojo blog is meant as an illustration of certain concepts (threading vs. multi-core processing), however I just couldn’t help running it to check the results on my development machine (iMac, 3.4GHz i7, 16GB 1333MHz DDR3) - I like stats:
Sequential: 33.62 seconds
Threaded: 34.10 seconds
Processes: 15.01 seconds
wc -w: initially 0.05 seconds on the first run, then 0.004 seconds on the second (with an accurate word count)

Norman_P · July 29, 2013, 4:07am

Not surprising as wc is written in C and has a lot less overhead to start up

Paul_Lefebvre · July 29, 2013, 3:22pm

Heh. Yeah, as I said in the blog post:

The WordCount method is just looping character by character over the entire text. That’s pretty inefficient.

If I change it to Split the text on a space first and then count the words, it only takes about 0.3 seconds to count the words in all four files. But that doesn’t make a good multiprocessing test.

Kem_Tekinay · July 29, 2013, 4:48pm

Here is a regular expression that does it too, and this probably runs at least as fast (or faster) than anything else described here, BUT (as Paul pointed out), that wasn’t the point of the post.

(?x)			# FREE SPACING MODE -- whitespace ignored
#SUBROUTINE
(?(DEFINE)(?'good_chars'[^[:punct:]\\s[:cntrl:]']+))
			# Defines good_chars as a run of characters that are 
			# not a punctuation, whitespace, or control character
# PATTERN
(?:(?&good_chars)['](?&good_chars))
			# Run of good characters, followed by a single quote,
			# then another run of good characters
|			# OR
(?&good_chars)
			# Just a run of good characters

Mike_D · July 29, 2013, 5:21pm

The core routine inside the WordCounter app is this:

  Dim count As Integer
  Dim wordLength As Integer
  
  Dim char As String
  Dim textLen As Integer = mText.Len
  For i As Integer = 0 To textLen
    char = mText.Mid(i, 1)
    
    If IsVowel(char) Or IsConsonant(char) Then
      wordLength = wordLength + 1
    Else
      If wordLength >= 1 Then count = count + 1
    End If
  Next
  
  Return count

This is very un-optimized, and also I think contains a bug : shouldn’t wordLength be reset to zero everytime Count is incremented?

Here’s my super-optimized version with that bug fixed:

    #pragma BackgroundTasks False
    #pragma BoundsChecking False
    #pragma NilObjectChecking False
    #pragma StackOverflowChecking False
    
    // new optimized method
    Dim count As Integer
    Dim wordLength As Integer
    
    dim mMB as MemoryBlock = mText
    Dim char As integer
    dim p as ptr = mMB
    Dim textLen As Integer = mMB.Size-1
    For i As Integer = 0 To textLen
      char = p.Byte(i)
      ' see if character is in ASCII A..Z or a..z range
      if (char >= 65 and char <= 90) or (char >= 97 and char <= 122) then
        wordLength = wordLength + 1
      else
        if wordLength >=1 then 
          count = count +1
          wordLength = 0 // reset word length to zero
        end if
      end if
    Next
    
    Return count

Comparison:

Old code: 10.01 seconds
New code : 1.00 seconds

Kem_Tekinay · July 29, 2013, 6:11pm

I haven’t looked at the project, so I don’t know how it does Paul’s code does it, but your version would count something like “stmp” as two words, wouldn’t it?

Also, optimizing it like this wouldn’t have shown off the technique of using multiple processors well, would it?

Norman_P · July 29, 2013, 6:18pm

Its intended to be a demo of “how to do things using multiple processes” not “how to do this optimally”

Perhaps the biggest take away is it’s possible and that having an algorithm that IS amenable to using multiple processes is key.

Mike_D · July 29, 2013, 6:22pm

Both versions are extremely limited, in that they only work with ASCII - both are really ‘demo’ quality at best. Since this was, in fact a demo, that’s perfectly appropriate.

I’m mostly responding to the comment above explaining that the “wc” tool was written in C which is why it’s so much faster. While true that optimized C is pretty much always going to beat optimized xojo code, you can get a lot closer without a lot of extra effort.

Mike_D · July 29, 2013, 6:23pm

[quote=23393:@Norman Palardy]Its intended to be a demo of “how to do things using multiple processes” not “how to do this optimally”
Perhaps the biggest take away is it’s possible and that having an algorithm that IS amenable to using multiple processes is key.[/quote]

I agree - I just love optimization so I can’t resist when I see code that could be faster. Apologies if this took the thread off-topic

Kem_Tekinay · July 29, 2013, 6:25pm

Great, now I’m going to have to look at this to see how the regex compares. Thanks a lot!

Norman_P · July 29, 2013, 6:27pm

Thats not a problem and given some tweaks I’d bet it can get to be pretty quick

I just want others to know that this is NOT intended to be an example of how to do this task optimally and IS just a demo of the technique of using helper apps running on multiple cores to accomplish a task.

Mike_D · July 29, 2013, 6:46pm

I found another subtle issue : the code used in the GUI version of “.Words” is not exactly the same as the code in the Command-line tool, so as written the timing comparison is not a fair one (the GUI code is actually faster, as written).

Here’s my re-test using the exact same, optimized, code in both GUI and command-line apps. I loaded all of the PrincessOfMars chapters and ran the three modes, taking the average of 3 runs.

Sequential : 0.07 seconds
Threaded : 1.03 seconds
Processes: 1.00 seconds

Odd results: I suspect that with optimization, the actual time spent counting the words is so short that all we are seeing is the results of overhead of launching threads & shells. A better test would be with larger files, I presume.

N.B: here’s the link to the original blog post:
http://www.xojo.com/blog/en/2013/07/take-advantage-of-your-multi-core-processor.php

Kem_Tekinay · July 29, 2013, 6:59pm

FYI, using a the regex pattern described above via RegExMBS (part of the MonkeyBread plugins) counts the words in all four files in 0.11 seconds in the IDE. The native RegEx is significantly slower.

Private Function CountWordsRegExMBS(s As String) As Integer
  dim rx as new RegExMBS
  dim rxSearchPattern as string = _
  "(?xmi-Us)            # FREE SPACING MODE -- whitespace ignored" + EndOfLine + _
  "#SUBROUTINE" + EndOfLine + _
  "(?(DEFINE)(?'good_chars'[^[:punct:]\\s[:cntrl:]']+))" + EndOfLine + _
  "            # Defines good_chars as a run of characters that are " + EndOfLine + _
  "            # not a punctuation, whitespace, or control character" + EndOfLine + _
  "# PATTERN" + EndOfLine + _
  "(?:(?&good_chars)['](?&good_chars))" + EndOfLine + _
  "            # Run of good characters, followed by a single quote," + EndOfLine + _
  "            # then another run of good characters" + EndOfLine + _
  "|            # OR" + EndOfLine + _
  "(?&good_chars)" + EndOfLine + _
  "            # Just a run of good characters"
  rx.CompileOptionNewLineLF = true
  
  dim rxErrCode as integer
  dim rxErrMessage as string
  if not ( rx.Compile( rxSearchPattern ) and rx.Study() ) then
    rxErrCode = rx.Lasterror
    rxErrMessage = rx.ErrorMessage
  end if
  
  dim cnt as integer
  dim rxSubExpressionCount as integer
  rxSubExpressionCount = rx.Execute( s, 0 )
  while rxSubExpressionCount > 0
    cnt = cnt + 1
    rxSubExpressionCount = rx.Execute( rx.Offset( 1 ) )
  wend
  
  return cnt
End Function

Note that this will lead to a different count than the other procedures mentioned, but that’s besides the point.

Paul_Lefebvre · July 29, 2013, 7:51pm

It’s supposed to be unoptimized.

But yeah, there is a bug with the Console app using an older version of WordCount than the GUI app. Doesn’t really affect the premise, though and I’ve fixed it for the version that will be in 2013r3. In a real project, it would probably make sense to shared the class between the two projects using an external item so that they don’t get out of sync.

It seems to me that Mid is the slow-poke here, which worked out well for this simple demo. Using Split instead makes the word count take about a 1/3 second in my tests.

Mike_D · July 29, 2013, 11:34pm

Mid of course has the advantage that it’s unicode-savvy, whereas my technique of using Ptr.byte(i) is not. That’s probably also why Mid is so slow - must Mid() traverse the entire string from the start, in order to deal with the possibility of variable-length unicode sequences?

Enric_Herrera · December 24, 2013, 2:17pm

This example doesn’t work for a file in other language than english.
Is there a way to know what encoding has a file on opening it?

Norman_P · December 24, 2013, 10:45pm

[quote=54909:@Enric Herrera]This example doesn’t work for a file in other language than english.
Is there a way to know what encoding has a file on opening it?[/quote]
No - you can guess but that guess might not be right.
Files are just streams of bytes and there are many encodings that can use the same stream of bytes and represent different “characters”. Single byte encodings are famous for this. This is what some refer to as “extended ascii”. Since all 256 bytes in a single byte encoding are valid in all other single byte encodings you really have to guess which one is intended. Text will render properly but any of the non-ASCII values will possibly be represented wrong.

Multi byte encodings like UTF-8 could be read & treated as though they are single byte encodings (although the results will likely be jumbled) UTF-16 and UTF-32 (actually UCS4) would behave similarly.

Some file formats write an encoding marker at the beginning of the file to aid in discerning what encoding was used.

Kem_Tekinay · December 24, 2013, 10:59pm

The M_String package at my web site includes code to determine a string’s encoding via analysis, if that helps.

http://www.mactechnologies.com/index.php?page=downloads