
Comparing line-counting speed between wc and Smalltalk

I am comparing the performance of counting how many lines a file contains.

I did it first using the wc command line tool:

$ time wc -l bigFile.csv
1673820 bigFile.csv

real    0m0.157s
user    0m0.124s
sys     0m0.062s

and then in a clean image of the latest Pharo Core Smalltalk, 1.3:

| file lineCount |
Smalltalk garbageCollect.
( Duration milliSeconds: [ file := FileStream readOnlyFileNamed: 'bigFile.csv'.
lineCount := 0.
[ file atEnd ] whileFalse: [
    file nextLine.
    lineCount := lineCount + 1 ].
file close.
lineCount. ] timeToRun ) asSeconds. 
15

How can I speed up the Smalltalk code so that it gets closer to wc's performance?

Juan Aguerre asked Nov 07 '11 18:11


2 Answers

[ (PipeableOSProcess waitForCommand: 'wc -l /path/to/bigfile2.csv') output ] timeToRun.

The above reports ~207 milliseconds, whereas the shell's time reported:

real    0m0.160s
user    0m0.131s
sys     0m0.029s

I'm kidding, but also serious. No need to reinvent the wheel. FFI, OSProcess, Zinc, etc. provide ample opportunity to utilize things like UNIX utilities that have been battle-tested over decades.

If your question was really more about Smalltalk itself, a start would be:

[ FileStream 
    readOnlyFileNamed: '/path/to/reallybigfile2.csv'
    do: [ :file | | count |
        count := 0.
        "Binary mode yields raw bytes and skips character decoding."
        file binary.
        "10 is the byte value of LF."
        file contents do: [ :c | c = 10 ifTrue: [ count := count + 1 ] ].
        count ]
] timeToRun.

That will get you down to 2.5 seconds:

  • making the stream binary saved ~10 seconds
  • readOnlyFileNamed:do: saved ~1 second
  • finding the line endings manually instead of using #nextLine saved ~4 seconds

A cleaner variant, at the cost of about half a second more, would be:

file contents occurrencesOf: 10.

Of course, if better performance is needed, and you don't want to use FFI/OSProcess, you would then write a plugin.
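If holding the whole file in memory is a concern, the same binary LF count can be done in chunks. The following is only a sketch, not something measured in this thread: the 65536-byte chunk size is an arbitrary assumption, and it relies on next: returning a shorter ByteArray for the final chunk when fewer bytes remain.

```smalltalk
"Sketch only: count LF bytes chunk by chunk instead of loading the whole file.
 The chunk size of 65536 bytes is an assumption, not a tuned value."
[ FileStream
    readOnlyFileNamed: '/path/to/reallybigfile2.csv'
    do: [ :file | | count |
        count := 0.
        file binary.
        [ file atEnd ] whileFalse: [
            "next: is assumed to return fewer bytes at the end of the file."
            count := count + ((file next: 65536) occurrencesOf: 10) ].
        count ]
] timeToRun.
```

Whether this is faster or slower than reading the contents in one go would have to be timed on the actual file; its main benefit is bounded memory use.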

Sean DeNigris answered Sep 22 '22 06:09


If you can afford to read the whole file into memory, then the simplest code is

[ FileStream 
    readOnlyFileNamed: '/path/to/reallybigfile2.csv'
    do: [ :file | file contents lineCount ]
] timeToRun.

This will handle the zoo of LF (Linux), CR (old Mac), and CR-LF (you name it) line endings. The code from Sean only handles LF, for approximately the same cost. I'd say a factor of 10 for Smalltalk vs. C is expected for such basic operations, so I doubt you will get much more efficiency without adding your own primitives.
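To make the difference between the line-ending conventions concrete, here is a small workspace sketch that counts line terminators by hand, treating LF, CR, and CR-LF each as a single terminator. It is an illustration only and not a description of how lineCount is implemented; the sample bytes and the variable names are made up for the example.

```smalltalk
"Sketch only: count line terminators in a byte sequence,
 treating LF (10), CR (13) and CR-LF each as one terminator."
| bytes count previousWasCR |
bytes := #[97 10 98 13 10 99 13 100].   "a LF b CR LF c CR d"
count := 0.
previousWasCR := false.
bytes do: [ :b |
    b = 13
        ifTrue: [
            "A CR always ends a line."
            count := count + 1.
            previousWasCR := true ]
        ifFalse: [
            "An LF ends a line only if it is not the second half of CR-LF."
            (b = 10 and: [ previousWasCR not ])
                ifTrue: [ count := count + 1 ].
            previousWasCR := false ] ].
count   "3 terminators for the sample bytes above"
```

Note that this counts terminators, as wc -l does with LF, so a final line without a terminator (the trailing d here) is not counted.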

aka.nice answered Sep 21 '22 06:09