
Running "wc -l <filename>" within Python Code

Tags: python

I want to do 10-fold cross-validation on huge files (hundreds of thousands of lines each). Each time I start reading a file, I want to run "wc -l" on it, then generate random numbers a fixed number of times, each time writing that line number to a separate file. I am using this:

import os
for i in files:
    os.system("wc -l <insert filename>")

How do I insert the file name there? It's a variable. I went through the documentation, but the examples mostly use ls commands, which don't have this problem.

crazyaboutliv asked Jun 29 '11 12:06


3 Answers

Let's compare:

from subprocess import check_output

def wc(filename):
    return int(check_output(["wc", "-l", filename]).split()[0])

def native(filename):
    c = 0
    with open(filename) as file:
        while True:
            chunk = file.read(10 ** 7)
            if chunk == "":
                return c
            c += chunk.count("\n")

def iterate(filename):
    # Start at 0 so an empty file returns 0 instead of raising an error.
    count = 0
    with open(filename) as file:
        for count, line in enumerate(file, 1):
            pass
    return count

Go go timeit function!

from timeit import timeit
from sys import argv

filename = argv[1]

def testwc():
    wc(filename)

def testnative():
    native(filename)

def testiterate():
    iterate(filename)

print("wc", timeit(testwc, number=10))
print("native", timeit(testnative, number=10))
print("iterate", timeit(testiterate, number=10))

Result:

wc 1.25185894966
native 2.47028398514
iterate 2.40715694427

So wc is about twice as fast on the 150 MB compressed file with ~500 000 line breaks that I tested on. However, on a file generated with seq 3000000 > bigfile, I get these numbers:

wc 0.425990104675
native 0.400163888931
iterate 3.10369205475

Hey look, Python FTW! However, with longer lines (~70 chars):

wc 1.60881590843
native 3.24313092232
iterate 4.92839002609

Conclusion: it depends, but wc seems to be the best bet all round.
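A variant of the chunked count that may be worth trying reads the file as bytes rather than text, so nothing has to be decoded. This is only a sketch and was not part of the timings above; the name count_newlines_binary and the 1 MiB chunk size are assumptions chosen for illustration.

```python
import os
import tempfile


def count_newlines_binary(filename, chunk_size=1 << 20):
    # Count newline bytes, reading chunk_size bytes at a time.
    count = 0
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                return count
            count += chunk.count(b"\n")


# Usage: a throwaway three-line file.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("a\nb\nc\n")
print(count_newlines_binary(tmp.name))  # prints 3
os.remove(tmp.name)
```

Like wc -l, this counts newline characters, so a final line without a trailing newline is not counted.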

Lauritz V. Thaulow answered Dec 03 '22 23:12


import subprocess
for f in files:
    subprocess.call(['wc', '-l', f])

Also have a look at http://docs.python.org/library/subprocess.html#convenience-functions. For example, if you want to capture the output in a string, use subprocess.check_output() instead of subprocess.call(), which only returns the exit status.
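As a rough sketch of the check_output() route, the line count can be parsed out of what wc prints. The helper name count_lines_wc is made up for this example, and it assumes a wc executable is available on the PATH. On Python 3 check_output() returns bytes, which int() accepts once the first field is split off.

```python
import os
import subprocess
import tempfile


def count_lines_wc(filename):
    # Passing the arguments as a list avoids shell quoting problems
    # with spaces or special characters in the file name.
    output = subprocess.check_output(["wc", "-l", filename])
    # wc prints "<count> <filename>"; the count is the first field.
    return int(output.split()[0])


# Usage: a throwaway three-line file.
with tempfile.NamedTemporaryFile("w", delete=False) as tmp:
    tmp.write("a\nb\nc\n")
print(count_lines_wc(tmp.name))
os.remove(tmp.name)
```

Because the file name is passed as its own list element, it can simply be the loop variable from the question, with no string formatting needed.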

ThiefMaster answered Dec 04 '22 00:12


Here is a Python approach I found to solve this problem:

with open('any_textFile.txt') as f:
    count_of_lines_in_any_textFile = sum(1 for _ in f)
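Once the line count is known, the sampling step described in the question could look roughly like this. It is a sketch under assumptions: the function names, the seed parameter, and the one-number-per-line output format are all hypothetical, and it draws line numbers without repeats, which may or may not be what the cross-validation setup needs.

```python
import random


def sample_line_numbers(total_lines, how_many, seed=None):
    # Pick how_many distinct 1-based line numbers out of total_lines.
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total_lines + 1), how_many))


def write_line_numbers(line_numbers, out_path):
    # Write one line number per line to out_path.
    with open(out_path, "w") as out:
        for n in line_numbers:
            out.write(f"{n}\n")


# Usage: choose 10 of 100 lines (100 stands in for the real count).
picks = sample_line_numbers(100, 10, seed=0)
print(len(picks))  # prints 10
```

Calling write_line_numbers(picks, path) once per input file, with a different path each time, would give the separate output files the question asks for.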
user6316035 answered Dec 04 '22 00:12