 

Read a small random sample from a big CSV file into a Python data frame

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

asked Mar 07 '14 by P.Escondido


People also ask

Can Python read large CSV files?

Yes. One way to process a large file is to read it in chunks of a reasonable size with read_csv(chunksize=...): each chunk is loaded into memory and processed before the next one is read. The chunksize parameter specifies the number of lines per chunk.
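For instance, a running statistic can be accumulated one chunk at a time. A minimal sketch, assuming a file named data.txt with a header row and a numeric column called value (both names are illustrative):

import pandas as pd

total = 0
count = 0
for chunk in pd.read_csv("data.txt", chunksize=100000):  # read 100,000 rows at a time
    total += chunk["value"].sum()
    count += len(chunk)

print(total / count)  # mean of "value" without loading the whole file into memory

The same pattern can also be used to collect a random sample chunk by chunk, e.g. by calling chunk.sample(frac=...) on each chunk and concatenating the results.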


1 Answer

Assuming no header in the CSV file:

import pandas
import random

n = 1000000  # number of records in file
s = 10000    # desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n), n - s))
df = pandas.read_csv(filename, skiprows=skip, header=None)  # header=None since the file has no header row

It would be better if read_csv had a keeprows parameter, or if skiprows took a callback function instead of a list.
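Newer pandas versions do accept a callable for skiprows, which avoids building the skip list entirely. A minimal sketch, assuming the same illustrative data.txt with a header row; it keeps each data row with probability p, so the sample size is approximate rather than exactly 10K:

import random
import pandas

p = 0.01  # keep roughly 1% of the data rows
df = pandas.read_csv(
    "data.txt",
    skiprows=lambda i: i > 0 and random.random() > p,  # never skip row 0 (the header)
)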

With header and unknown file length:

import pandas
import random

filename = "data.txt"
n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
s = 10000  # desired sample size
skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)
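Once the sample is loaded, the simple statistics the question asks about can be run directly on the ~10K-row frame; describe() below is just one option (column names depend on your file):

print(df.describe())               # count, mean, std, min, quartiles, max per numeric column
print(df.mean(numeric_only=True))  # per-column means of the sampled rows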
answered Oct 07 '22 by dlm