The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?
read_csv(chunksize)

One way to process a large file is to read it in chunks of reasonable size: each chunk is read into memory and processed before the next one is read. The chunksize parameter of read_csv specifies the number of lines per chunk.
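A minimal sketch of that approach, assuming pandas is installed and a file "data.txt" with a header row (the file name, chunk size, and sample fraction are illustrative): a fraction of each chunk is sampled before the chunk is discarded, and the pieces are concatenated into one DataFrame.

import pandas

chunksize = 100000   # rows read into memory at a time
frac = 0.01          # fraction of each chunk to keep (~1% sample)

samples = []
for chunk in pandas.read_csv("data.txt", chunksize=chunksize):
    # keep a random fraction of this chunk before moving on to the next one
    samples.append(chunk.sample(frac=frac))

df = pandas.concat(samples, ignore_index=True)
print(df.describe())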
Assuming no header in the CSV file:
import pandas
import random

n = 1000000   # number of records in file
s = 10000     # desired sample size
filename = "data.txt"

skip = sorted(random.sample(range(n), n - s))  # rows to skip: all but s randomly chosen rows
df = pandas.read_csv(filename, skiprows=skip, header=None)  # header=None since the file has no header row
It would be better if read_csv had a keep_rows parameter, or if skiprows took a callable instead of a list.
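Recent pandas versions do accept a callable for skiprows. A minimal sketch assuming such a version and the same headerless file, with an illustrative keep probability: the skip list never has to be built in memory, at the cost of getting an approximate sample size rather than exactly s rows.

import pandas
import random

keep_prob = 0.01  # approximate fraction of rows to keep

# each row index is passed to the callable; returning True skips that row
df = pandas.read_csv("data.txt", header=None,
                     skiprows=lambda i: random.random() > keep_prob)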
With header and unknown file length:
import pandas
import random

filename = "data.txt"
n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
s = 10000  # desired sample size

skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header row will not be in the skip list
df = pandas.read_csv(filename, skiprows=skip)
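Once the sample is loaded, the simple statistics asked about in the question are just ordinary DataFrame operations on df, for example:

print(df.describe())               # count, mean, std, min, quartiles, max per numeric column
print(df.mean(numeric_only=True))  # per-column means of the numeric columns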