
Python generator to read large CSV file

I need to write a Python generator that yields tuples (X, Y) coming from two different CSV files.

It should receive a batch size on init, read the two CSVs line by line, and yield a tuple (X, Y) for each line, where X and Y are arrays (the columns of the CSV files).

I've looked at examples of lazy reading but I'm finding it difficult to convert them for CSVs:

  • Lazy Method for Reading Big File in Python?
  • Read large text files in Python, line by line without loading it in to memory

Also, unfortunately Pandas Dataframes are not an option in this case.

Any snippet I can start from?

Thanks

d.grassi84 asked Jul 26 '16 08:07



1 Answer

You can write a generator that reads rows from two different csv readers and yields them as pairs of arrays. The code for that is:

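import csv
import numpy as np

def getData(filename1, filename2):
    # On Python 3, csv.reader needs files opened in text mode;
    # newline="" is what the csv module documentation recommends.
    with open(filename1, newline="") as csv1, open(filename2, newline="") as csv2:
        reader1 = csv.reader(csv1)
        reader2 = csv.reader(csv2)
        # zip pulls one row from each reader at a time and stops at the shorter file.
        for row1, row2 in zip(reader1, reader2):
            # dtype=float gives arrays of floats (the np.float alias was removed
            # in NumPy 1.24); for other types change dtype.
            yield (np.array(row1, dtype=float),
                   np.array(row2, dtype=float))

for tup in getData("file1", "file2"):
    print(tup)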
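
The question also asks for a batch size, which the generator above does not take. One way to add it is sketched here, assuming Python 3, purely numeric CSVs without header rows, and that a shorter final batch is acceptable. The name get_batches and its batch_size parameter are made up for illustration.

```python
import csv
from itertools import islice

import numpy as np


def get_batches(filename1, filename2, batch_size):
    """Yield (X, Y) pairs of 2-D float arrays with up to batch_size rows each.

    get_batches and batch_size are hypothetical names for this sketch.
    """
    with open(filename1, newline="") as csv1, open(filename2, newline="") as csv2:
        # zip pairs the rows lazily, so only one batch is in memory at a time.
        pairs = zip(csv.reader(csv1), csv.reader(csv2))
        while True:
            # Take the next batch_size row pairs (fewer at the end of the files).
            batch = list(islice(pairs, batch_size))
            if not batch:
                break
            # Split the list of (row1, row2) pairs back into two groups of rows.
            rows1, rows2 = zip(*batch)
            yield np.array(rows1, dtype=float), np.array(rows2, dtype=float)
```

It is used the same way as getData, for example `for X, Y in get_batches("file1", "file2", 32): ...`. Note that zip stops silently at the end of the shorter file, so rows left over in the longer file are never yielded.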
jotasi answered Oct 28 '22 11:10