 

Skip rows with missing values in read_csv

Tags:

python

pandas

I have a very large CSV which I need to read in. To make this fast and keep RAM usage low, I am using read_csv and setting the dtype of some columns to np.uint32. The problem is that some rows have missing values, and pandas uses a float to represent those.
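For reference, a minimal sketch of what I am doing now (the file name and column names are just placeholders):

import numpy as np
import pandas as pd

# placeholder path and columns; the real file has many more columns
df = pd.read_csv("big.csv", dtype={"col_a": np.uint32, "col_b": np.uint32})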

  1. Is it possible to simply skip rows with missing values? I know I could do this after reading in the whole file but this means I couldn't set the dtype until then and so would use too much RAM.
  2. Is it possible to convert missing values to some other value of my choosing during the reading of the data?
asked Aug 07 '16 by graffe


2 Answers

It would be handy if you could fill NaN with, say, 0 during the read itself. Perhaps a feature request on pandas' GitHub is in order...

Using a converter function

However, for the time being, you can define your own function to do that and pass it to the converters argument in read_csv:

def conv(val):
    # converters receive each field as a raw string; a missing value arrives as ''
    if not val:
        return 0  # or whatever else you want to represent your NaN with
    return int(val)

df = pd.read_csv(file, converters={colWithNaN: conv}, dtype=...)

Note that converters takes a dict, so you need to specify it for each column that has NaN to be dealt with. It can get a little tiresome if a lot of columns are affected. You can specify either column names or numbers as keys.
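If many columns need the same treatment, you don't have to write the dict out by hand; a dict comprehension over the affected columns works (a sketch; the column list is hypothetical and conv is the function defined above):

import pandas as pd

# hypothetical list of the columns that may contain blanks
na_cols = ["col_a", "col_b", "col_c"]

df = pd.read_csv(file, converters={col: conv for col in na_cols})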

Also note that this might slow down your read_csv performance, since converter functions are applied to each value in plain Python. Further, if you just have one column that needs NaNs handled during the read, you can skip a proper function definition and use a lambda function instead:

df = pd.read_csv(file, converters={colWithNaN: lambda x: int(x) if x else 0}, dtype=...)

Reading in chunks

You could also read the file in small chunks that you stitch together to get your final output. You can do a bunch of things this way. Here is an illustrative example:

result = pd.DataFrame()
df = pd.read_csv(file, chunksize=1000)
for chunk in df:
    chunk = chunk.dropna(axis=0)  # drop all rows with any NaN value
    chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
    result = pd.concat([result, chunk], ignore_index=True)
del df, chunk

Note that this method does not duplicate the whole dataset in memory. There is a moment when the data in chunk exists twice, right after the pd.concat call, but only chunksize rows are repeated at a time, which is a fair bargain. This method may also work out to be faster than using a converter function.
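If the repeated concatenation turns out to dominate the run time, a variation (just a sketch; file and colToConvert are the same placeholders as above) is to collect the cleaned chunks in a list and concatenate once at the end, trading a bit more peak memory for fewer copies:

import numpy as np
import pandas as pd

chunks = []
for chunk in pd.read_csv(file, chunksize=1000):
    chunk = chunk.dropna()  # drop rows with any missing value
    chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
    chunks.append(chunk)

result = pd.concat(chunks, ignore_index=True)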

answered Oct 23 '22 by Kartik


There is no feature in Pandas that does that. You can implement it in regular Python like this:

import csv
import pandas as pd

def filter_records(records):
    """Given an iterable of dicts, converts values to int.
    Discards any record which has an empty field."""

    for record in records:
        for k, v in record.items():
            if v == '':
                break
            record[k] = int(v)
        else: # runs only if the inner loop did not break
            yield record

with open('t.csv') as infile:
    records = csv.DictReader(infile)
    df = pd.DataFrame.from_records(filter_records(records))

Pandas uses the csv module internally anyway. If the performance of the above turns out to be a problem, you could probably speed it up with Cython (which Pandas also uses).
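If you still want the compact dtype the question asks for, you can downcast after the frame is built (a sketch, assuming every surviving column holds small non-negative integers):

import numpy as np

df = df.astype(np.uint32)  # assumes all remaining values fit in an unsigned 32-bit integer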

answered Oct 23 '22 by John Zwinck