 

Impose required columns constraint in pandas read_csv

I am looking to read a large CSV into a DataFrame with the additional constraint that I want to fail early if certain columns are missing (since the input is not as expected), but I do want all the columns, not just the required ones, included in the DataFrame. In pandas.read_csv it seems I can use the usecols argument to specify a subset of columns to read in, but the only obvious way I can see to check which columns a DataFrame will have before reading it is to actually read the file.

I've made a working first-pass version that reads the file as an iterator, gets the first chunk, checks that the required columns exist, then reads the file again with the normal arguments:

import pandas as pd
from io import StringIO

class MissingColumnsError(ValueError):
    pass

def cols_enforced_reader(*args, cols_must_exist=None, **kwargs):
    # Separate the file/buffer argument so it can be used for both reads
    if len(args):
        filepath_or_buffer = args[0]
        args = args[1:]
    else:
        filepath_or_buffer = kwargs.pop('filepath_or_buffer', None)

    if cols_must_exist is not None:
        # Read just the first chunk of the DataFrame and check the columns
        new_kwargs = kwargs.copy()
        new_kwargs['iterator'] = True
        new_kwargs['chunksize'] = 1

        df_iterator = pd.read_csv(filepath_or_buffer, *args, **new_kwargs)

        c = next(df_iterator)
        if not all(col in c.columns for col in cols_must_exist):
            raise MissingColumnsError('Some required columns were missing!')

        # Rewind the buffer (if it can be rewound) before the full read
        seek = getattr(filepath_or_buffer, 'seek', None)
        if seek is not None and filepath_or_buffer.seekable():
            filepath_or_buffer.seek(0)

    return pd.read_csv(filepath_or_buffer, *args, **kwargs)

in_csv = """col1,col2,col3\n0,1,2\n3,4,5\n6,7,8"""

# Should succeed
df = cols_enforced_reader(StringIO(in_csv), cols_must_exist=['col1'])
print('First call succeeded as expected.')

# Should fail
try:
    df = cols_enforced_reader(StringIO(in_csv), cols_must_exist=['col7'])
except MissingColumnsError:
    print('Second call failed as expected.')

This feels a bit messy to me and doesn't really handle all possible inputs for filepath_or_buffer (non-seekable streams, for example, or buffers where I'm not supposed to start at 0). Obviously I can tweak what I have here to my specific use case for the moment and be done with it, but I'm wondering if there's a more elegant way to do this (preferably just using standard pandas functions) that works in general.

asked Dec 29 '25 by Paul


1 Answer

You could just read one row and test whether all the required columns are present in it. For example:

import pandas as pd

required_cols = ['col1', 'col2']

# Read only the first row to get the header, then check it
cols = pd.read_csv('input.csv', nrows=1).columns

if all(req in cols for req in required_cols):
    print(pd.read_csv('input.csv'))
else:
    print("Columns missing")

To do this via a stream, an alternative approach is to read it with a csv.reader(), which is compatible with itertools.tee():

import pandas as pd
from itertools import tee
import csv

required_cols = ['col1', 'col2']

with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    # Duplicate the iterator so the header can be peeked at without losing rows
    csv_stream1, csv_stream2 = tee(csv_input, 2)
    header = next(csv_stream1)

    if all(req in header for req in required_cols):
        # csv_stream2 still starts at the header row, so skip it with [1:]
        df = pd.DataFrame(list(csv_stream2)[1:], columns=header)
        print(df)
    else:
        print("Columns missing")
answered Jan 01 '26 by Martin Evans