I am looking to read a large CSV into a DataFrame with the additional constraint that I want to fail early if certain columns are missing (since the input is not as expected), but I do want all the columns, not just the required ones, to be included in the DataFrame. In pandas.read_csv, it seems that I can use the usecols argument if I want to specify the subset of columns to read in, but the only obvious way I can see to check which columns a file contains is to actually read it.
I've made a working first-pass version that reads the file as an iterator, gets the first chunk, checks that the required columns exist, then reads the file again with the normal arguments:
import pandas as pd
from io import StringIO

class MissingColumnsError(ValueError):
    pass

def cols_enforced_reader(*args, cols_must_exist=None, **kwargs):
    # Pull the file path/buffer out of the positional or keyword arguments
    if len(args):
        filepath_or_buffer = args[0]
        args = args[1:]
    else:
        filepath_or_buffer = kwargs.pop('filepath_or_buffer', None)
    if cols_must_exist is not None:
        # Read the first chunk of the file and check the columns
        new_kwargs = kwargs.copy()
        new_kwargs['iterator'] = True
        new_kwargs['chunksize'] = 1
        df_iterator = pd.read_csv(filepath_or_buffer, *args, **new_kwargs)
        c = next(df_iterator)
        if not all(col in c.columns for col in cols_must_exist):
            raise MissingColumnsError('Some required columns were missing!')
        # Rewind the buffer (when possible) so the real read starts at the top
        seek = getattr(filepath_or_buffer, 'seek', None)
        if seek is not None and filepath_or_buffer.seekable():
            filepath_or_buffer.seek(0)
    return pd.read_csv(filepath_or_buffer, *args, **kwargs)

in_csv = "col1,col2,col3\n0,1,2\n3,4,5\n6,7,8"

# Should succeed
df = cols_enforced_reader(StringIO(in_csv), cols_must_exist=['col1'])
print('First call succeeded as expected.')

# Should fail
try:
    df = cols_enforced_reader(StringIO(in_csv), cols_must_exist=['col7'])
except MissingColumnsError:
    print('Second call failed as expected.')
This feels a bit messy to me and doesn't really handle all possible inputs for filepath_or_buffer (non-seekable streams, for example, or buffers where I'm not supposed to start at 0). Obviously I can tweak what I have here to my specific use case for the moment and be done with it, but I'm wondering if there's a more elegant way to do this (preferably just using standard pandas functions) that works in general.
You could just read one row and test whether all the required columns are present in it. For example:

import pandas as pd

required_cols = ['col1', 'col2']
cols = pd.read_csv('input.csv', nrows=1).columns

if all(req in cols for req in required_cols):
    print(pd.read_csv('input.csv'))
else:
    print("Columns missing")
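Building on the same two-read idea: with nrows=0, pandas parses only the header line, so the pre-check is cheap even for a very large file. A sketch of wrapping the check into an early-failing reader (read_csv_checked is a hypothetical helper name, and MissingColumnsError is reused from the question; this version assumes a path or seekable input, since the file is opened twice):

```python
import os
import tempfile

import pandas as pd

class MissingColumnsError(ValueError):
    pass

def read_csv_checked(filepath, required_cols):
    # nrows=0 parses only the header, so this pre-check is cheap
    cols = pd.read_csv(filepath, nrows=0).columns
    missing = [c for c in required_cols if c not in cols]
    if missing:
        raise MissingColumnsError(f'Missing columns: {missing}')
    # Second read pulls in the full file with all columns
    return pd.read_csv(filepath)

# Demo with a temporary file standing in for 'input.csv'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write("col1,col2,col3\n0,1,2\n3,4,5\n")
    path = f.name

df = read_csv_checked(path, ['col1', 'col2'])
print(df.shape)

try:
    read_csv_checked(path, ['col7'])
except MissingColumnsError as e:
    print('Pre-check failed as expected:', e)

os.remove(path)
```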
To do this via a stream, an alternative approach is to read it with csv.reader(), which is compatible with itertools.tee():
import pandas as pd
from itertools import tee
import csv

required_cols = ['col1', 'col2']

with open('input.csv') as f_input:
    csv_input = csv.reader(f_input)
    csv_stream1, csv_stream2 = tee(csv_input, 2)
    header = next(csv_stream1)

    if all(req in header for req in required_cols):
        df = pd.DataFrame(list(csv_stream2)[1:], columns=header)
        print(df)
    else:
        print("Columns missing")
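For the non-seekable streams mentioned in the question, a variation on the same idea is to consume only the header line yourself and then hand the rest of the stream straight to pandas, passing the parsed header via names= so no rewind is needed. This also keeps read_csv's dtype inference, which the DataFrame-from-rows approach above loses (every value comes out as a string). A sketch, assuming a text-mode stream; read_csv_from_stream is a hypothetical helper name:

```python
import csv
import io

import pandas as pd

def read_csv_from_stream(stream, required_cols):
    # Consume just the header line from the (possibly non-seekable) stream
    header = next(csv.reader(stream))
    missing = [c for c in required_cols if c not in header]
    if missing:
        raise ValueError(f'Missing columns: {missing}')
    # The stream is now positioned at the first data row; passing the
    # parsed header via names= means pandas never needs to rewind
    return pd.read_csv(stream, names=header)

stream = io.StringIO("col1,col2,col3\n0,1,2\n3,4,5\n")
df = read_csv_from_stream(stream, ['col1', 'col2'])
print(df)
```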