I have an input file with known columns, let's say two columns, Name and Sex. Sometimes it has the header line Name,Sex, and sometimes it doesn't:
1.csv:
Name,Sex
John,M
Leslie,F
2.csv:
John,M
Leslie,F
Knowing the identity of the columns beforehand, is there a nice way to handle both cases with the same read_csv command? Basically, I want to specify names=['Name', 'Sex'] and then have it infer header=0 only when the header is there. The best I can come up with is:
1) Read the first line of the file before doing read_csv, and set parameters appropriately.
2) Just do df = pd.read_csv(input_file, names=['Name', 'Sex']), then check whether the zeroth row is identical to the header, and if so drop it (and then maybe have to renumber the rows).
But this doesn't seem like that unusual of a use case to me. Is there a built-in way of doing this with read_csv that I haven't thought of?
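Option 1 above could look roughly like the following. This is only a sketch: read_known_columns is a made-up helper name, and it assumes that the header, when present, matches the known column names exactly.

```python
import csv
import io

import pandas as pd

COLS = ["Name", "Sex"]


def read_known_columns(source, cols):
    """Read CSV from a seekable text handle whose header row may be missing.

    Hypothetical helper: it assumes a header, if present, is exactly `cols`.
    """
    # Peek at the first row only, then rewind so read_csv sees the whole input.
    first_row = next(csv.reader(source), None)
    source.seek(0)
    # header=0 makes read_csv replace the existing header row with `names`;
    # header=None treats every row as data.
    header = 0 if first_row == list(cols) else None
    return pd.read_csv(source, names=cols, header=header)


# In-memory stand-ins for 1.csv and 2.csv from the question.
with_header = read_known_columns(io.StringIO("Name,Sex\nJohn,M\nLeslie,F\n"), COLS)
without_header = read_known_columns(io.StringIO("John,M\nLeslie,F\n"), COLS)
```

For a file on disk, the same helper should work with a handle from open(path, newline=""), since it only needs iteration and seek(0).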
csv.Sniffer().has_header(csv_test_bytes)  # Check to see if there's a header in the file.
dialect = csv.Sniffer().
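For context, here is a small sketch of how the standard library's csv.Sniffer can be asked about a header. has_header is a heuristic vote over column types and string lengths, so it can guess wrong, especially on short or all-text samples. looks_like_header is a hypothetical wrapper name, not part of any library.

```python
import csv


def looks_like_header(sample_text):
    """Return the Sniffer's guess about whether the first row is a header.

    Hypothetical wrapper: treats an un-sniffable sample as "no header".
    """
    try:
        return csv.Sniffer().has_header(sample_text)
    except csv.Error:
        # Raised when the Sniffer cannot work out the delimiter.
        return False


sample_with = "Name,Sex\nJohn,M\nLeslie,F\n"
sample_without = "John,M\nLeslie,F\n"

guess_with = looks_like_header(sample_with)
guess_without = looks_like_header(sample_without)
```

The guess could then be turned into header=0 or header=None for read_csv. When the column names are already known, comparing the first row against them directly is the more predictable check.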
To read a CSV file without a header, set the header parameter to None in the read_csv() method.
How do I remove a header from a DataFrame in Python? When writing it out, pass header=False, and pass index=False to drop the index as well.
Pandas to CSV without header: to write a DataFrame to CSV without the column header (removing the column names), use the header=False parameter of the to_csv() method.
Using a newer feature, selection by callable:
import numpy as np
import pandas as pd

cols = ['Name', 'Sex']
# Keep every row unless the first row repeats the column names.
df = (pd.read_csv(filename, header=None, names=cols)
      [lambda x: np.ones(len(x)).astype(bool)
                 if (x.iloc[0] != cols).all()
                 else np.concatenate([[False], np.ones(len(x)-1).astype(bool)])]
     )
Using the .query() method:
df = (pd.read_csv(filename, header=None, names=cols)
        .query('Name != "Name" and Sex != "Sex"'))
I'm not sure that this is the most elegant way, but this should work as well:
df = pd.read_csv(filename, header=None, names=cols)
if (df.iloc[0] == cols).all():
    df = df[1:].reset_index(drop=True)
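Wrapped up as a reusable function, the check above might look like this. It is a minimal sketch: read_csv_optional_header is a made-up name, and the in-memory samples stand in for 1.csv and 2.csv.

```python
import io

import pandas as pd


def read_csv_optional_header(source, cols):
    """Read a CSV with known columns, dropping a leading header row if present.

    Hypothetical helper built on the read-then-compare approach.
    """
    # Always read the first row as data, under the known column names.
    df = pd.read_csv(source, header=None, names=cols)
    # Only touch row 0 if the frame actually has rows.
    if not df.empty and (df.iloc[0] == cols).all():
        # The first row repeats the column names: drop it and renumber.
        df = df.iloc[1:].reset_index(drop=True)
    return df


cols = ["Name", "Sex"]
from_1_csv = read_csv_optional_header(io.StringIO("Name,Sex\nJohn,M\nLeslie,F\n"), cols)
from_2_csv = read_csv_optional_header(io.StringIO("John,M\nLeslie,F\n"), cols)
```

One thing to watch for: because the header row is first read as data, a column that should be numeric comes back as strings when the header is present, so a conversion such as pd.to_numeric may be needed afterwards.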
I've come up with a way of detecting the header without prior knowledge of its names:
if any(df.iloc[0].apply(lambda x: isinstance(x, str))):
    df = df[1:].reset_index(drop=True)
And by changing it slightly, it can update the current header with the detected one:
if any(df.iloc[0].apply(lambda x: isinstance(x, str))):
    df = df[1:].reset_index(drop=True).rename(columns=df.iloc[0])
This would allow easily selecting the desired behavior:
update_header = True
if any(df.iloc[0].apply(lambda x: isinstance(x, str))):
    new_header = df.iloc[0]
    df = df[1:].reset_index(drop=True)
    if update_header:
        df.rename(columns=new_header, inplace=True)
Pros:
No prior knowledge of the header names is required.
The detected header can optionally replace the current one.
Cons:
The check misfires whenever the first row of data contains a string.
Changing if any() to require all elements to be strings might help, unless the data also contains entire rows of strings.
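The stricter variant mentioned in the cons, where every element of the first row must be a string, could be sketched as follows. drop_detected_header and the Name/Age sample data are made up for illustration.

```python
import io

import pandas as pd


def drop_detected_header(df, update_header=False):
    """Drop row 0 if every value in it is a string (hypothetical helper)."""
    if df.empty:
        return df
    first_row = df.iloc[0]
    # Stricter than any(): one non-string value means "not a header".
    if not all(isinstance(value, str) for value in first_row):
        return df
    body = df.iloc[1:].reset_index(drop=True)
    if update_header:
        body.columns = first_row.tolist()
    return body


# Header present: both columns are read as strings, so row 0 is all-string.
mixed_with_header = pd.read_csv(io.StringIO("Name,Age\nJohn,31\nLeslie,27\n"), header=None)
# Header absent: the second column is parsed as integers, so row 0 is mixed.
mixed_without_header = pd.read_csv(io.StringIO("John,31\nLeslie,27\n"), header=None)
# All-text data without a header: row 0 is all-string and gets dropped by mistake.
text_without_header = pd.read_csv(io.StringIO("John,M\nLeslie,F\n"), header=None)

detected = drop_detected_header(mixed_with_header, update_header=True)
untouched = drop_detected_header(mixed_without_header)
misfire = drop_detected_header(text_without_header)
```

With any(), the second sample would also lose its first data row. With all(), only the all-text sample still misfires. Note too that after a detected header is dropped, the remaining values are still strings (the Age column here holds "31" and "27"), so numeric columns need converting.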