I have just started delving into the world of Pandas, and the first strange CSV file I've found is one where there are two lines of comments (with different column widths) right at the beginning.
sometext, sometext2
moretext, moretext1, moretext2
*header*
actual data ---
---------------
I know how to skip these lines with skiprows or header=, but, instead, how would I retain these comments while using read_csv? Sometimes comments are necessary as file meta information, and I do not want to throw them away.
Pandas is designed to read structured data.
For unstructured data, just use the built-in open:
import csv

with open('file.csv') as f:
    reader = csv.reader(f)
    row1 = next(reader)  # first comment line, as a list of fields
    row2 = next(reader)  # second comment line, as a list of fields
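As a variation on the snippet above, here is a minimal sketch that also trims the blank after each comma, so the fields come back as 'sometext2' instead of ' sometext2'. The in-memory `sample` is a hypothetical stand-in for the real file, and `skipinitialspace=True` is a standard `csv.reader` option; whether you want the whitespace stripped is an assumption about your data.

```python
import csv
import io
from itertools import islice

# Hypothetical stand-in for open('file.csv'); contents mirror the question's sample.
sample = io.StringIO(
    "sometext, sometext2\n"
    "moretext, moretext1, moretext2\n"
    "*header*\n"
    "actual data\n"
)

# skipinitialspace=True drops the blank that follows each comma.
reader = csv.reader(sample, skipinitialspace=True)

# Take the first two rows; csv.reader does not care that their widths differ.
comments = list(islice(reader, 2))
print(comments)
```

Because `islice` only consumes two rows, the reader is left positioned at the header line.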
You can attach strings to the dataframe like this:
df.comments = 'My Comments'
But there is a caveat:
While you can attach attributes to a DataFrame, operations performed on it (such as groupby, pivot, join or loc, to name just a few) may return a new DataFrame without the metadata attached. Pandas does not yet have a robust method of propagating metadata attached to DataFrames.
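Newer pandas releases (1.0 and later) also provide `DataFrame.attrs`, a plain dictionary intended for this kind of metadata. It is documented as experimental, so treat the following as a sketch, not a guarantee that the comments survive every operation. The DataFrame contents and the "comments" key are made up for illustration.

```python
import pandas as pd

# Illustrative frame; in practice this would come from read_csv.
df = pd.DataFrame({"*header*": ["actual data"]})

# attrs is an ordinary dict stored on the DataFrame (experimental API).
df.attrs["comments"] = [
    ["sometext", "sometext2"],
    ["moretext", "moretext1", "moretext2"],
]

print(df.attrs["comments"][0])
```

Unlike `df.comments = ...`, a key in `attrs` cannot be confused with a column name.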
You can read the metadata lines first and then pass the same open file to read_csv, which continues from the current position:
import pandas as pd

with open('f.csv') as file:
    # read the first 2 rows as metadata
    header = [file.readline() for _ in range(2)]
    meta = [value.strip().split(',') for value in header]
    print(meta)
    # [['sometext', ' sometext2'], ['moretext', ' moretext1', ' moretext2']]

    # the file position is now past the comments, so parsing starts at *header*
    df = pd.read_csv(file)
    print(df)
    #       *header*
    # 0  actual data
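The same idea can be wrapped in a small helper that returns both the comments and the DataFrame. This is a sketch under assumptions: `read_csv_with_comments` and `n_comment_lines` are hypothetical names, the comment lines are assumed to be simple comma-separated text without quoting, and the `io.StringIO` object stands in for a real file opened in text mode.

```python
import io

import pandas as pd


def read_csv_with_comments(handle, n_comment_lines=2):
    """Return (comments, DataFrame) from an open text-mode file handle."""
    # Consume the leading comment lines and split them into stripped fields.
    comments = [
        [field.strip() for field in handle.readline().split(",")]
        for _ in range(n_comment_lines)
    ]
    # read_csv continues from wherever the handle is now positioned.
    df = pd.read_csv(handle)
    return comments, df


# Hypothetical stand-in for open('f.csv').
sample = io.StringIO(
    "sometext, sometext2\n"
    "moretext, moretext1, moretext2\n"
    "*header*\n"
    "actual data\n"
)

comments, df = read_csv_with_comments(sample)
print(comments)
print(df)
```

Returning the comments separately keeps them safe even when later operations produce new DataFrames.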