I need to create a data frame by reading in data from a file, using read_csv
method. However, the separators are not very regular: some columns are separated by tabs (\t
), other are separated by spaces. Moreover, some columns can be separated by 2 or 3 or more spaces or even by a combination of spaces and tabs (for example 3 spaces, two tabs and then 1 space).
Is there a way to tell pandas to treat these files properly?
By the way, I do not have this problem if I use Python. I use:
for line in file(file_name): fld = line.split()
And it works perfect. It does not care if there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?
The default value of the sep parameter is the comma (,) which means if we don't specify the sep parameter in our read_csv() function, it is understood that our file is using comma as the delimiter.
pandas. read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, ....) It reads the content of a csv file at given path, then loads the content to a Dataframe and returns that. It uses comma (,) as default delimiter or separator while parsing a file.
The pandas DataFrame class supports serializing and de-serializing of CSV in an extenstive way through the read_csv() method. The read_csv() method of pandas DataFrame class reads a CSV file and loads each record as a row in the DataFrame.
Pandas provide predefine method “pandas. Series. str. replace()” to remove whitespace.
From the documentation, you can use either a regex or delim_whitespace
:
>>> import pandas as pd >>> for line in open("whitespace.csv"): ... print repr(line) ... 'a\t b\tc 1 2\n' 'd\t e\tf 3 4\n' >>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+") 0 1 2 3 4 0 a b c 1 2 1 d e f 3 4 >>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True) 0 1 2 3 4 0 a b c 1 2 1 d e f 3 4
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With