Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?

I need to create a data frame by reading in data from a file, using read_csv method. However, the separators are not very regular: some columns are separated by tabs (\t), other are separated by spaces. Moreover, some columns can be separated by 2 or 3 or more spaces or even by a combination of spaces and tabs (for example 3 spaces, two tabs and then 1 space).

Is there a way to tell pandas to treat these files properly?

By the way, I do not have this problem if I use Python. I use:

for line in file(file_name):    fld = line.split() 

And it works perfect. It does not care if there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?

like image 328
Roman Avatar asked Feb 22 '13 14:02

Roman


People also ask

What is the default separator in PD read_csv?

The default value of the sep parameter is the comma (,) which means if we don't specify the sep parameter in our read_csv() function, it is understood that our file is using comma as the delimiter.

How do you use the separator in pandas?

pandas. read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, ....) It reads the content of a csv file at given path, then loads the content to a Dataframe and returns that. It uses comma (,) as default delimiter or separator while parsing a file.

How read comma separated CSV file in pandas?

The pandas DataFrame class supports serializing and de-serializing of CSV in an extenstive way through the read_csv() method. The read_csv() method of pandas DataFrame class reads a CSV file and loads each record as a row in the DataFrame.

How do I get rid of white space pandas?

Pandas provide predefine method “pandas. Series. str. replace()” to remove whitespace.


1 Answers

From the documentation, you can use either a regex or delim_whitespace:

>>> import pandas as pd >>> for line in open("whitespace.csv"): ...     print repr(line) ...      'a\t  b\tc 1 2\n' 'd\t  e\tf 3 4\n' >>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")    0  1  2  3  4 0  a  b  c  1  2 1  d  e  f  3  4 >>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)    0  1  2  3  4 0  a  b  c  1  2 1  d  e  f  3  4 
like image 145
DSM Avatar answered Sep 23 '22 15:09

DSM