How to make separator in pandas read_csv more flexible wrt whitespace, for irregular separators?

Tags: python, pandas, dataframe, csv, whitespace

I need to create a data frame by reading in data from a file, using the read_csv method. However, the separators are not very regular: some columns are separated by tabs (\t), others are separated by spaces. Moreover, some columns can be separated by 2, 3, or more spaces, or even by a combination of spaces and tabs (for example, 3 spaces, two tabs, and then 1 space).

Is there a way to tell pandas to treat these files properly?

By the way, I do not have this problem if I use Python. I use:

    for line in open(file_name):
        fld = line.split()

And it works perfectly. It does not care whether there are 2 or 3 spaces between the fields. Even combinations of spaces and tabs do not cause any problem. Can pandas do the same?
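To make the behavior described above concrete, here is a minimal sketch of what split() does with no arguments. The sample line is made up for illustration and is not taken from the actual file.

```python
# Hypothetical sample line mixing tabs and runs of spaces.
line = "a\t  b\tc   1 \t\t 2\n"

# With no arguments, str.split() treats any run of whitespace as a single
# separator and ignores leading/trailing whitespace (including the newline).
fields = line.split()
print(fields)

# By contrast, splitting on one literal space produces empty strings
# wherever two spaces are adjacent.
naive = "a  b".split(" ")
print(naive)
```

This whitespace-run behavior is what the pandas call needs to reproduce.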

328

asked Feb 22 '13 14:02

Roman

1 Answer

From the documentation, you can use either a regex or delim_whitespace:

    >>> import pandas as pd
    >>> for line in open("whitespace.csv"):
    ...     print(repr(line))
    ...
    'a\t  b\tc 1 2\n'
    'd\t  e\tf 3 4\n'
    >>> pd.read_csv("whitespace.csv", header=None, delimiter=r"\s+")
       0  1  2  3  4
    0  a  b  c  1  2
    1  d  e  f  3  4
    >>> pd.read_csv("whitespace.csv", header=None, delim_whitespace=True)
       0  1  2  3  4
    0  a  b  c  1  2
    1  d  e  f  3  4
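For anyone who wants to try this without creating a file first, here is a self-contained sketch of the regex approach. It assumes the same two sample lines as above, fed through io.StringIO instead of whitespace.csv. It uses sep, of which delimiter is an alias. Note that pandas 2.2 and later deprecate delim_whitespace in favor of sep=r"\s+", so the regex form is the safer one to rely on.

```python
import io

import pandas as pd

# Sample data mixing tabs and runs of spaces (stand-in for whitespace.csv).
raw = "a\t  b\tc 1 2\nd\t  e\tf 3 4\n"

# sep=r"\s+" treats any run of whitespace (spaces, tabs, or both)
# as a single delimiter between fields.
df = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")

print(df)
print(df.shape)
```

Each row should come out with five fields, the same split that line.split() would give.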

145

answered Sep 23 '22 15:09

DSM
