pandas.read_csv: how to skip comment lines

Tags: python, pandas

I think I misunderstand the intention of read_csv. If I have a file 'j' like

    # notes
    a,b,c
    # more notes
    1,2,3

How can I read this file with pandas.read_csv, skipping any lines commented with '#'? The help says that commenting out whole lines is not supported, but it indicates that an empty line should be returned. Instead, I get an error:

df = pandas.read_csv('j', comment='#')

CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3

I'm currently on

    In [15]: pandas.__version__
    Out[15]: '0.12.0rc1'

On version '0.12.0-199-g4c8ad82':

In [43]: df = pandas.read_csv('j', comment='#', header=None)

CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3

asked Aug 21 '13 20:08

mathtick

2 Answers

I believe that in recent releases of pandas (as of version 0.16.0), you can pass the comment='#' parameter to pd.read_csv, and this should skip commented-out lines.

These GitHub issues show that you can do this:

https://github.com/pydata/pandas/issues/10548
https://github.com/pydata/pandas/issues/4623

See the documentation on read_csv: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
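As a minimal sketch of the above, assuming a reasonably recent pandas release, the file from the question can be reproduced in memory and parsed with comment='#'. The variable names here are illustrative only, and io.StringIO stands in for the file 'j':

```python
import io

import pandas as pd

# Same contents as the file 'j' in the question, held in memory for the demo.
data = "# notes\na,b,c\n# more notes\n1,2,3\n"

# comment='#' tells the parser to ignore everything from '#' to the end of
# the line; in recent pandas releases, fully commented lines are skipped.
df = pd.read_csv(io.StringIO(data), comment="#")

print(df)
```

If the frame comes back with extra all-NaN rows or an "Unnamed: 0" column, the installed pandas is probably old enough to show the behaviour described in the question.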

answered Oct 09 '22 08:10

hlin117

One workaround is to specify skiprows to ignore the first few entries:

    In [11]: s = '# notes\na,b,c\n# more notes\n1,2,3'

    In [12]: pd.read_csv(StringIO(s), sep=',', comment='#', skiprows=1)
    Out[12]:
         a   b   c
    0  NaN NaN NaN
    1    1   2   3

Otherwise read_csv gets a little confused:

    In [13]: pd.read_csv(StringIO(s), sep=',', comment='#')
    Out[13]:
            Unnamed: 0
    a   b            c
    NaN NaN        NaN
    1   2            3

This seems to be the case in 0.12.0; I've filed a bug report.

As Viktor points out, you can use dropna to remove the all-NaN rows after the fact (there is a recent open issue to have commented lines ignored completely):

    In [14]: pd.read_csv(StringIO(s2), comment='#', sep=',').dropna(how='all')
    Out[14]:
       a  b  c
    1  1  2  3

Note: the default index will "give away" the fact there was missing data.
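A different way to sidestep both the NaN rows and the gap in the index is to drop the comment lines before the parser ever sees them. This is only a sketch of that alternative, not something the parser does itself: strip_comment_lines is a hypothetical helper written for this example, and it assumes a comment line is any line whose first non-blank character is '#'.

```python
def strip_comment_lines(text, marker="#"):
    """Return text without the lines whose first non-blank character is marker."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not line.lstrip().startswith(marker)
    ]
    return "".join(kept)


# Same contents as the file in the question.
raw = "# notes\na,b,c\n# more notes\n1,2,3\n"
cleaned = strip_comment_lines(raw)

print(cleaned, end="")
```

The cleaned text can then be wrapped in a StringIO and handed to pd.read_csv as in the sessions above. Because no placeholder rows are produced, the default index stays contiguous.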

answered Oct 09 '22 09:10

Andy Hayden
