
pandas.read_csv: how to skip comment lines

Tags: python, pandas

I think I misunderstand the intention of read_csv. If I have a file 'j' like

    # notes
    a,b,c
    # more notes
    1,2,3

How can I read this file with pandas.read_csv, skipping any lines commented with '#'? The help says that commenting out whole lines is not supported, but it indicates that an empty line should be returned. Instead, I get an error:

df = pandas.read_csv('j', comment='#') 

CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3

I'm currently on

In [15]: pandas.__version__
Out[15]: '0.12.0rc1'

On version '0.12.0-199-g4c8ad82':

In [43]: df = pandas.read_csv('j', comment='#', header=None) 

CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3

mathtick asked Aug 21 '13 20:08



2 Answers

I believe that in recent releases of pandas (as of version 0.16.0), you can pass comment='#' to pd.read_csv, and this should skip commented-out lines.

These GitHub issues show that you can do this:

  • https://github.com/pydata/pandas/issues/10548
  • https://github.com/pydata/pandas/issues/4623

See the documentation on read_csv: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
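
As a minimal sketch of the above, assuming a reasonably recent pandas and using an in-memory string in place of the file 'j' (the variable names are only illustrative):

```python
from io import StringIO

import pandas as pd

# Same contents as the file 'j' in the question, held in memory.
data = "# notes\na,b,c\n# more notes\n1,2,3\n"

# comment='#' tells the parser to ignore everything from '#' to the end
# of the line; lines that are entirely comments are skipped, so the
# header is taken from the first non-comment line.
df = pd.read_csv(StringIO(data), comment="#")
print(df)
```

Note that the same character is also treated as a comment marker in the middle of a line, so this only suits files where '#' never appears inside real data.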

hlin117 answered Oct 09 '22 08:10


One workaround is to specify skiprows to ignore the first few entries:

In [11]: s = '# notes\na,b,c\n# more notes\n1,2,3'

In [12]: pd.read_csv(StringIO(s), sep=',', comment='#', skiprows=1)
Out[12]:
     a   b   c
0  NaN NaN NaN
1    1   2   3

Otherwise read_csv gets a little confused:

In [13]: pd.read_csv(StringIO(s), sep=',', comment='#')
Out[13]:
        Unnamed: 0
a   b            c
NaN NaN        NaN
1   2            3

This seems to be the case in 0.12.0; I've filed a bug report.
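
Another option, which is not part of the original answer and is only a sketch, is to drop comment lines in plain Python before the text reaches read_csv. This does not depend on how a given pandas version handles comment=. The names raw, kept and df_filtered are hypothetical:

```python
from io import StringIO

import pandas as pd

# Stand-in for the contents of the file 'j'.
raw = "# notes\na,b,c\n# more notes\n1,2,3\n"

# Keep only lines whose first non-blank character is not '#'.
kept = [line for line in raw.splitlines() if not line.lstrip().startswith("#")]

# Parse the remaining lines as ordinary CSV.
df_filtered = pd.read_csv(StringIO("\n".join(kept)))
print(df_filtered)
```

This only removes whole-line comments; a trailing '# ...' after data on the same line would be left in place.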

As Viktor points out, you can use dropna to remove the NaN rows after the fact (there is a recent open issue to have commented lines ignored completely):

In [14]: pd.read_csv(StringIO(s2), comment='#', sep=',').dropna(how='all')
Out[14]:
   a  b  c
1  1  2  3

Note: the default index will "give away" the fact there was missing data.
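
If the gap in the index matters, one possible follow-up (my addition, not something the answer proposes) is to renumber the rows with reset_index(drop=True). In the sketch below, a row of empty fields stands in for the all-NaN row that a commented line produced in 0.12.0; the variable names are illustrative:

```python
from io import StringIO

import pandas as pd

# The ',,' row is an assumption: it mimics the all-NaN row that a
# commented line left behind in pandas 0.12.0.
text = "a,b,c\n,,\n1,2,3\n"

df_raw = pd.read_csv(StringIO(text))

# Drop rows where every column is NaN; the surviving row keeps label 1.
df_clean = df_raw.dropna(how="all")

# Renumber from 0 so the index no longer reveals the dropped row.
df_renumbered = df_clean.reset_index(drop=True)
print(df_renumbered)
```

The columns end up as floats here because the NaN row forces a floating-point dtype before it is dropped.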

Andy Hayden answered Oct 09 '22 09:10