 

read_fwf in pandas in Python does not use comment character if colspecs argument does not include first column

When reading fixed-width files with the read_fwf function in pandas (0.18.1) on Python (3.4.3), you can specify a comment character with the comment argument. I expected every line beginning with the comment character to be ignored. However, if none of the ranges in colspecs covers the first column of the file, the comment character does not appear to be used.

import io, sys
import pandas as pd

sys.version
# '3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32 bit (Intel)]'
pd.__version__
# '0.18.1'

# Two input files, first line is comment, second line is data.
# Second file has a column (with the letter A) 
# that I don't want at start of data.
string = "#\n1K\n"
off_string = "#\nA1K\n"

# When using skiprows to skip commented row, both work.
pd.read_fwf(io.StringIO(string), colspecs = [(0,1), (1,2)], skiprows = 1, header = None)
#    0  1
# 0  1  K

pd.read_fwf(io.StringIO(off_string), colspecs = [(1,2), (2,3)], skiprows = 1, header = None)
#    0  1
# 0  1  K

# If a comment character is specified, it only works when the colspecs 
# includes the column with the comment character.
pd.read_fwf(io.StringIO(string), colspecs = [(0,1), (1,2)], comment = '#', header = None)
#    0  1
# 0  1  K

pd.read_fwf(io.StringIO(off_string), colspecs = [(1,2), (2,3)], comment = '#', header = None)
#      0    1
# 0  NaN  NaN
# 1  1.0    K

Is there any documentation specifically referring to this? The simple workaround is to include the first column in colspecs and drop it afterwards, but I wanted to verify whether this is a bug or whether I am misunderstanding the expected behaviour.
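For reference, a minimal sketch of that workaround, assuming the unwanted leading column is exactly one character wide as in off_string above. The names with_first and trimmed are only illustrative.

```python
import io
import pandas as pd

# Same input as above: a comment line, then a data line with an unwanted "A".
off_string = "#\nA1K\n"

# Include the first column (0, 1) in colspecs so that the comment
# character falls inside one of the parsed fields.
with_first = pd.read_fwf(
    io.StringIO(off_string),
    colspecs=[(0, 1), (1, 2), (2, 3)],
    comment="#",
    header=None,
)

# Drop the unwanted first column by position and renumber the rest.
trimmed = with_first.iloc[:, 1:].copy()
trimmed.columns = range(trimmed.shape[1])
```

Selecting by position with iloc avoids depending on how the columns happen to be labelled.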

nograpes asked Aug 30 '16 19:08



1 Answer

I think this is a bug: the documentation says "if the line starts with a comment then the entire line is skipped". The problem is that FixedWidthReader.__next__ slices each line into its columns before the fields are checked for comments (in PythonParser or CParserWrapper), so a comment character that lies outside the selected columns is never seen. The relevant code is in io/parsers.py.
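Since the comment check happens after the columns are sliced, another option is to strip the comment lines yourself before pandas sees the text. This is only a sketch, not something pandas provides: drop_comment_lines is a hypothetical helper, and it handles only whole-line comments that start in the first column, not trailing comments.

```python
import io
import pandas as pd

off_string = "#\nA1K\n"

def drop_comment_lines(text, comment="#"):
    """Return text without the lines that start with the comment character.

    Hypothetical helper for illustration; not part of pandas.
    """
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not line.startswith(comment)
    ]
    return "".join(kept)

# No comment argument is needed, because the comment line is already gone.
filtered = pd.read_fwf(
    io.StringIO(drop_comment_lines(off_string)),
    colspecs=[(1, 2), (2, 3)],
    header=None,
)
```

This keeps colspecs limited to the columns you actually want, at the cost of reading the whole input into memory first.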

maxymoo answered Nov 03 '22 08:11
