When reading fixed-width files with the read_fwf function in pandas (0.18.1) on Python (3.4.3), you can specify a comment character with the comment argument. I expected that all lines beginning with the comment character would be ignored. However, if none of the column ranges in colspecs covers the first column of the file, the comment character does not appear to be honoured.
import io, sys
import pandas as pd
sys.version
# '3.4.3 (v3.4.3:9b73f1c3e601, Feb 24 2015, 22:43:06) [MSC v.1600 32 bit (Intel)]'
pd.__version__
# '0.18.1'
# Two input files, first line is comment, second line is data.
# Second file has a column (with the letter A)
# that I don't want at start of data.
string = "#\n1K\n"
off_string = "#\nA1K\n"
# When using skiprows to skip commented row, both work.
pd.read_fwf(io.StringIO(string), colspecs=[(0, 1), (1, 2)], skiprows=1, header=None)
#    0  1
# 0  1  K
pd.read_fwf(io.StringIO(off_string), colspecs=[(1, 2), (2, 3)], skiprows=1, header=None)
#    0  1
# 0  1  K
# If a comment character is specified, it only works when the colspecs
# includes the column with the comment character.
pd.read_fwf(io.StringIO(string), colspecs=[(0, 1), (1, 2)], comment='#', header=None)
#    0  1
# 0  1  K
pd.read_fwf(io.StringIO(off_string), colspecs=[(1, 2), (2, 3)], comment='#', header=None)
#      0    1
# 0  NaN  NaN
# 1  1.0    K
Is there any documentation that covers this? The simple workaround is to include the first column and remove it afterwards, but I wanted to verify whether this is a bug or whether I am misunderstanding the expected behaviour.
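For reference, the workaround mentioned above might look roughly like the sketch below. It assumes the unwanted leading column is exactly one character wide, as in off_string; the variable names are only illustrative.

```python
import io

import pandas as pd

off_string = "#\nA1K\n"

# Include column 0 in colspecs so the reader can see the '#' character,
# then drop that column once the comment line has been skipped.
df = pd.read_fwf(
    io.StringIO(off_string),
    colspecs=[(0, 1), (1, 2), (2, 3)],
    comment="#",
    header=None,
)
df = df.drop(0, axis=1)  # remove the unwanted leading 'A' column
print(df)
```

The remaining columns keep their original labels (1 and 2), so they may need renaming if later code expects them to start at 0.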
I think this is a bug: the documentation says "if the line starts with a comment then the entire line is skipped". The problem is that FixedWidthReader.__next__ subsets each line to the requested columns before the line is checked for comments (in PythonParser or CParserWrapper). The relevant code is in io/parsers.py.
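To illustrate the ordering problem, here is a simplified toy model in plain Python. It is not the pandas code: slice_fields, is_comment_after_slicing and is_comment_before_slicing are hypothetical helpers that only mimic the order of operations described above.

```python
def slice_fields(line, colspecs):
    """Cut a raw line into fields, the way a fixed-width reader would."""
    return [line[start:stop].strip() for start, stop in colspecs]


def is_comment_after_slicing(line, colspecs, comment="#"):
    """Look for the comment character only in the already-sliced fields."""
    return any(comment in field for field in slice_fields(line, colspecs))


def is_comment_before_slicing(line, comment="#"):
    """Look for the comment character on the raw line instead."""
    return line.startswith(comment)


comment_line = "#\n"

# Column 0 is not requested, so the '#' is cut away before any check.
print(slice_fields(comment_line, [(1, 2), (2, 3)]))              # ['', '']
print(is_comment_after_slicing(comment_line, [(1, 2), (2, 3)]))  # False

# With column 0 included, the '#' survives slicing and is detected.
print(is_comment_after_slicing(comment_line, [(0, 1), (1, 2)]))  # True

# Checking the raw line first does not depend on colspecs at all.
print(is_comment_before_slicing(comment_line))                   # True
```

In this model the comment line turns into a row of empty fields when column 0 is excluded, which matches the row of NaN values in the question's output.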