I think I misunderstand the intention of read_csv. If I have a file 'j' like
# notes
a,b,c
# more notes
1,2,3

How can I pandas.read_csv this file, skipping any '#'-commented lines? I see in the help that commenting-out of lines is not supported, but that an empty line should be returned instead. Instead I see an error:
df = pandas.read_csv('j', comment='#')
CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3
I'm currently on
In [15]: pandas.__version__
Out[15]: '0.12.0rc1'

On version '0.12.0-199-g4c8ad82':
In [43]: df = pandas.read_csv('j', comment='#', header=None)
CParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 3
By default, date columns are represented as object when loading data from a CSV file. To read the date column correctly, we can use the argument parse_dates to specify a list of date columns.
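For instance, a minimal sketch (the file name 'data.csv' and the column name 'date' are assumed here, not taken from the question):

import pandas as pd

# Parse the hypothetical 'date' column as datetime64 instead of object
df = pd.read_csv('data.csv', parse_dates=['date'])
print(df.dtypes)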
index_col: This lets you choose which column(s) to use as the index of the DataFrame. The default value is None, in which case pandas generates a default integer index starting from 0. It can be set to a column name or a column position, and that column will be used as the index.
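A short sketch of both forms (the file and the 'id' column are hypothetical):

import pandas as pd

# Use the column named 'id' as the index instead of the default 0, 1, 2, ...
df = pd.read_csv('data.csv', index_col='id')

# Equivalent by position: use the first column as the index
df = pd.read_csv('data.csv', index_col=0)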
One of the optional parameters in read_csv() is sep, short for separator. This is the delimiter we talked about before. The sep parameter tells the parser which delimiter is used in the dataset, or in layman's terms, how the data items are separated in the CSV file.
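For example, assuming hypothetical semicolon- and tab-separated files:

import pandas as pd

# Tell read_csv which delimiter to expect
df_semicolon = pd.read_csv('data_semicolon.csv', sep=';')
df_tabs = pd.read_csv('data.tsv', sep='\t')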
You can use the following methods to skip rows when reading a CSV file into a pandas DataFrame:

# import DataFrame and skip 2nd row
df = pd.read_csv('my_data.csv', skiprows=[2])

# import DataFrame and skip 2nd and 4th row
df = pd.read_csv('my_data.csv', skiprows=[2, 4])
A naive way to read a file and skip initial comment lines is to use an "if" statement and check whether each line starts with the comment character "#". Python strings have a handy method, "startswith", to check if a string (in this case a line) starts with specific characters. For example, "#comment".startswith("#") returns True.
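A sketch of that idea, applied to the 'j' file from the question (using io.StringIO to pass the filtered text on to pandas):

import pandas as pd
from io import StringIO

# Keep only the lines that do not start with '#', then parse the rest
with open('j') as f:
    filtered = ''.join(line for line in f if not line.startswith('#'))

df = pd.read_csv(StringIO(filtered))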
When the parameters header and skiprows are combined, the rows are skipped first and then the first of the remaining rows is used as the header. In the example below, 3 rows from the CSV file will be skipped.
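A minimal sketch of that combination (the file name 'my_data.csv' is an assumption):

import pandas as pd

# Skip the first 3 lines of the file, then use the first remaining line as the header
df = pd.read_csv('my_data.csv', skiprows=3, header=0)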
Some useful approaches are given below (see the sketch after this list):
Method 1: Skipping N rows from the start while reading a CSV file.
Method 2: Skipping rows at specific positions while reading a CSV file.
Method 3: Skipping N rows from the start except the column names while reading a CSV file.
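One way to express all three, again against a hypothetical 'my_data.csv':

import pandas as pd

# Method 1: skip the first N rows (here N = 2)
df1 = pd.read_csv('my_data.csv', skiprows=2)

# Method 2: skip rows at specific (0-indexed) line positions
df2 = pd.read_csv('my_data.csv', skiprows=[1, 3])

# Method 3: skip the first N rows but keep line 0, i.e. the column names (here N = 2)
df3 = pd.read_csv('my_data.csv', skiprows=range(1, 3))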
So I believe in the latest releases of pandas (version 0.16.0), you could throw the comment='#' parameter into pd.read_csv and this should skip commented-out lines.
These GitHub issues show that you can do this.
See the documentation on read_csv: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
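For example, a minimal sketch using the 'j' file from the question (the expected output is based on current pandas behavior, where fully commented lines are ignored altogether):

import pandas as pd

df = pd.read_csv('j', comment='#')
print(df)
#    a  b  c
# 0  1  2  3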
One workaround is to specify skiprows to ignore the first few entries:
In [11]: s = '# notes\na,b,c\n# more notes\n1,2,3'

In [12]: pd.read_csv(StringIO(s), sep=',', comment='#', skiprows=1)
Out[12]:
     a    b    c
0  NaN  NaN  NaN
1    1    2    3
Otherwise read_csv gets a little confused:
In [13]: pd.read_csv(StringIO(s), sep=',', comment='#')
Out[13]:
          Unnamed: 0
a    b             c
NaN  NaN         NaN
1    2             3
This seems to be the case in 0.12.0, I've filed a bug report.
As Viktor points out, you can use dropna to remove the NaN after the fact... (there is a recent open issue to have commented lines be ignored completely):
In [14]: pd.read_csv(StringIO(s2), comment='#', sep=',').dropna(how='all')
Out[14]:
   a  b  c
1  1  2  3
Note: the default index will "give away" the fact there was missing data.