I was reading an xlm file using pandas.read_html and it works almost perfectly. The problem is that the file uses commas as decimal separators instead of dots (the default in read_html).
I could easily replace the commas with dots in one file, but I have almost 200 files with that configuration.
With pandas.read_csv you can define the decimal separator, but for some reason pandas.read_html only lets you define the thousands separator.
Any guidance on this? Is there another way to automate the comma/dot replacement before the file is opened by pandas? Thanks in advance!
This did not start working for me until I used both decimal=',' and thousands='.'
Pandas version: 0.23.4
So try passing both decimal and thousands, e.g.:
pd.read_html(io="http://example.com", decimal=',', thousands='.')
Before, I only passed decimal=',' and the number columns were stored as type str with the comma simply dropped (weird behaviour). For example, "0,7" became "07" and "1,9" became "19".
The values are still stored in the DataFrame as type str, but at least I don't have to put in the dots manually. The numbers are displayed correctly: "0,7" -> "0.7".
I am using pandas 0.19, but it still fails to convert the numbers correctly.
For example:
a=pd.read_html(r.text,thousands='.',decimal=',')
will recognize the value "1.401,40" in a table cell as 140140 (float).
I use a solution similar to the one from 'Pablo A', just correcting for NaN values:
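import numpy as np
import pandas as pd

def to_numeric_comma(series):
    # Drop thousands dots, turn decimal commas into dots; str(NaN) gives 'nan'.
    new = series.apply(lambda x: str(x).replace('.', '').replace(',', '.'))
    new = pd.to_numeric(new.replace('nan', np.nan))  # pd.np is gone in newer pandas
    return new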
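As a rough illustration of how a converter like this could be applied to a whole table at once, here is a minimal sketch. The DataFrame and its column names are made up for the example and simply stand in for a table returned by read_html in which every column holds European-formatted text.

```python
import numpy as np
import pandas as pd

def to_numeric_comma(series):
    """Convert strings like '1.401,40' to floats, keeping missing values as NaN."""
    cleaned = series.apply(lambda x: str(x).replace(".", "").replace(",", "."))
    return pd.to_numeric(cleaned.replace("nan", np.nan))

# Hypothetical sample data (not from the original files).
df = pd.DataFrame({
    "price": ["1.401,40", "0,7", np.nan],
    "amount": ["1,9", "12", "3.000"],
})

# DataFrame.apply passes each column to the function as a Series.
converted = df.apply(to_numeric_comma)
print(converted)
```

Note that this only makes sense for columns that are still text; a column that pandas has already parsed as float would have its decimal dot stripped by the first replace.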
Looking at the source code of read_html:
def read_html(io, match='.+', flavor=None, header=None, index_col=None,
              skiprows=None, attrs=None, parse_dates=False,
              tupleize_cols=False, thousands=',', encoding=None,
              decimal='.', converters=None, na_values=None,
              keep_default_na=True):
The function header shows that a decimal separator argument is available in the function call.
Further down, the documentation indicates that it was added in version 0.19 (so a bit further down the experimental branch). Can you upgrade your pandas?
decimal : str, default '.'
    Character to recognize as decimal point (e.g. use ',' for European data).
    .. versionadded:: 0.19.0
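If you are not sure whether your installed pandas already has this argument, one way to check is to inspect the function signature directly. This is only a small sketch using the standard inspect module; it reports what your installation exposes and does not read any file.

```python
import inspect
import pandas as pd

# Look up the parameters that this installation's read_html accepts.
params = inspect.signature(pd.read_html).parameters

print("pandas version:", pd.__version__)
print("supports decimal:", "decimal" in params)
print("supports thousands:", "thousands" in params)
```

If "supports decimal" prints False, the installed version predates the argument and a manual conversion of the columns is the remaining option.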
Thanks @zhqiat. I think upgrading pandas to version 0.19 will solve the problem. Unfortunately, I couldn't find an easy way to do that: I found a tutorial for upgrading pandas, but it is for Ubuntu (I'm a WinXP user).
I finally chose the workaround, using the method posted here: basically converting all columns, one by one, to a numeric type of pandas.Series
result[col] = result[col].apply(lambda x: x.replace(".", "").replace(",", "."))
I know this solution isn't the best, but it works. Thanks
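For reference, the column-by-column workaround could look roughly like the sketch below. The helper name convert_columns, the sample DataFrame, and the column names are all assumptions for illustration. It uses the vectorized .str accessor with regex=False (available in reasonably recent pandas versions) so that "." is treated as a literal dot, and pd.to_numeric with errors="coerce" so that anything unparseable becomes NaN.

```python
import pandas as pd

def convert_columns(result, columns):
    """Return a copy of `result` with the listed text columns parsed as numbers."""
    result = result.copy()
    for col in columns:
        cleaned = (
            result[col].astype(str)
            .str.replace(".", "", regex=False)   # drop thousands separators
            .str.replace(",", ".", regex=False)  # decimal comma -> decimal dot
        )
        # Values that still are not numbers (e.g. 'nan') become NaN.
        result[col] = pd.to_numeric(cleaned, errors="coerce")
    return result

# Hypothetical table standing in for one parsed file.
table = pd.DataFrame({"name": ["a", "b"], "value": ["1.401,40", "0,7"]})
fixed = convert_columns(table, ["value"])
print(fixed)
```

The same function could then be called inside a loop over all the files, passing only the columns that are still stored as text; columns already parsed as floats should be left out, since removing their dot would change the value.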