I'm using read_csv to read CSV files into Pandas data frames. My CSV files contain large numbers of decimals/floats. The numbers are encoded using the European decimal notation:
1.234.456,78
This means that the '.' is used as the thousands separator and the ',' is the decimal mark.
Pandas 0.8 provides a read_csv argument called 'thousands' to set the thousands separator. Is there an additional argument to provide the decimal mark as well? If not, what is the most efficient way to parse a European-style decimal number?
Currently I'm using a string replace, which I consider a significant performance penalty. The code I'm using is this:
# Change the decimal mark from ',' to '.' and convert to float
f = lambda x: float(x.replace(',', '.'))
df['MyColumn'] = df['MyColumn'].map(f)
Any help is appreciated.
For European style numbers, use the thousands and decimal parameters in pandas.read_csv.
For example:
pandas.read_csv('data.csv', thousands='.', decimal=',')
From the docs:
thousands : str, optional
    Thousands separator.
decimal : str, default '.'
    Character to recognize as decimal point (e.g. use ',' for European data).
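As a minimal, self-contained sketch of how those two parameters behave together (the column names and sample values here are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical European-formatted CSV data, inlined for a runnable example
csv_data = '"city","revenue"\n"Berlin","1.234.456,78"\n"Wien","987,65"\n'

# thousands='.' strips the grouping dots and decimal=',' treats the comma
# as the decimal mark, so the column is parsed directly as float64
df = pd.read_csv(io.StringIO(csv_data), thousands='.', decimal=',')

print(df['revenue'].tolist())  # [1234456.78, 987.65]
print(df['revenue'].dtype)     # float64
```

Because the parsing happens in the C reader itself, this avoids any per-value Python-level string replacement after the fact.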
You can use the converters kw in read_csv. Given /tmp/data.csv like this:
"x","y"
"one","1.234,56"
"two","2.000,00"
you can do:
In [20]: pandas.read_csv('/tmp/data.csv', converters={'y': lambda x: float(x.replace('.', '').replace(',', '.'))})
Out[20]:
     x        y
0  one  1234.56
1  two  2000.00
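If the column has already been loaded as strings, a vectorized cleanup with Series.str.replace followed by a single astype(float) avoids calling a Python lambda once per value. A sketch, using a hypothetical frame with the same sample values:

```python
import pandas as pd

# Hypothetical frame where the column arrived as European-formatted strings
df = pd.DataFrame({'y': ['1.234,56', '2.000,00']})

# Vectorized cleanup: drop the grouping dots, swap the decimal comma,
# then cast the whole column to float in one pass
df['y'] = (df['y']
           .str.replace('.', '', regex=False)
           .str.replace(',', '.', regex=False)
           .astype(float))

print(df['y'].tolist())  # [1234.56, 2000.0]
```

Note regex=False, so the '.' is treated as a literal dot rather than a regular-expression wildcard.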