Pandas: select string with unicode characters

I am trying to select rows by the value of one of the columns. That works perfectly well as long as the selected value is pure ASCII. If it contains non-ASCII characters, however, I cannot get it to work, no matter how I encode the value.

Simplified example to illustrate the problem:

>>> from __future__ import (absolute_import, division, 
                            print_function, unicode_literals)
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 'Stuttgart'], [2, 'München']], columns=['id', 'city'])
>>> df['city'] = df['city'].map(lambda x: x.encode('latin-1'))
>>> store = pd.HDFStore('test_store.h5')
>>> store.append('test_key', df, data_columns=True)
>>> store['test_key']
   id       city
0   1  Stuttgart
1   2    M�nchen

Note that the non-ASCII string is indeed stored properly:

>>> store['test_key']['city'][1]
'M\xfcnchen'
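
The byte `\xfc` in that value is the latin-1 encoding of `ü`. As a minimal sketch (plain Python 3, no pandas; the variable names are illustrative and not from the question), the two encodings of the same string can be compared directly. It also hints at one possible reason for the `�` in the table above: the latin-1 bytes are not valid UTF-8, so a display that decodes as UTF-8 would substitute the replacement character. That is an assumption about the terminal, not something the question confirms.

```python
# Sketch for Python 3: compare the latin-1 and UTF-8 encodings of one string.
city = "M\u00fcnchen"  # "München", written with an escape to stay ASCII-safe

latin1_bytes = city.encode("latin-1")  # one byte for the umlaut
utf8_bytes = city.encode("utf-8")      # two bytes for the umlaut

print(latin1_bytes)  # b'M\xfcnchen'
print(utf8_bytes)    # b'M\xc3\xbcnchen'

# The latin-1 byte sequence cannot be decoded as UTF-8.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
print(latin1_is_valid_utf8)  # False

# Decoding leniently replaces the offending byte with U+FFFD.
shown = latin1_bytes.decode("utf-8", errors="replace")
print(ascii(shown))  # 'M\ufffdnchen'
```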

Selecting an ASCII value works just fine:

>>> store.select('test_key', where='city==%r' % 'Stuttgart')
   id       city
0   1  Stuttgart

But selecting the non-ASCII value fails to return the row:

>>> store.select('test_key', where='city==%r' % 'München')
Empty DataFrame
Columns: [id, city]
Index: []

>>> store.select('test_key', where='city==%r' % 'München'.encode('latin-1'))
Empty DataFrame
Columns: [id, city]
Index: []

Clearly I am doing something wrong. How can this be solved?

ARF asked May 17 '26 15:05



1 Answer

Oddly, selection seems to work fine if the encoding is utf-8 instead of latin-1:

from __future__ import (absolute_import, division, 
                        print_function, unicode_literals)

import pandas as pd

df = pd.DataFrame([[1, 'Stuttgart'], [2, 'München']], columns=['id', 'city'])
df['city'] = df['city'].map(lambda x: x.encode('utf-8'))
store = pd.HDFStore('/tmp/test_store.h5', 'w')
store.append('test_key', df, data_columns=True)
print(store.select('test_key', where='city==%r' % 'Stuttgart'.encode('utf-8')))
#    id       city
# 0   1  Stuttgart

print(store.select('test_key', where='city==%r' % 'München'.encode('utf-8')))
#    id     city
# 1   2  München

store.close()
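
If the column is stored as UTF-8 bytes as above, the values have to be encoded before writing and decoded after reading, and the query literal has to use the same encoding as the stored data. A minimal sketch of such a pair of helpers, assuming Python 3 and using only the standard library (`encode_for_store` and `decode_from_store` are hypothetical names, not pandas or PyTables functions):

```python
# Hypothetical helpers, not part of pandas: keep one encoding for
# writing, querying, and reading back.
STORE_ENCODING = "utf-8"  # assumption: the encoding used when storing


def encode_for_store(text, encoding=STORE_ENCODING):
    """Turn text into the bytes that would be written to the store."""
    return text.encode(encoding)


def decode_from_store(raw, encoding=STORE_ENCODING):
    """Turn a stored value back into text; pass text through unchanged."""
    return raw.decode(encoding) if isinstance(raw, bytes) else raw


city = "M\u00fcnchen"  # "München"
stored = encode_for_store(city)

# Round trip: the decoded value equals the original text.
print(decode_from_store(stored) == city)  # True

# A query literal encoded differently is a different byte string.
print(stored == encode_for_store(city, "latin-1"))  # False
```

The last comparison only shows that the two encodings produce different bytes for the same city name. It does not explain why the latin-1 selection in the question returns no rows.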
unutbu answered May 20 '26 03:05
