
Working with unicode encoded Strings from Active Directory via python-ldap

I have asked about this problem before, but after some testing I decided to open a new question with more specific information:

I am reading user accounts from our Active Directory with python-ldap (on Python 2.7). This works well, but I have problems with special characters. They look like UTF-8 encoded strings when printed on the console. The goal is to write them into a MySQL database, but I cannot get these strings into proper UTF-8 in the first place.

Example (fullentries is my array with all the AD entries):

fullentries[23][1].decode('utf-8', 'ignore')    
print fullentries[23][1].encode('utf-8', 'ignore')
print fullentries[23][1].encode('latin1', 'ignore')
print repr(fullentries[23][1])

A second test with a string inserted by hand as follows:

testentry = "M\xc3\xbcller"
testentry.decode('utf-8', 'ignore')
print testentry.encode('utf-8', 'ignore')
print testentry.encode('latin1', 'ignore')
print repr(testentry)

The output of the first example is:

M\xc3\xbcller
M\xc3\xbcller
u'M\\xc3\\xbcller'

Edit: If I try to replace the double backslashes with .replace('\\\\', '\\'), the output remains the same.

The output of the second example:

Müller
M�ller
'M\xc3\xbcller'

Is there any way to get the AD output properly encoded? I have already read a lot of documentation, but all of it states that LDAPv3 gives you strictly UTF-8 encoded strings, and Active Directory uses LDAPv3.

My older question on this topic is here: Writing UTF-8 String to MySQL with Python

Edit: Added repr(s) output.

Raptor asked Apr 22 '26 19:04



1 Answer

First, know that printing to a Windows console is often the step that garbles data, so for your tests, you should print repr(s) to see the precise bytes you have in your string.

You need to find out how the data from AD is encoded. Again, print repr(s) will let you see the content of the data.
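As a rough illustration of what to look for in the repr output, the sketch below compares two hypothetical values: real UTF-8 bytes, and text in which the bytes have been turned into literal backslash escapes (the shape shown in the question's repr). The variable names are made up for this example, and it is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample values, not taken from a live directory.

# Real UTF-8 bytes: "ü" is stored as the two bytes 0xC3 0xBC.
real_utf8 = b"M\xc3\xbcller"

# Escaped text: the backslashes, "x" and hex digits are literal characters.
escaped = u"M\\xc3\\xbcller"

print(repr(real_utf8))   # single backslashes in the repr
print(repr(escaped))     # doubled backslashes in the repr

# The lengths differ because each escape is four characters, not one byte.
print(len(real_utf8))    # 7
print(len(escaped))      # 13

# A literal backslash in the value is the telltale sign of escaped data.
print(u"\\" in escaped)  # True
```

If the repr shows doubled backslashes, as in the question, the value holds escape sequences as text rather than the encoded bytes themselves.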

UPDATED:

OK, it looks like you're getting strange strings somehow. There might be a way to get them better, but you can adapt in any case, though it isn't pretty:

u.decode('unicode_escape').encode('iso8859-1').decode('utf8')
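To make the chain easier to follow, here is a minimal sketch that spells out each step. The helper name unescape_utf8 is hypothetical, and it assumes the input contains only ASCII characters (literal backslash escapes, as in the question's repr output); anything else would raise an error at the first step. It uses codecs.decode so that it runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs


def unescape_utf8(escaped):
    """Turn text like u'M\\xc3\\xbcller' into u'M\xfcller' (Mueller with u-umlaut).

    Assumes `escaped` is ASCII-only text containing literal \\xNN escapes.
    """
    # Step 1: interpret the literal \xNN escapes. Each escape becomes one
    # character whose code point equals the byte value (0xC3, 0xBC, ...).
    one_char_per_byte = codecs.decode(escaped.encode("ascii"), "unicode_escape")

    # Step 2: Latin-1 maps code points 0-255 to the same byte values,
    # so this recovers the original raw bytes.
    raw_bytes = one_char_per_byte.encode("latin-1")

    # Step 3: those raw bytes are the real UTF-8 data; decode them.
    return raw_bytes.decode("utf-8")


fixed = unescape_utf8(u"M\\xc3\\xbcller")
print(repr(fixed))
```

The result is a unicode object, so it still has to be encoded (for example with .encode('utf-8')) wherever bytes are required.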

You might want to look into whether you can get the data in a more natural format.

Ned Batchelder answered Apr 25 '26 09:04



