
Process a large text file in Python

I have a very large file (3.8G) that is an extract of users from a system at my school. I need to reprocess that file so that it just contains their ID and email address, comma separated.

I have very little experience with this and would like to use it as a learning exercise for Python.

The file has entries that look like this:

dn: uid=123456789012345,ou=Students,o=system.edu,o=system
LoginId: 0099886
mail: [email protected]

dn: uid=543210987654321,ou=Students,o=system.edu,o=system
LoginId: 0083156
mail: [email protected]

I am trying to get a file that looks like:

0099886,[email protected]
0083156,[email protected]

Any tips or code?

Alistair asked Dec 04 '22 20:12



1 Answer

That looks like an LDIF file to me. The python-ldap library ships a pure-Python LDIF module (ldif) that could help if your file contains some of the nasty gotchas possible in LDIF, e.g. Base64-encoded values, folded lines, etc.

You could use it like so:

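import csv
import ldif

class ParseRecords(ldif.LDIFParser):
    def __init__(self, input_file, csv_writer):
        # let the base class set up its reading state first
        ldif.LDIFParser.__init__(self, input_file)
        self.csv_writer = csv_writer

    def handle(self, dn, entry):
        # each attribute maps to a list of values (bytes under Python 3); take the first
        login_id = entry['LoginId'][0].decode('utf-8')
        mail = entry['mail'][0].decode('utf-8')
        self.csv_writer.writerow([login_id, mail])

# Python 3: read the LDIF as bytes, write the CSV as text with newline=''
with open('/path/to/large_file', 'rb') as infile, open('output_file', 'w', newline='') as outfile:
    csv_writer = csv.writer(outfile)
    csv_writer.writerow(['LoginId', 'Mail'])
    ParseRecords(infile, csv_writer).parse()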
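
If the extract really is as regular as the sample in the question (one attribute per line, entries separated by blank lines, no Base64 values or folded lines), a plain line-by-line pass with only the standard library may be enough. This is a minimal sketch under those assumptions; the names iter_records and convert are made up for illustration, and the header row is left out to match the output shown in the question.

```python
import csv

def iter_records(lines):
    """Yield (login_id, mail) for each blank-line-separated entry."""
    login_id = mail = None
    for line in lines:
        line = line.strip()
        if not line:
            # a blank line ends the current entry
            if login_id and mail:
                yield login_id, mail
            login_id = mail = None
        elif line.startswith('LoginId:'):
            login_id = line.split(':', 1)[1].strip()
        elif line.startswith('mail:'):
            mail = line.split(':', 1)[1].strip()
    # the last entry may not be followed by a blank line
    if login_id and mail:
        yield login_id, mail

def convert(in_path, out_path):
    # Iterating over a file object reads one line at a time,
    # so the whole file is never held in memory at once.
    with open(in_path, encoding='utf-8') as src, \
         open(out_path, 'w', newline='') as dst:
        csv.writer(dst).writerows(iter_records(src))
```

Calling convert('/path/to/large_file', 'output_file') would then write one "LoginId,mail" line per entry that has both attributes. Entries missing either attribute are skipped silently, which may or may not be what you want.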

Edit

So to extract from a live LDAP directory, using the python-ldap library you would want to do something like this:

import csv
import ldap

con = ldap.initialize('ldap://server.fqdn.system.edu')
# if your LDAP directory requires authentication
# con.simple_bind_s(username, password)

try:
    # Python 3: open the CSV in text mode with newline=''
    with open('output_file', 'w', newline='') as output:
        csv_writer = csv.writer(output)
        csv_writer.writerow(['LoginId', 'Mail'])

        results = con.search_s('ou=Students,o=system.edu,o=system', ldap.SCOPE_SUBTREE, attrlist=['LoginId', 'mail'])
        for dn, attrs in results:
            # attribute values come back as lists of bytes; take the first of each
            csv_writer.writerow([attrs['LoginId'][0].decode('utf-8'), attrs['mail'][0].decode('utf-8')])
finally:
    # even if you don't have credentials, it's usually good to unbind
    con.unbind_s()

It's probably worthwhile reading through the documentation for the ldap module, especially the example.

Note that in the example above, I completely skipped supplying a filter, which you would probably want to do in production. A filter in LDAP is similar to the WHERE clause in a SQL statement; it restricts what objects are returned. Microsoft actually has a good guide on LDAP filters. The canonical reference for LDAP filters is RFC 4515.
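
To make the filter idea concrete, here is a small sketch that builds a filter string matching only entries that have both a LoginId and a mail attribute, optionally narrowed to one mail domain. The helper names (escape_filter_value, students_filter) are hypothetical and the escaping is hand-written from the RFC 4515 rules; python-ldap ships ldap.filter.escape_filter_chars for the same job, which you would normally prefer.

```python
def escape_filter_value(value):
    """Escape the characters RFC 4515 treats as special in an assertion value."""
    replacements = {
        '\\': r'\5c',
        '*': r'\2a',
        '(': r'\28',
        ')': r'\29',
        '\x00': r'\00',
    }
    return ''.join(replacements.get(ch, ch) for ch in value)

def students_filter(mail_domain=None):
    """Build an AND filter for entries that have both LoginId and mail."""
    clauses = ['(LoginId=*)', '(mail=*)']
    if mail_domain is not None:
        # substring match on the end of the mail value
        clauses[1] = '(mail=*@{})'.format(escape_filter_value(mail_domain))
    return '(&{})'.format(''.join(clauses))
```

The resulting string would be passed as the filterstr argument of search_s, for example con.search_s(base, ldap.SCOPE_SUBTREE, filterstr=students_filter(), attrlist=['LoginId', 'mail']). Filtering on the presence of both attributes also avoids a KeyError when an entry lacks one of them.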

Similarly, if there are potentially several thousand entries even after applying an appropriate filter, you may need to look into the LDAP paging control, though using that would, again, make the example more complex. Hopefully that's enough to get you started, but if anything comes up, feel free to ask or open a new question.
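
For reference, the paging loop usually looks roughly like the sketch below. It is written against duck-typed objects so it can be read without a server at hand: con is assumed to behave like a python-ldap connection (search_ext and result3), and page_control like ldap.controls.SimplePagedResultsControl. The function name iter_paged is made up for illustration, and whether your server supports the paging control at all is something you would need to check.

```python
def iter_paged(con, base, scope, filterstr, attrlist, page_control):
    """Yield (dn, attrs) pairs, fetching one page of results at a time.

    With python-ldap the control would typically be created as
    SimplePagedResultsControl(True, size=500, cookie='')
    (imported from ldap.controls); the page size of 500 is an arbitrary choice.
    """
    while True:
        msgid = con.search_ext(base, scope, filterstr, attrlist,
                               serverctrls=[page_control])
        _rtype, rdata, _rmsgid, server_controls = con.result3(msgid)
        for dn, attrs in rdata:
            yield dn, attrs
        # The server returns the paging control with a cookie for the next page.
        returned = [c for c in server_controls
                    if c.controlType == page_control.controlType]
        if not returned or not returned[0].cookie:
            break  # no control or an empty cookie: that was the last page
        page_control.cookie = returned[0].cookie
```

Because this is a generator, each page can be written to the CSV as it arrives instead of collecting every entry in memory first.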

Good luck.

ig0774 answered Dec 26 '22 09:12
