Python script to search for PII

Tags: python, privacy, pii

I would like to write a script that searches a file system for Personally Identifiable Information (PII), such as card numbers, and reports what it finds. It should cover plain-text files as well as Excel (xls), Word, and PDF files.

Any tips on getting started, or on which libraries to use, are welcome.

I'd also like advice on an efficient way to scan large files for patterns such as credit card numbers.

335

asked May 16 '12 18:05

Novice123

2 Answers

If you're working for a company, you could consider buying a packaged solution. One I've seen advertised is Nuix. Also, Oracle has an end-to-end solution for GDPR (the new EU privacy law), which includes the kind of functionality you describe. See http://www.oracle.com/technetwork/database/security/wp-security-dbsec-gdpr-3073228.pdf.

If you have the Oracle RDBMS, there is a package called CTXSYS (now called Oracle Text) which has amazing search capabilities across documents, including PDFs, the entire Office suite, and many more. CTXSYS is included in the regular license. If you're a home user, you can download Oracle server (the Express version is fine for this function).

If you're using regexes, as the other answer suggests, one simple way to find names is to search for words that are capitalized mid-sentence, but that only helps with prose documents (not so much with XLS, for example). You could also build a dictionary of common names (first and last names, streets, towns). Credit card numbers and SSNs should be readily matched with regexes.
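To make the regex part concrete, here is a minimal sketch in Python. It is only an illustration, not code from any of the products mentioned: the patterns, `luhn_ok`, and `find_pii` are names I made up, the SSN pattern assumes the hyphenated US 3-2-4 layout, and the card pattern assumes 13 to 19 digits with optional single spaces or hyphens. The Luhn checksum is a standard way to discard digit runs that cannot be real card numbers.

```python
import re

# Assumed formats: 13-19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")
# Assumed format: US SSN written as 3-2-4 digits with hyphens.
SSN_RE = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_pii(text: str) -> list:
    """Return (kind, matched_text) pairs for card-like and SSN-like strings."""
    hits = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_ok(digits):  # drop digit runs that fail the checksum
            hits.append(("card", match.group()))
    for match in SSN_RE.finditer(text):
        hits.append(("ssn", match.group()))
    return hits
```

Expect false positives and misses with patterns this simple (order numbers that happen to pass Luhn, SSNs written without hyphens), so treat the hits as candidates for review rather than confirmed PII.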

answered Oct 14 '22 18:10

Ken

Give piianalyzer a shot: https://pypi.python.org/pypi/piianalyzer/0.1.0

Or you can write your own and use a collection of common regular expressions such as CommonRegex: https://github.com/madisonmay/CommonRegex
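If you do write your own, a rough sketch of the scanning loop might look like the following. It uses only the standard library and reads each file one line at a time, so large files never have to fit in memory. The function names, the extension list, and the single SSN pattern are placeholders I chose for illustration, and the sketch assumes a match never spans a line break. Binary formats (xls, Word, PDF) would need a text-extraction step first, which is not covered here.

```python
import os
import re

# Illustrative pattern only: US SSNs written as 3-2-4 digits with hyphens.
SSN_RE = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def scan_text_file(path, pattern=SSN_RE):
    """Yield (line_number, matched_text) for each hit, reading one line at a time."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            for match in pattern.finditer(line):
                yield line_number, match.group()


def scan_tree(root, extensions=(".txt", ".csv", ".log"), pattern=SSN_RE):
    """Walk root and yield (path, line_number, matched_text) for plain-text files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue  # skip formats that need text extraction first
            path = os.path.join(dirpath, name)
            try:
                for line_number, text in scan_text_file(path, pattern):
                    yield path, line_number, text
            except OSError:
                continue  # unreadable file: skip it and keep scanning
```

Both functions are generators, so a report can be printed as hits arrive, for example `for path, line_number, text in scan_tree("/some/dir"): print(path, line_number, text)`.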

152

answered Oct 14 '22 16:10

Don Johnson

