 

Python script to search for PII

I would like to write a script that searches a file system for Personally Identifiable Information (PII), such as card numbers, and reports what it finds. It should handle plain-text files as well as Excel (xls), Word, and PDF files.

Any starting tips or library recommendations are welcome.

I'd also like advice on an efficient way to scan large files for patterns such as credit card numbers.

Novice123 asked May 16 '12 18:05





2 Answers

If you're working for a company, you could consider buying a packaged solution. One I've seen advertised is Nuix. Also, Oracle has an end-to-end solution for GDPR (the new EU privacy law), which includes the kind of functionality you describe. See http://www.oracle.com/technetwork/database/security/wp-security-dbsec-gdpr-3073228.pdf.

If you have the Oracle RDBMS, there is a package called CTXSYS (now called Oracle Text) which has amazing search capabilities across documents, including PDFs, the entire Office suite, and many more. CTXSYS is included in the regular license. If you're a home user, you can download Oracle server (the Express version is fine for this function).

If you go the regex route, one simple approach is to search for words that are capitalized mid-sentence, but that only helps with prose documents (not so much with XLS files, for example). You could also build a dictionary of common names (first and last names, streets, towns). Credit card numbers and SSNs should be readily matched with regexes.
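As a rough sketch of the regex approach for card numbers and SSNs: the pattern names and the `find_pii` helper below are my own, not from any library. The Luhn checksum is an extra step I'm adding to cut down false positives on card-like digit runs, and the SSN pattern assumes the US `ddd-dd-dddd` layout. It works on text you have already extracted, so treat it as a starting point rather than a finished detector.

```python
import re

# Candidate patterns (illustrative, deliberately loose; expect false positives).
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")   # 13-19 digits, optional space/dash separators
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")       # assumes US ddd-dd-dddd formatting


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str):
    """Yield (kind, matched_text) pairs for likely card numbers and SSNs."""
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if 13 <= len(digits) <= 19 and luhn_ok(digits):
            yield "card", m.group()
    for m in SSN_RE.finditer(text):
        yield "ssn", m.group()
```

Something like `list(find_pii(some_text))` gives you the hits for one chunk of text. A names dictionary could be added the same way, as one more entry checked against each word.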

Ken answered Oct 14 '22 18:10



Give piianalyzer a shot: https://pypi.python.org/pypi/piianalyzer/0.1.0

Or you can write your own and use a collection of common regular expressions such as https://github.com/madisonmay/CommonRegex
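If you do write your own, a minimal sketch might look like the following. It uses only the standard library; the `PATTERNS` table, `scan_file`, and `scan_tree` are names I made up for illustration, and the two regexes are simplified stand-ins for whatever pattern set you settle on. It reads each file one line at a time, so large files are never loaded into memory whole. It only covers plain-text files; xls, Word, and PDF files would need a separate text-extraction step before the same patterns can be applied.

```python
import os
import re

# Hypothetical pattern table; extend it with whatever PII you care about.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def scan_file(path):
    """Yield (line_number, kind, match) for each hit, reading one line at a time."""
    with open(path, "r", encoding="utf-8", errors="ignore") as fh:
        for lineno, line in enumerate(fh, start=1):
            for kind, pattern in PATTERNS.items():
                for m in pattern.finditer(line):
                    yield lineno, kind, m.group()


def scan_tree(root, extensions=(".txt", ".csv", ".log")):
    """Walk root and yield (path, line_number, kind, match) for plain-text files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                try:
                    for lineno, kind, match in scan_file(path):
                        yield path, lineno, kind, match
                except OSError:
                    continue  # unreadable file; skip it and keep walking
```

Both functions are generators, so you can print or log hits as they are found, for example `for hit in scan_tree("/some/dir"): print(hit)`. One caveat of the line-by-line design: a value split across two lines will be missed.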

Don Johnson answered Oct 14 '22 16:10
