I would like to write a script that searches a file system for Personally Identifiable Information (PII), such as card numbers, and reports what it finds. It should handle plain text as well as XLS, Word, and PDF files.
Any starting tips or library suggestions are welcome.
I'd also like advice on an efficient way to scan large files for patterns such as credit card numbers.
A series of rules is applied to each column of a dataset to decide whether that column is likely to contain PII. The rules are: If the column name or label matches any word in the list of restricted words (e.g. 'name', 'surname', 'ssn'; see restricted_words.py).
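The column-name rule could be sketched roughly as follows. This is not the tool's actual code: the `RESTRICTED_WORDS` set and the `flag_columns_by_name` helper are hypothetical names used for illustration, and only the three example words from the text are included.

```python
import re
from typing import Iterable, List

# Hypothetical word list; the real one lives in restricted_words.py.
RESTRICTED_WORDS = {"name", "surname", "ssn"}

def flag_columns_by_name(columns: Iterable[str]) -> List[str]:
    """Return the column names that contain a restricted word as a whole token."""
    flagged = []
    for column in columns:
        # Split on anything that is not a letter or digit, e.g. "ssn_number" -> ["ssn", "number"].
        tokens = re.split(r"[^a-z0-9]+", column.lower())
        if any(token in RESTRICTED_WORDS for token in tokens):
            flagged.append(column)
    return flagged
```

Matching on whole tokens is a design choice in this sketch; a substring match would also flag names like "username", at the cost of more false positives.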
This application identifies likely PII (personally identifiable information) in a dataset. To use it, download the .exe installer from the latest release and follow the in-app directions. This tool is currently listed as an alpha release because it is still being tested on IPA PII-containing field datasets. How does it work?
If the entries in a given column follow a specific format (at the moment phone number and date formats are checked; this could be extended to GPS coordinates, national identifiers, etc.). See find_piis_based_on_column_format() in PII_data_processory.py. If the entries in a given column are sufficiently sparse (almost all unique). This is ideal for identifying open-ended questions.
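A minimal sketch of the format and sparsity rules, assuming a column is simply a list of values. The function names, the rough `PHONE_RE` pattern, and the 0.9 thresholds are all assumptions made for illustration and are not taken from the tool.

```python
import re
from typing import Iterable, List, Pattern

# Rough, illustrative phone pattern: optional "+", then digits with common separators.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")

def _non_empty(values: Iterable[object]) -> List[str]:
    """Convert values to stripped strings and drop missing or blank entries."""
    return [str(v).strip() for v in values if v is not None and str(v).strip()]

def mostly_matches(values: Iterable[object], pattern: Pattern[str], threshold: float = 0.9) -> bool:
    """True if at least `threshold` of the non-empty entries fully match `pattern`."""
    entries = _non_empty(values)
    if not entries:
        return False
    hits = sum(1 for entry in entries if pattern.fullmatch(entry))
    return hits / len(entries) >= threshold

def is_sparse(values: Iterable[object], threshold: float = 0.9) -> bool:
    """True if at least `threshold` of the non-empty entries are distinct values."""
    entries = _non_empty(values)
    if not entries:
        return False
    return len(set(entries)) / len(entries) >= threshold
```

A pattern this loose will also accept some dates and ID numbers, so in practice it would be one signal among several rather than a final verdict.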
If you're working for a company, you could consider buying a packaged solution. One I've seen advertised is Nuix. Also, Oracle has an end-to-end solution for GDPR (the new EU privacy law), which includes the kind of functionality you describe. See http://www.oracle.com/technetwork/database/security/wp-security-dbsec-gdpr-3073228.pdf.
If you have the Oracle RDBMS, there is a package called CTXSYS (now called Oracle Text) which has amazing search capabilities across documents, including PDFs, the entire Office suite, and many more. CTXSYS is included in the regular license. If you're a home user, you can download Oracle server (the Express version is fine for this function).
If you're using regexes as suggested above, one simple approach would be to search for words that are capitalized in mid-sentence, but that only helps with documents (not so much with XLS, for example). You could also build a dictionary of common names (first/last names, streets, towns). The credit cards and SSNs should be readily regex-able.
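As a rough illustration of the regex approach for card numbers and SSNs, here is a small sketch. The patterns are deliberately simple assumptions (US-style SSNs written with hyphens, 13 to 19 digit card numbers with optional single spaces or hyphens), and `find_pii` is a hypothetical helper name. The Luhn checksum is a standard way to discard digit runs that cannot be valid card numbers.

```python
import re
from typing import Dict, List

# Illustrative patterns only; real data will need tuning.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")

def luhn_valid(number: str) -> bool:
    """Check the Luhn checksum over the digits of `number`, ignoring separators."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def find_pii(text: str) -> Dict[str, List[str]]:
    """Return card-like matches that pass the Luhn check, plus SSN-like matches."""
    cards = [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
    return {"credit_card": cards, "ssn": SSN_RE.findall(text)}
```

The checksum step matters because long digit runs (order numbers, timestamps) are common, and most of them fail the Luhn test.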
Give piianalyzer a shot: https://pypi.python.org/pypi/piianalyzer/0.1.0
Or you can write your own and use a common regular expression collection such as https://github.com/madisonmay/CommonRegex
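If you do write your own, one way to keep memory use flat on large text files is to stream them line by line instead of reading the whole file at once. The sketch below assumes matches never span a line break. `PATTERNS`, `scan_lines`, and `scan_file` are hypothetical names, and the single pattern is just a placeholder for whatever set you settle on. XLS, Word, and PDF files would first need their text extracted by a suitable parsing library, which is not shown here.

```python
import re
from typing import Dict, Iterable, Iterator, Pattern, Tuple

# Placeholder pattern set; extend with whatever expressions you need.
PATTERNS: Dict[str, Pattern[str]] = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan_lines(lines: Iterable[str], patterns: Dict[str, Pattern[str]] = PATTERNS) -> Iterator[Tuple[int, str, str]]:
    """Yield (line_number, pattern_name, matched_text) for every hit, one line at a time."""
    for line_number, line in enumerate(lines, start=1):
        for name, pattern in patterns.items():
            for match in pattern.finditer(line):
                yield line_number, name, match.group()

def scan_file(path: str) -> Iterator[Tuple[int, str, str]]:
    """Stream a text file from disk; undecodable bytes are replaced rather than raising."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # A file object is itself an iterator over lines, so only one line is held at a time.
        yield from scan_lines(handle)
```

Because `scan_lines` accepts any iterable of strings, the same function can be fed lines of text extracted from other formats.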