Extract email sub-strings from large document

Tags:

string

I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:

...<[email protected]>...

What is the best way to have Python cycle through the entire .txt file looking for all instances of a certain @domain string, grab the entire address within the <...>, and add it to a list? The trouble I have is with the variable length of different addresses.

737

asked Jul 16 '13 16:07

user1893148

1 Answer

This code extracts an email address from a string. Use it while reading the file line by line:

>>> import re
>>> line = "should we use regex more often? let me know at  [email protected]"
>>> match = re.search(r'[\w.+-]+@[\w-]+\.[\w.-]+', line)
>>> match.group(0)
'[email protected]'

If a line contains several email addresses, use findall:

>>> line = "should we use regex more often? let me know at  [email protected] or [email protected]"
>>> match = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', line)
>>> match
['[email protected]', '[email protected]']
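Applied to the original question, the same findall idea can be run over the file one line at a time, keeping only the addresses inside <...> that belong to the wanted domain. The following is a minimal sketch, not a definitive implementation: the function name extract_addresses, the bracket pattern, the file name emails.txt, and the example.com sample addresses are all assumptions made for illustration.

```python
import re

# Assumed pattern: text between "<" and ">" containing exactly one "@".
# It relies on the brackets as delimiters, so variable address length
# is not a problem.
BRACKETED = re.compile(r'<([^<>@\s]+@[^<>@\s]+)>')


def extract_addresses(lines, domain):
    """Collect bracketed addresses whose domain part equals `domain`."""
    wanted = domain.lower().lstrip('@')
    found = []
    for line in lines:
        for address in BRACKETED.findall(line):
            # Compare only the part after the "@", case-insensitively.
            if address.rsplit('@', 1)[1].lower() == wanted:
                found.append(address)
    return found


# Hypothetical sample lines standing in for the large .txt file.
sample = [
    "From: Alice <alice@example.com>, cc <bob@example.org>",
    "no address on this line",
    "Reply-To: <carol.smith+lists@example.com>",
]
print(extract_addresses(sample, "@example.com"))

# With a real file, the open file object can be passed directly, so the
# file is read lazily, line by line:
#     with open("emails.txt") as fh:
#         addresses = extract_addresses(fh, "@example.com")
```

Because a file object yields one line at a time, this approach should not need to load the whole document into memory.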

The regex above probably finds the most common non-fake email addresses. If you want to be fully aligned with RFC 5322, check which email addresses follow the specification, so that you avoid bugs in finding email addresses correctly.

Edit: as suggested in a comment by @kostek: in the string "Contact us at [email protected]." my regex returns "[email protected]." (with the dot at the end). To avoid this, use [\w\.,]+@[\w\.,]+\.\w+

Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+@[\w\.-]+\.\w+ which will capture [email protected] as well.
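To see how the patterns from the two edits differ from the one used in the code above, they can be run side by side on a sentence that ends in a period. This is only an illustrative sketch: the variable names and the hyphenated example.com-style address are assumptions, not taken from the original comments.

```python
import re

# The three patterns quoted in this answer.
patterns = {
    "answer": r'[\w.+-]+@[\w-]+\.[\w.-]+',
    "edit_1": r'[\w\.,]+@[\w\.,]+\.\w+',
    "edit_2": r'[\w\.-]+@[\w\.-]+\.\w+',
}

# Hypothetical sentence: hyphenated address followed by a full stop.
text = "Contact us at first-last@my-domain.com."

results = {name: re.findall(p, text) for name, p in patterns.items()}
for name, matches in results.items():
    print(name, matches)
```

On this input the pattern from the answer keeps the trailing dot, the Edit pattern finds nothing because its character classes do not include the hyphen, and the Edit II pattern returns the address without the dot.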

Edit III: Added further improvements as discussed in the comments: "In addition to allowing + in the beginning of the address, this also ensures that there is at least one period in the domain. It allows multiple segments of domain like abc.co.uk as well, and does NOT match bad@ss :). Finally, you don't actually need to escape periods within a character class, so it doesn't do that."
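The claims quoted in Edit III can be checked directly against the pattern used in the code at the top of this answer. The sample addresses below are made up for the check and the names PATTERN and first_match are hypothetical.

```python
import re

# Pattern from the re.search / re.findall examples in this answer.
PATTERN = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')


def first_match(text):
    """Return the first matched address in `text`, or None."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


# Assumed sample inputs, one per claim.
print(first_match("user+tag@example.com"))  # "+" allowed in the local part
print(first_match("user@abc.co.uk"))        # multi-segment domain
print(first_match("bad@ss"))                # no period in the domain
```

The first two inputs are matched in full, while the last one yields None because the pattern requires at least one period after the @.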

191

answered Oct 09 '22 22:10

0x90
