Beginners Python: Regex & Phone Numbers

Tags: python, regex

I'm working my way through a beginner's Python book and there are two fairly simple things I don't understand. I was hoping someone here might be able to help.

The example in the book uses regular expressions to take in email addresses and phone numbers from a clipboard and output them to the console. The code looks like this:

#! python3
# phoneAndEmail.py - Finds phone numbers and email addresses on the clipboard.

import pyperclip, re

# Create phone regex.
phoneRegex = re.compile(r'''(
(\d{3}|\(\d{3}\))?              #[1] area code
(\s|-|\.)?                      #[2] separator
(\d{3})                         #[3] first 3 digits
(\s|-|\.)                       #[4] separator
(\d{4})                         #[5] last 4 digits
(\s*(ext|x|ext\.)\s*(\d{2,5}))? #[6] extension
)''', re.VERBOSE)

# Create email regex.
emailRegex = re.compile(r'''(
[a-zA-Z0-9._%+-]+   
@                   
[a-zA-Z0-9.-]+      
(\.[a-zA-Z]{2,4})   
)''', re.VERBOSE)

# Find matches in clipboard text.
text = str(pyperclip.paste())           
matches = []                             

for groups in phoneRegex.findall(text):  
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)

for groups in emailRegex.findall(text):
    matches.append(groups[0])           

# Copy results to the clipboard.
if len(matches) > 0:                    
    pyperclip.copy('\n'.join(matches))
    print('Copied to Clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found')

Okay, so firstly, I don't really understand the phoneRegex object. The book mentions that adding parentheses will create groups in the regular expression.

If that's the case, are my assumed index values in the comments wrong and should there really be two groups in the index marked one? Or if they're correct, what does groups[7,8] refer to in the matching loop below for phone numbers?

Secondly, why does the emailRegex use a mixture of lists and tuples, while the phoneRegex uses mainly tuples?

Edit 1

Thanks for the answers so far, they've been helpful. Still kind of confused on the first part though. Should there be eight indexes like rock321987's answer or nine like sweaver2112's one?

Edit 2

Answered, thank you.

rsylatian asked Dec 24 '22 05:12


1 Answer

Every opening parenthesis ( marks the beginning of a capture group, and you can nest them. Groups are numbered by the position of their opening parenthesis, so this pattern has nine:

(                               #[1] around whole pattern
(\d{3}|\(\d{3}\))?              #[2] area code
(\s|-|\.)?                      #[3] separator
(\d{3})                         #[4] first 3 digits
(\s|-|\.)                       #[5] separator
(\d{4})                         #[6] last 4 digits
(\s*(ext|x|ext\.)\s*(\d{2,5}))? #[7,8,9] extension
)
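
To check the numbering yourself, here is a minimal sketch. The sample string is made up, and the dot in ext\. is escaped, which is an assumption about what the pattern intends. It relies on two things in Python's re module: a compiled pattern's groups attribute counts its capture groups, and findall returns one tuple per match in which index 0 holds group 1. That offset is why groups[8] in the question's loop is group 9, the extension digits.

```python
import re

phone_regex = re.compile(r'''(
(\d{3}|\(\d{3}\))?              # group 2: area code
(\s|-|\.)?                      # group 3: separator
(\d{3})                         # group 4: first 3 digits
(\s|-|\.)                       # group 5: separator
(\d{4})                         # group 6: last 4 digits
(\s*(ext|x|ext\.)\s*(\d{2,5}))? # groups 7, 8, 9: extension
)''', re.VERBOSE)

sample = 'Call 415-555-1234 ext 99 today.'  # made-up test string

# Number of capture groups in the compiled pattern.
print(phone_regex.groups)  # 9

# Match objects use 1-based group numbers (group 0 is the whole match).
match = phone_regex.search(sample)
for number in range(1, phone_regex.groups + 1):
    print(number, repr(match.group(number)))

# findall returns plain tuples, so group N sits at index N - 1.
row = phone_regex.findall(sample)[0]
print(len(row))                             # 9
print('-'.join([row[1], row[3], row[5]]))   # 415-555-1234
print(row[8])                               # 99
```

So the tuple has nine items, indexed 0 through 8, and the indexes 1, 3, 5 and 8 used in the question's loop pick out the area code, the first three digits, the last four digits and the extension digits.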

You should use named groups here, which Python writes as (?P<groupname>pattern), along with clustering-only parens (?:pattern) that don't capture anything. And remember, you should capture quantified constructs, not quantify captured constructs:

(?P<areacode>(?:\d{3}|\(\d{3}\))?)
(?P<separator>(?:\s|-|\.)?)
(?P<exchange>\d{3})
(?P<separator2>\s|-|\.)
(?P<lastfour>\d{4})
(?P<extension>(?:\s*(?:ext|x|ext\.)\s*(?:\d{2,5}))?)
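
In Python's re module the named-group syntax is (?P<name>...). Below is a rough sketch of how a named version of the pattern might be used with finditer and groupdict(). The sample text is invented for illustration, and the dot in ext\. is escaped as an assumption about the intent.

```python
import re

phone_regex = re.compile(r'''
(?P<areacode>(?:\d{3}|\(\d{3}\))?)
(?P<separator>(?:\s|-|\.)?)
(?P<exchange>\d{3})
(?P<separator2>\s|-|\.)
(?P<lastfour>\d{4})
(?P<extension>(?:\s*(?:ext|x|ext\.)\s*(?:\d{2,5}))?)
''', re.VERBOSE)

sample = 'Office: 415-555-1234, cell: (212) 555-9876 x42'  # invented text

# groupdict() maps each group name to the text it captured.
results = [m.groupdict() for m in phone_regex.finditer(sample)]

for parts in results:
    # Look parts up by name instead of counting parentheses.
    print(parts['areacode'], parts['exchange'], parts['lastfour'],
          repr(parts['extension']))
```

Because each optional part is quantified inside its capture, a missing area code or extension comes back as an empty string rather than None.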

Scott Weaver answered Dec 26 '22 19:12