Using ^ to match beginning of line in Python regex

Tags:

python

regex

I'm trying to extract publication years from ISI-style data exported from the Thomson-Reuters Web of Science. The line for "Publication Year" looks like this (at the very beginning of a line):

PY 2015

For the script I'm writing I have defined the following regex function:

import re
f = open('savedrecs.txt')
wosrecords = f.read()

def findyears():
    result = re.findall(r'PY (\d\d\d\d)', wosrecords)
    print(result)

findyears()

This, however, gives false positive results because the pattern may appear elsewhere in the data.

So, I want to match the pattern only at the beginning of a line. Normally I would use ^ for this purpose, but r'^PY (\d\d\d\d)' fails to match anything in my data. On the other hand, using \n seems to do what I want, but that might lead to further complications for me.

asked Jul 14 '15 07:07

chrisk

3 Answers

re.findall(r'^PY (\d\d\d\d)', wosrecords, flags=re.MULTILINE)

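should work: with re.MULTILINE, ^ matches at the start of every line, not only at the start of the whole string.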
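
A minimal sketch of the difference, assuming a made-up `sample` string (illustrative only, not real Web of Science data) in which "PY 1999" also appears in the middle of a line:

```python
import re

# Hypothetical sample: the second line contains "PY 1999" mid-line (a false positive).
sample = "TI Some title\nAB Cited in PY 1999 report\nPY 2015\nER\n"

# Without a flag, ^ only matches at the very start of the string.
no_flag = re.findall(r'^PY (\d\d\d\d)', sample)

# With re.MULTILINE, ^ also matches right after each newline.
multiline = re.findall(r'^PY (\d\d\d\d)', sample, flags=re.MULTILINE)

print(no_flag)    # []
print(multiline)  # ['2015']
```

The unflagged call returns nothing here because the string itself does not begin with PY, which is likely what happens with a full savedrecs.txt file as well.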

answered Oct 04 '22 01:10

sinhayash

You can simply add the (?m) inline modifier flag to the start of the pattern:

(?m)^PY\s+(\d{4})
^^^^

Do not confuse with (?s)! (?s) is a DOTALL inline flag that makes . match any characters including line break characters.
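
As a rough illustration of that distinction (the `sample` string is a hypothetical stand-in for the real data), the inline (?m) flag behaves like re.MULTILINE, while (?s) leaves ^ anchored to the start of the string:

```python
import re

# Hypothetical sample text, not real Web of Science data.
sample = "AU Smith, J\nPY 2015\nER"

# (?m) at the start of the pattern is equivalent to passing re.MULTILINE.
inline = re.findall(r'(?m)^PY\s+(\d{4})', sample)
flagged = re.findall(r'^PY\s+(\d{4})', sample, re.MULTILINE)

# (?s) only changes what . matches; ^ still anchors at the string start only.
dotall = re.findall(r'(?s)^PY\s+(\d{4})', sample)

print(inline)   # ['2015']
print(flagged)  # ['2015']
print(dotall)   # []
```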

Alternatively, you can pass the re.M (or re.MULTILINE) option to re.compile or re.findall:

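import re
# re.M makes ^ match at the start of every line
p = re.compile(r'^PY\s+(\d{4})', re.M)
test_str = "PY123\nPY 2015\nPY 2017"
# "PY123" is skipped because no whitespace follows PY
print(p.findall(test_str))  # ['2015', '2017']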

See an IDEONE demo.

EXPLANATION:

^ - Start of a line (due to re.M)
PY - Literal PY
\s+ - 1 or more whitespace
(\d{4}) - Capture group holding 4 digits

answered Oct 04 '22 01:10

Wiktor Stribiżew

In this particular case there is no need for regular expressions: the search string is always 'PY' and is expected at the beginning of the line, so str.find can do the job. find returns the index at which the substring first occurs, so it returns 0 when the substring is at the start of the string (and -1 when it is not found at all), i.e.:

In [12]: 'PY 2015'.find('PY')
Out[12]: 0

In [13]: ' PY 2015'.find('PY')
Out[13]: 1

It may also be a good idea to strip leading whitespace first, i.e.:

In [14]: '  PY 2015'.find('PY')
Out[14]: 2

In [15]: '  PY 2015'.strip().find('PY')
Out[15]: 0

Then, if only the year is of interest, it can be extracted with split, i.e.:

In [16]: '  PY 2015'.strip().split()[1]
Out[16]: '2015'
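
Putting these pieces together, a line-by-line version could look like the sketch below. It assumes each field sits on its own line, swaps find for str.startswith (which checks position 0 directly), and does not strip leading whitespace, so indented lines are not matched. The function name `find_years` and the `sample` string are hypothetical:

```python
def find_years(text):
    """Collect the year from every line that starts with 'PY '."""
    years = []
    for line in text.splitlines():
        # startswith checks position 0 only, so a mid-line 'PY ' is ignored.
        if line.startswith('PY '):
            parts = line.split()
            if len(parts) > 1:
                years.append(parts[1])
    return years


# Hypothetical sample: the first line contains "PY 1999" mid-line.
sample = "AB Cited in PY 1999 report\nPY 2015\nPY 2017\n"
print(find_years(sample))  # ['2015', '2017']
```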

answered Oct 04 '22 00:10

mac13k
