
Using ^ to match beginning of line in Python regex

Tags: python, regex

I'm trying to extract publication years from ISI-style data exported from the Thomson-Reuters Web of Science. The line for "Publication Year" looks like this (at the very beginning of a line):

PY 2015

For the script I'm writing I have defined the following regex function:

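import re

# Read the whole export into one string; the with-block closes the file.
with open('savedrecs.txt') as f:
    wosrecords = f.read()

def findyears():
    # Unanchored: also matches "PY dddd" in the middle of a line.
    result = re.findall(r'PY (\d\d\d\d)', wosrecords)
    print(result)

findyears()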

This, however, gives false positive results because the pattern may appear elsewhere in the data.

So I want to match the pattern only at the beginning of a line. Normally I would use ^ for this, but r'^PY (\d\d\d\d)' does not match any of my results. Prefixing the pattern with \n seems to do what I want, but that might lead to further complications for me.

chrisk asked Jul 14 '15 07:07




3 Answers

re.findall(r'^PY (\d\d\d\d)', wosrecords, flags=re.MULTILINE)

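should work. With re.MULTILINE, ^ matches at the start of every line, not only at the start of the whole string.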
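
A minimal sketch of the difference, using a made-up sample string in place of the contents of savedrecs.txt (the sample text is an assumption for illustration, not real Web of Science data):

```python
import re

# Hypothetical sample standing in for the contents of savedrecs.txt.
sample = "TI A study of COPY 1999 tapes\nPY 2015\nTI Another title\nPY 2017\n"

pattern = r'^PY (\d\d\d\d)'

# Without re.MULTILINE, ^ only matches at the very start of the string,
# and the string starts with "TI", so nothing is found.
default_matches = re.findall(pattern, sample)

# With re.MULTILINE, ^ also matches right after every newline.
multiline_matches = re.findall(pattern, sample, flags=re.MULTILINE)

# The unanchored pattern also picks up "PY 1999" inside "COPY 1999".
unanchored_matches = re.findall(r'PY (\d\d\d\d)', sample)

print(default_matches)     # []
print(multiline_matches)   # ['2015', '2017']
print(unanchored_matches)  # ['1999', '2015', '2017']
```

The last call reproduces the kind of false positive described in the question; anchoring with ^ plus re.MULTILINE removes it.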

sinhayash answered Oct 04 '22 01:10



You can simply add (?m) inline modifier flag to the start of the pattern:

(?m)^PY\s+(\d{4})
^^^^

Do not confuse with (?s)! (?s) is a DOTALL inline flag that makes . match any characters including line break characters.
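
To make the distinction concrete, here is a small sketch on a made-up two-line string (the sample text and variable names are assumptions for illustration):

```python
import re

# Hypothetical two-line sample.
text = "AU Smith\nPY 2015\n"

# (?m): ^ matches at the start of each line, so the PY line is found.
with_m = re.findall(r'(?m)^PY\s+(\d{4})', text)

# (?s): only changes what . matches; ^ still anchors at the string start only.
with_s = re.findall(r'(?s)^PY\s+(\d{4})', text)

# What (?s) does affect: . can now match the newline between the two lines.
dot_default = re.findall(r'Smith.PY', text)
dot_dotall = re.findall(r'(?s)Smith.PY', text)

print(with_m)       # ['2015']
print(with_s)       # []
print(dot_default)  # []
print(dot_dotall)   # ['Smith\nPY']
```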

Alternatively, you can compile the pattern with the re.M (re.MULTILINE) option and pass it to re.findall:

import re
p = re.compile(r'^PY\s+(\d{4})', re.M)
test_str = "PY123\nPY 2015\nPY 2017"
# "PY123" is skipped: \s+ requires whitespace after PY
print(p.findall(test_str))  # ['2015', '2017']

See an IDEONE demo.

EXPLANATION:

  • ^ - Start of a line (due to re.M)
  • PY - Literal PY
  • \s+ - 1 or more whitespace
  • (\d{4}) - Capture group holding 4 digits
Wiktor Stribiżew answered Oct 04 '22 01:10



In this particular case there is no need for regular expressions: the string being searched for is always 'PY' and is expected at the beginning of the line, so str.find can do the job. find returns the index at which the substring first occurs in the string, so it returns 0 when the substring is at the very start (and -1 when it is not found at all), i.e.:

In [12]: 'PY 2015'.find('PY')
Out[12]: 0

In [13]: ' PY 2015'.find('PY')
Out[13]: 1

It may be a good idea to strip the surrounding whitespace first, i.e.:

In [14]: '  PY 2015'.find('PY')
Out[14]: 2

In [15]: '  PY 2015'.strip().find('PY')
Out[15]: 0

Then, if only the year is of interest, it can be extracted with split, i.e.:

In [16]: '  PY 2015'.strip().split()[1]
Out[16]: '2015'
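
Put together, the steps above might look like the following sketch. The function name find_years and the sample text are hypothetical, chosen only for illustration. It deliberately skips the strip() step, so an indented "PY" line is not counted, which matches the question's requirement that PY sits at the very beginning of a line:

```python
def find_years(text):
    """Return the years from lines that begin with the field tag 'PY'."""
    years = []
    for line in text.splitlines():
        # find() returns 0 only when 'PY' sits at the very start of the line.
        if line.find('PY') != 0:
            continue
        fields = line.split()
        # Require the first field to be exactly 'PY' and a year to follow it,
        # so lines such as 'PYTHON 3' or a bare 'PY' are ignored.
        if len(fields) > 1 and fields[0] == 'PY':
            years.append(fields[1])
    return years


# Hypothetical sample: one false positive candidate and one indented PY line.
sample = "TI COPY 1999 tapes\nPY 2015\n  PY 2016\nPY 2017\n"
print(find_years(sample))  # ['2015', '2017']
```

If indented lines should count as well, calling line.strip() before the find() check, as in the snippets above, would include them.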
mac13k answered Oct 04 '22 00:10
