I'm trying to extract publication years from ISI-style data exported from the Thomson-Reuters Web of Science. The line for "Publication Year" looks like this (at the very beginning of a line):
PY 2015
For the script I'm writing I have defined the following regex function:
import re

# Read the whole export into one string
with open('savedrecs.txt') as f:
    wosrecords = f.read()

def findyears():
    result = re.findall(r'PY (\d\d\d\d)', wosrecords)
    print(result)

findyears()
This, however, gives false positive results because the pattern may appear elsewhere in the data.
So, I want to only match the pattern at the beginning of a line. Normally I would use ^ for this purpose, but r'^PY (\d\d\d\d)' fails at matching my results. On the other hand, using \n seems to do what I want, but that might lead to further complications for me.
The re.match() function checks for a match only at the beginning of the target string; it does not scan the rest of the string. If the pattern matches there, it returns a match object; otherwise, it returns None. To match a pattern elsewhere in the target string, use re.search() or re.findall() instead.
A pattern defined using RegEx can be used to match against a string. Python has a module named re to work with RegEx. Here's an example:
import re
pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)
if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")
RegEx can be used to check if a string contains the specified search pattern. Python has a built-in package called re, which can be used to work with regular expressions. Once you have imported the re module, you can start using them: the module offers a set of functions, such as match(), search(), and findall(), that allow us to search a string for a match.
The re.search() method takes two arguments: a pattern and a string. The method looks for the first location where the RegEx pattern produces a match with the string. If the search is successful, re.search() returns a match object; if not, it returns None.
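To make the difference between these functions concrete, here is a minimal sketch. The sample text is made up for illustration and is not taken from a real Web of Science export.

```python
import re

# Hypothetical three-line record; the PY field is on the second line.
text = "AU Smith\nPY 2015\nTI Example"

# match() only tries the very start of the string, so it finds nothing here.
m_match = re.match(r'PY (\d{4})', text)

# search() scans forward and returns the first occurrence anywhere.
m_search = re.search(r'PY (\d{4})', text)

# findall() returns every captured group as a list of strings.
all_years = re.findall(r'PY (\d{4})', text)

print(m_match)            # None
print(m_search.group(1))  # 2015
print(all_years)          # ['2015']
```

Note that neither search() nor findall() anchors the pattern to a line start on its own; that still needs ^ together with the multiline flag.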
re.findall(r'^PY (\d\d\d\d)', wosrecords, flags=re.MULTILINE)
should work
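A small sketch of why the flag matters, assuming a made-up sample in which "PY 1999" also appears inside a title line. Without re.MULTILINE, ^ matches only at the start of the whole string; with it, ^ also matches right after each newline.

```python
import re

# Hypothetical sample: the title line contains "HAPPY 1999", which
# includes the substring "PY 1999" and causes a false positive.
sample = "TI HAPPY 1999 revisited\nPY 2015\nER\nPY 2017\n"

unanchored = re.findall(r'PY (\d\d\d\d)', sample)
anchored_no_flag = re.findall(r'^PY (\d\d\d\d)', sample)
anchored_multiline = re.findall(r'^PY (\d\d\d\d)', sample, flags=re.MULTILINE)

print(unanchored)          # ['1999', '2015', '2017']
print(anchored_no_flag)    # []
print(anchored_multiline)  # ['2015', '2017']
```

The empty list in the middle case is the behaviour described in the question: the string as a whole starts with "TI", so ^ never lines up with "PY".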
You can simply add the (?m) inline modifier flag to the start of the pattern:
(?m)^PY\s+(\d{4})
^^^^
Do not confuse it with (?s)! (?s) is the DOTALL inline flag that makes . match any character, including line break characters.
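The difference between the two inline flags can be checked with a short sketch; the two-line sample string is an assumption made up for this illustration.

```python
import re

# Hypothetical sample: an abstract line followed by a PY line.
sample = "AB first line\nPY 2015"

# (?m) changes what ^ means: it now matches at the start of every line.
with_m = re.findall(r'(?m)^PY\s+(\d{4})', sample)

# (?s) does not affect ^ at all, so the anchored pattern still fails.
with_s = re.findall(r'(?s)^PY\s+(\d{4})', sample)

# (?s) only changes what . matches: with it, . also consumes newlines.
dot_default = re.findall(r'AB.*', sample)
dot_dotall = re.findall(r'(?s)AB.*', sample)

print(with_m)       # ['2015']
print(with_s)       # []
print(dot_default)  # ['AB first line']
print(dot_dotall)   # ['AB first line\nPY 2015']
```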
Alternatively, you can compile the pattern with the re.M (or re.MULTILINE) option and pass it to re.findall:
import re
p = re.compile(r'^PY\s+(\d{4})', re.M)
test_str = "PY123\nPY 2015\nPY 2017"
print(re.findall(p, test_str))
See an IDEONE demo.
EXPLANATION:
^ - Start of a line (due to re.M)
PY - Literal PY
\s+ - 1 or more whitespace characters
(\d{4}) - Capture group holding 4 digits

In this particular case there is no need to use regular expressions, because the searched string is always 'PY' and is expected to be at the beginning of the line, so one can use str.find for this job. The find method returns the position at which the substring is found in the given string or line, so if it is found at the start of the string the returned value is 0 (-1 if it is not found at all), i.e.:
In [12]: 'PY 2015'.find('PY')
Out[12]: 0
In [13]: ' PY 2015'.find('PY')
Out[13]: 1
Perhaps it could be a good idea to strip the whitespace, i.e.:
In [14]: '  PY 2015'.find('PY')
Out[14]: 2
In [15]: '  PY 2015'.strip().find('PY')
Out[15]: 0
Next, if only the year is of interest, it can be extracted with split, i.e.:
In [16]: ' PY 2015'.strip().split()[1]
Out[16]: '2015'
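Putting these pieces together, a regex-free version might look like the sketch below. The helper name find_years and the sample lines are hypothetical, and str.startswith is swapped in for find(...) == 0, which it is equivalent to here. This variant deliberately does not strip leading whitespace, on the assumption that only lines with PY at the very beginning should count, as the question asks.

```python
def find_years(lines):
    """Collect the year from every line that begins with 'PY '."""
    years = []
    for line in lines:
        # startswith() is true only when 'PY ' sits at position 0.
        if line.startswith('PY '):
            parts = line.split()
            if len(parts) > 1:
                years.append(parts[1])
    return years

# Hypothetical sample: one indented PY line and one false-positive line.
sample_lines = ["PT J", "PY 2015", "AB HAPPY 1999", " PY 2001", "PY 2017"]
years = find_years(sample_lines)
print(years)  # ['2015', '2017']
```

With a real file, the same function can be fed the open file object directly, since iterating over a file yields its lines one at a time.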