
Regular expression for recognizing in-text citations

Tags: regex

I'm trying to create a regular expression to capture in-text citations.

Here are a few example sentences with in-text citations:

  1. ... and the reported results in (Nivre et al., 2007) were not representative ...

  2. ... two systems used a Markov chain approach (Sagae and Tsujii 2007).

  3. Nivre (2007) showed that ...

  4. ... for attaching and labeling dependencies (Chen et al., 2007; Dredze et al., 2007).

Currently, the regular expression I have is

\(\D*\d\d\d\d\)

This matches examples 1-3, but not example 4. How can I modify it to capture example 4 as well?
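For reference, here is a minimal sketch that reproduces the behaviour described above. The names `examples`, `current`, `first_match` and `current_hits` are only illustrative. Example 4 fails because, after the first year, the pattern requires a closing parenthesis but finds a semicolon, and `\D*` cannot step over the digits of the first year to reach the second one.

```python
import re

# The four example sentences from the question.
examples = [
    "... and the reported results in (Nivre et al., 2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labeling dependencies (Chen et al., 2007; Dredze et al., 2007).",
]

# Current pattern: "(", any run of non-digits, exactly four digits, ")".
current = re.compile(r"\(\D*\d\d\d\d\)")


def first_match(pattern, sentence):
    """Return the first matched substring, or None if the pattern does not match."""
    m = pattern.search(sentence)
    return m.group(0) if m else None


current_hits = [first_match(current, s) for s in examples]

for number, hit in enumerate(current_hits, start=1):
    print(number, hit)
# 1 (Nivre et al., 2007)
# 2 (Sagae and Tsujii 2007)
# 3 (2007)
# 4 None
```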

Thanks!

asked Dec 01 '10 03:12 by mawhidby


2 Answers

Building on Tex's answer, I've written a very simple Python script called Overcite to do this for a friend (end of semester, lazy referencing; you know how it is). It's open source, MIT licensed, and hosted on Bitbucket.

It covers a few more cases than Tex's, which might be helpful (see the test file), including ampersands and references with page numbers. The whole script is basically:

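import re

# Surname: a capital letter followed by letters, apostrophes, backticks or hyphens
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"  # literal dot, optional
# Further authors: ", Smith", " and Smith", " & Smith" or " et al."
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"  # Always optional
# The year either follows a comma ("..., 2007") or sits in parentheses ("Nivre (2007)")
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = "(" + author + additional + "*" + year + ")"

matches = re.findall(regex, text)  # text: the string to scan for citations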
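To see how the pieces fit together, here is a minimal sketch that runs the assembled pattern over the four example sentences from the question. It uses the same pattern, written with raw strings and with the literal dots in "et al." and "p." escaped; the `examples` and `found` names are only illustrative. Note that example 2 comes back empty, because `year` expects either a comma or an opening parenthesis before the year.

```python
import re

author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = "(" + author + additional + "*" + year + ")"

# The four example sentences from the question.
examples = [
    "... and the reported results in (Nivre et al., 2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labeling dependencies (Chen et al., 2007; Dredze et al., 2007).",
]

# findall returns the contents of the single capturing group for every match.
found = [re.findall(regex, s) for s in examples]

for number, hits in enumerate(found, start=1):
    print(number, hits)
# 1 ['Nivre et al., 2007']
# 2 []
# 3 ['Nivre (2007)']
# 4 ['Chen et al., 2007', 'Dredze et al., 2007']
```

If citations without a comma matter, one option would be to make the comma optional in `year` (for example `,? *`), at the cost of also matching phrases such as "In 2007".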
answered Oct 05 '22 23:10 by orlade


\((.+?)\) should capture all of them
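A quick sketch of what this pattern does in practice. It returns whatever sits between any pair of parentheses, so it also picks up parenthetical remarks that are not citations. The `aside` sentence is made up for illustration, and `narrow` is a hypothetical variant of the asker's own pattern (not part of either answer) that allows further "; ... year" groups before the closing parenthesis.

```python
import re

# The four example sentences from the question.
examples = [
    "... and the reported results in (Nivre et al., 2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labeling dependencies (Chen et al., 2007; Dredze et al., 2007).",
]

# Broad pattern: lazily capture anything between "(" and the next ")".
broad = re.compile(r"\((.+?)\)")
broad_hits = [broad.findall(s) for s in examples]
# [['Nivre et al., 2007'], ['Sagae and Tsujii 2007'], ['2007'],
#  ['Chen et al., 2007; Dredze et al., 2007']]

# Made-up sentence with a parenthetical remark that is not a citation.
aside = "The parser (see Section 3) was retrained (Nivre et al., 2007)."
aside_hits = broad.findall(aside)
# ['see Section 3', 'Nivre et al., 2007']

# Hypothetical narrower variant: the asker's pattern, extended so that
# "; <non-digits> <four digits>" may repeat before the closing ")".
narrow = re.compile(r"\(\D*\d{4}(?:;\D*\d{4})*\)")
narrow_hits = [narrow.findall(s) for s in examples]
# [['(Nivre et al., 2007)'], ['(Sagae and Tsujii 2007)'], ['(2007)'],
#  ['(Chen et al., 2007; Dredze et al., 2007)']]
narrow_aside = narrow.findall(aside)
# ['(Nivre et al., 2007)']
```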

answered Oct 06 '22 00:10 by Breezer