
Regex parsing citation issue

I am trying to extract citations from a PDF of a research paper. I confirmed that my regex worked on Rubular here, but when I test my code on a real PDF it spits out oddly spaced and wrong information. How can I fix this regex so that it only extracts APA paper citations (the ones in the references section, not in-text)? The APA Examples might be useful. If you need more details, please let me know. Multiple regexes are acceptable for this answer, but I do need to be able to extract author, title, date, and journal. My attempt is below if it helps anybody:

require 'open-uri'   # needed so the PDF can be fetched over HTTPS
require 'pdf-reader'

io = URI.open('https://vhil.stanford.edu/pubs/2007/yee-proteus-effect.pdf')
out = File.open('dump.txt', 'w')
reader = PDF::Reader.new(io)

reader.pages.each do |page|
  /([a-zA-Z.,&\s]+?)(\(\d+\)).([\sa-zA-Z,:\n\t]+).([\sa-zA-Z,]+).([\sa-zA-Z,]+)/.match(page.text) do |m|
    puts "===CITATION===="
    # m[0] is the whole match; the capture groups start at m[1].
    # [\n\r\t] is a character class; /\n\r\t/ only matches that exact sequence.
    puts "author: " + m[1].gsub(/[\n\r\t]/, '')
    puts "date: " + m[2].gsub(/[\n\r\t]/, '')
    puts "title: " + m[3].gsub(/[\n\r\t]/, '')
    puts "journal: " + m[4].gsub(/[\n\r\t]/, '')
  end
  # puts page.raw_content
end
puts "\n\n\n=======\n\n\n======"
puts reader.pages.last.text
out.close

More examples (in response to comments) here and here. The entire paper is here.

To get these examples I ran out.puts page.text inside my each loop. Then I copied chunks of the text into Rubular and tested them against my original regex (above).

Rilcon42 asked Sep 25 '15 04:09



1 Answer

EDIT

Try this one. I modified it so that it also matches the items from your comment.

^(?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^()]+?[?.!])\s*(?:(?:(?<jurnal>(?:(?!^[A-Z])[^.]+?)),\s*(?<issue>\d+)[^,.]*(?=,\s*\d+|.\s*Ret))|(?:In\s*(?<editors>[^()]+))\(Eds?\.\),\s*(?<book>[^().]+)|(?:[^():]+:[^().]+\.)|(?:Retrieved|Paper presented))

Rubular DEMO
Regex101 DEMO

The last part, (?:Retrieved|Paper presented), could be extended with other words that can occur after the title.
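As a minimal sketch of how the regex above might be used from Ruby: the pattern is copied unchanged from the answer (including the group name spelled jurnal), while the helper find_citations and the two sample references are invented here purely for illustration. They are not taken from the paper in the question.

```ruby
# The answer's regex, copied verbatim.
CITATION = /^(?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^()]+?[?.!])\s*(?:(?:(?<jurnal>(?:(?!^[A-Z])[^.]+?)),\s*(?<issue>\d+)[^,.]*(?=,\s*\d+|.\s*Ret))|(?:In\s*(?<editors>[^()]+))\(Eds?\.\),\s*(?<book>[^().]+)|(?:[^():]+:[^().]+\.)|(?:Retrieved|Paper presented))/

# Hypothetical helper: collect the MatchData of every citation found in text.
def find_citations(text)
  matches = []
  pos = 0
  # Regexp#match accepts a start offset, so the search resumes after each hit.
  while (m = CITATION.match(text, pos))
    matches << m
    pos = m.end(0)
  end
  matches
end

# Invented sample references (assumption: one reference per line).
sample = <<~TEXT
  Doe, J., & Roe, R. (2001). A sample title. Journal of Examples, 12, 34-56.
  Doe, J. (1999). Chapter title. In R. Roe (Eds.), Book of Samples (pp. 1-10). City: Publisher.
TEXT

results = find_citations(sample)
results.each do |m|
  puts "author:  #{m[:author].strip}"
  puts "year:    #{m[:year]}"
  puts "title:   #{m[:title]}"
  puts "journal: #{m[:jurnal]}" if m[:jurnal]
  puts "book:    #{m[:book].strip}" if m[:book]
end
```

Groups that belong to an alternative that did not take part in the match come back as nil, which is why the journal and book lines are guarded. With real pdf-reader output, the text passed in would be page.text rather than a heredoc.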

This regex consists of two main parts:

  1. ^(?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^()]+?[?.!])\s* is the shared part, which matches the authors, date and title:

    • (?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+) - the author group. It matches from the beginning of a line, and only if the line starts with a capital letter (to avoid matching somewhere in the body text rather than in the references), followed by letters, whitespace and some punctuation. You can always add symbols to the [A-Za-z\s&.,'’] character class if other characters can occur in this part of a reference.
    • \((?<year>\d{4})\) - the year group captures four digits, but only if they are inside parentheses.
    • (?<title>[^()]+?[?.!])\s* - the title group captures one or more characters other than parentheses, followed by a character from the class [?.!]. I used [^()] because during tests I found that it prevents invalid multiline matches. Just as important, this part is constrained by the alternatives that follow: it is not matched independently, and without them it would give invalid results.
  2. (?:(?:(?<jurnal>(?:(?!^[A-Z])[^.]+?)),\s*(?<issue>\d+)[^,.]*(?=,\s*\d+|.\s*Ret))|(?:In\s*(?<editors>[^()]+))\(Eds?\.\),\s*(?<book>[^().]+)|(?:[^():]+:[^().]+\.)|(?:Retrieved|Paper presented)) holds the alternatives for matching the rest of the content:

    • (?<jurnal>(?:(?!^[A-Z])[^.]+?)),\s*(?<issue>\d+)[^,.]*(?=,\s*\d+|.\s*Ret) - this alternative matches the journal title and issue. The jurnal group matches a fragment that must not start with a capital letter at the very beginning of a line (the (?!^[A-Z]) lookahead), made of any characters except a period, with a lazy quantifier (it matches only as much as is necessary to succeed), and followed by a comma. I used [^.] because some journal titles contain commas, so [^,], which was my first idea, would not work. Instead I bounded this part with the period, which always occurs at the end of a reference; the lazy quantifier lets the pattern hand over to the following parts as early as possible. The issue group captures the digits, [^,.]* then skips any content up to the next comma or period, and the lookahead requires that this is followed by a comma and digits (page numbers) or by a period and the word Retrieved (note that the . in the lookahead is unescaped, so it accepts any character there).
    • (?:In\s*(?<editors>[^()]+))\(Eds?\.\),\s*(?<book>[^().]+) - this part matches a reference to an edited book. The editors group matches anything except parentheses (up to the Eds. or Ed. keyword), and the book group then captures the book title, which is bounded by the next parenthesis (with page numbers) or a period.
    • (?:[^():]+:[^().]+\.) - this part matches references that only carry information about the publisher and place of publication. My previous approach, putting a ? on the whole alternatives part, was not effective, because it also matched in places where another alternative should have matched.
    • (?:Retrieved|Paper presented) - this part matches references to an online source, a presentation, etc. It could be extended with other keywords.
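The two-part structure also suggests one way to extend the keyword alternative without editing the long pattern by hand. The following is only a sketch under assumptions: the shared part is copied from the answer, but the constant names, the extra keyword Unpublished and the sample reference are invented for illustration, and only the keyword alternative is rebuilt here (the journal and edited-book alternatives are left out for brevity).

```ruby
# Shared part from the answer: author, year and title.
SHARED = /^(?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^()]+?[?.!])\s*/

# Keywords that may follow the title; 'Unpublished' is an assumed addition.
TAIL_KEYWORDS = ['Retrieved', 'Paper presented', 'Unpublished'].freeze

# Regexp.union escapes each string and joins them with |.
KEYWORD_TAIL = Regexp.union(TAIL_KEYWORDS)

# Interpolating a Regexp embeds its source, so the named groups survive.
KEYWORD_CITATION = /#{SHARED}#{KEYWORD_TAIL}/

# Invented sample reference.
line = 'Doe, J. (2010). A draft on samples. Unpublished manuscript.'
m = KEYWORD_CITATION.match(line)
puts m[:author].strip
puts m[:year]
puts m[:title]
```

Keeping the keywords in a list means a new kind of reference only needs one more string, at the cost of assembling the pattern at load time.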

old attempts

If you need only the author, date, title and journal with issue, you can try:

^(?<author>(?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^?.!]+?[.?!])\s*(?!\s*Retrieved)(?:(?:(?<jurnal>(?:(?!^[A-Z])[^,])+?),\s*(?<issue>\d+)))

DEMO Rubular
DEMO regex101

However, if you are also interested in edited books, try:

^(?<author>(?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^?.!]+?[.?!])\s*(?:(?<retrieved>[Rr]etrieved.+)|(?:(?:(?<jurnal>(?:(?!^[A-Z])[^,])+?),\s*(?<issue>\d+)))|\s*In(?<editors>[^\(]+)\(Eds\.\),(?<book>[^.()]+))?

DEMO Rubular
DEMO regex101

Both regular expressions capture the relevant values into named groups: author, year, title, etc.
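For the named groups, a short sketch of turning one match into a hash of fields, using the first of the two older regexes unchanged. The constant name, the sample reference and the whitespace stripping are assumptions added here, not part of the answer.

```ruby
# The journal-only regex from above, copied verbatim.
JOURNAL_ONLY = /^(?<author>(?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^?.!]+?[.?!])\s*(?!\s*Retrieved)(?:(?:(?<jurnal>(?:(?!^[A-Z])[^,])+?),\s*(?<issue>\d+)))/

# Invented sample reference.
line = 'Doe, J., & Roe, R. (2001). A sample title. Journal of Examples, 12, 34-56.'

m = JOURNAL_ONLY.match(line)

# named_captures returns a Hash keyed by group name (as strings);
# the safe-navigation strip tolerates groups that did not participate.
fields = m.named_captures.transform_values { |v| v&.strip }

fields.each { |name, value| puts "#{name}: #{value}" }
```

The author group usually ends with the space before the opening parenthesis, so stripping the values is a convenient clean-up step.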

m.cekiera answered Sep 27 '22 22:09