I am trying to extract citations from a PDF. I confirmed that my regex worked on Rubular here, but when I test my code on a real PDF it spits out oddly spaced and wrong information. How can I fix this regex so it only extracts APA paper citations (the ones in the references section, not in-text)? The APA Examples might be useful. I am trying to get references from a research paper. If you need more details, please let me know. Multiple regexes are acceptable for this answer, but I do need to be able to extract author, title, date, and journal. My attempt is below if it helps anybody:
require 'open-uri'
require 'pdf-reader'

# open-uri is needed to open URLs; since Ruby 3.0 this must be URI.open.
io = URI.open('https://vhil.stanford.edu/pubs/2007/yee-proteus-effect.pdf')
out = File.open('dump.txt', 'w')
reader = PDF::Reader.new(io)
reader.pages.each do |page|
  /([a-zA-Z.,&\s]+?)(\(\d+\)).([\sa-zA-Z,:\n\t]+).([\sa-zA-Z,]+).([\sa-zA-Z,]+)/.match(page.text) { |m|
    puts "===CITATION===="
    # m[0] is the whole match; capture groups start at m[1].
    # The character class [\n\r\t] strips each control character individually.
    puts "author: " + m[1].gsub(/[\n\r\t]/, '')
    puts "title: " + m[3].gsub(/[\n\r\t]/, '')
    puts "date: " + m[2].gsub(/[\n\r\t]/, '')
    puts "journal: " + m[4].gsub(/[\n\r\t]/, '')
  }
  # puts page.raw_content
end
puts "\n\n\n=======\n\n\n======"
puts reader.pages.last.text
out.close
More examples (in response to comments) here and here. The entire paper is here.
To get these examples I ran out.puts page.text inside my each loop. Then I copied chunks of the text into Rubular and tested them using my original regex (above).
Most importantly, regexes should not be used for parsing nested or recursive formats. You should instead use or write a bespoke parser. For example, you can't reliably parse HTML with a regex (in Python, use BeautifulSoup; in JavaScript, use the DOM).
The Parse Regex operator (also called the extract operator) enables users comfortable with regular expression syntax to extract more complex data from log lines. Parse regex can be used, for example, to extract nested fields.
A regex pattern matches a target string. The pattern is composed of a sequence of atoms. An atom is a single point within the regex pattern which it tries to match to the target string. The simplest atom is a literal, but grouping parts of the pattern to match an atom will require using ( ) as metacharacters.
By default, regular expressions will match any part of a string. It's often useful to anchor the regular expression so that it matches from the start or end of the string: ^ matches the start of string. $ matches the end of the string.
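One caveat for the Ruby code in this thread: in Ruby, ^ and $ match at the start and end of every line, not only at the ends of the whole string, while \A and \z anchor to the string itself. A minimal sketch of the difference (the sample text is made up):

```ruby
# Made-up two-line sample; not taken from the paper.
text = "first line\nSecond line"

# ^ matches at the start of every line in Ruby, so this finds "Second".
line_hit = text[/^[A-Z]\w+/]

# \A only matches at the very start of the string, so this finds nothing.
string_hit = text[/\A[A-Z]\w+/]

puts line_hit.inspect
puts string_hit.inspect
```

This is why a pattern that starts with ^ can pick up every reference that begins on its own line when it is run over a whole page of text.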
EDIT
Try this one. I modified it to match the entries from your comment.
^(?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^()]+?[?.!])\s*(?:(?:(?<jurnal>(?:(?!^[A-Z])[^.]+?)),\s*(?<issue>\d+)[^,.]*(?=,\s*\d+|.\s*Ret))|(?:In\s*(?<editors>[^()]+))\(Eds?\.\),\s*(?<book>[^().]+)|(?:[^():]+:[^().]+\.)|(?:Retrieved|Paper presented))
Rubular DEMO
Regex101 DEMO
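A minimal sketch of how the pattern above could be applied in Ruby, collecting every match with String#scan and reading the named groups. The two reference entries are made up for illustration and only stand in for the text of a references page; the constant and variable names are my own.

```ruby
# The pattern from the answer, unchanged (including the `jurnal` group name).
CITATION = /^(?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^()]+?[?.!])\s*(?:(?:(?<jurnal>(?:(?!^[A-Z])[^.]+?)),\s*(?<issue>\d+)[^,.]*(?=,\s*\d+|.\s*Ret))|(?:In\s*(?<editors>[^()]+))\(Eds?\.\),\s*(?<book>[^().]+)|(?:[^():]+:[^().]+\.)|(?:Retrieved|Paper presented))/

# Made-up APA-style entries (assumption), one per line.
text = <<~REFS
  Smith, J., & Doe, A. (2005). A study of sample things. Journal of Examples, 12, 34-56.
  Green, P. (1999). Chapter about samples. In R. White & S. Black (Eds.), Handbook of examples (pp. 10-20).
REFS

citations = []
text.scan(CITATION) do
  m = Regexp.last_match
  # Groups that belong to an alternative that did not match are nil.
  citations << m.names.to_h { |name| [name, m[name]&.strip] }
end

# Print only the groups that actually captured something.
citations.each { |c| puts c.compact.inspect }
```

The captures are stripped because the author group, for example, includes the space before the opening parenthesis of the year.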
The last part, (?:Retrieved|Paper presented), could be extended with other words that can occur after the title.
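If that keyword list grows, it may be easier to keep it in an array and build the alternation from it. A rough sketch in Ruby, assuming a simplified "period, then keyword" context rather than the full pattern; the third keyword and the variable names are hypothetical:

```ruby
# Keywords that may follow the title. The first two come from the regex above;
# "Unpublished manuscript" is an assumed addition for illustration.
keywords = ['Retrieved', 'Paper presented', 'Unpublished manuscript']

# Regexp.union escapes each string and joins them with |.
tail = Regexp.union(keywords)

# Interpolating a Regexp into another regex embeds it as a group.
after_title = /\.\s*(?:#{tail})/

puts after_title.match?('A sample title. Unpublished manuscript.')
puts after_title.match?('A sample title. Journal of Examples, 12.')
```

The same `tail` value could be interpolated in place of the literal (?:Retrieved|Paper presented) group.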
This regex consists of two main parts:
^(?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^()]+?[?.!])\s*
The shared part, which matches the authors, date, and title:
(?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+)
- author group: matches from the beginning of a line, and only if the line starts with a capital letter (to avoid matching somewhere in the body text rather than in the references), followed by letters, whitespace, some punctuation, etc. You can always add symbols to the [A-Za-z\s&.,'’] character class if other characters can occur in this part of a reference.
\((?<year>\d{4})\)
- year group: captures four digits, but only if they are inside parentheses.
(?<title>[^()]+?[?.!])\s*
- title group: captures one or more of any characters except parentheses, followed by a character from the class [?.!]. I used [^()] because during tests I found that it prevents the regex from producing invalid multiline matches. Also important: this part is constrained by the alternatives that follow it. Without them it would give invalid results, so it is not matched independently as a separate part.
(?:(?:(?<jurnal>(?:(?!^[A-Z])[^.]+?)),\s*(?<issue>\d+)[^,.]*(?=,\s*\d+|.\s*Ret))|(?:In\s*(?<editors>[^()]+))\(Eds?\.\),\s*(?<book>[^().]+)|(?:[^():]+:[^().]+\.)|(?:Retrieved|Paper presented))
The alternatives, which match the rest of the content:
(?<jurnal>(?:(?!^[A-Z])[^.]+?)),\s*(?<issue>\d+)[^,.]*(?=,\s*\d+|.\s*Ret))
- this alternative matches the journal title and issue. The jurnal group matches a run of characters other than periods (not .), which must not begin at the start of a line with a capital letter (the (?!^[A-Z]) lookahead), using a lazy quantifier (it matches only as much as is needed to succeed), and it is followed by a comma. [^.] is used because some journal titles contain commas, so I couldn't use [^,], which was my first idea. Instead I bounded this part by the period, which always occurs at the end of a reference; the lazy quantifier grows the match only as far as needed for the rest of the pattern (the comma and the issue number) to match. The issue group matches digits; any content after them up to the next comma or period is skipped, provided it is followed by a comma with digits (page numbers) or by a period and the word Retrieved.
(?:In\s*(?<editors>[^()]+))\(Eds?\.\),\s*(?<book>[^().]+)
- this part matches a reference to an edited book. The editors group matches anything except parentheses (up to the Eds. or Ed. keyword), followed by the book title, which is bounded by the next parenthesis (with page numbers) or period.
(?:[^():]+:[^().]+\.)
- this part matches references that only carry information about the publisher and place of publication. The previous approach, putting a ? on the whole alternatives part, was not effective because it also matched in places where another alternative should have matched.
(?:Retrieved|Paper presented))
- this part matches references to an online source, a presentation, etc. It could be extended with other keywords.
old attempts
If you only need the author, date, title, and journal with issue, you can try:
^(?<author>(?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^?.!]+?[.?!])\s*(?!\s*Retrieved)(?:(?:(?<jurnal>(?:(?!^[A-Z])[^,])+?),\s*(?<issue>\d+)))
DEMO Rubular
DEMO regex101
However, if you are also interested in edited books, try:
^(?<author>(?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^?.!]+?[.?!])\s*(?:(?<retrieved>[Rr]etrieved.+)|(?:(?:(?<jurnal>(?:(?!^[A-Z])[^,])+?),\s*(?<issue>\d+)))|\s*In(?<editors>[^\(]+)\(Eds\.\),(?<book>[^.()]+))?
DEMO Rubular
DEMO regex101
Both regular expressions capture the relevant values into named groups: author, year, title, etc.
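As a rough illustration of reading those named groups in Ruby, here is a sketch using the first of the two older patterns and MatchData#named_captures. The reference entry is made up, and the constant and variable names are my own.

```ruby
# The first "old attempt" pattern from above, unchanged.
SIMPLE = /^(?<author>(?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^?.!]+?[.?!])\s*(?!\s*Retrieved)(?:(?:(?<jurnal>(?:(?!^[A-Z])[^,])+?),\s*(?<issue>\d+)))/

# Made-up APA-style entry (assumption), used only for illustration.
entry = 'Smith, J., & Doe, A. (2005). A study of sample things. Journal of Examples, 12, 34-56.'

m = SIMPLE.match(entry)

# named_captures returns a Hash keyed by group name; strip stray whitespace.
fields = m ? m.named_captures.transform_values { |v| v&.strip } : {}

fields.each { |name, value| puts "#{name}: #{value}" }
```

Note that the hash key for the journal is "jurnal", because that is how the group is spelled in the pattern.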