Regex parsing citation issue

Tags: regex, pdf, ruby-2.0

I am trying to extract citations from a PDF. I confirmed that my regex worked on Rubular, but when I test my code on a real PDF it spits out oddly spaced and wrong information. How can I fix this regex so that it only extracts APA paper citations (the ones in the references section, not the in-text ones)? The APA examples might be useful. I am trying to get the references from a research paper. If you need more details, please let me know. Multiple regexes are acceptable for this answer, but I do need to be able to extract author, title, date, and journal. My attempt is below in case it helps anybody:

require 'open-uri'   # lets open() read from a URL (on Ruby 3.0+ this must be URI.open)
require 'pdf-reader'

io = open('https://vhil.stanford.edu/pubs/2007/yee-proteus-effect.pdf')
out = open('dump.txt', 'w')
reader = PDF::Reader.new(io)

reader.pages.each do |page|
    /([a-zA-Z.,&\s]+?)(\(\d+\)).([\sa-zA-Z,:\n\t]+).([\sa-zA-Z,]+).([\sa-zA-Z,]+)/.match(page.text){|m|
        # m[0] is the whole match; the capture groups start at m[1]
        puts "===CITATION===="
        # [\n\r\t] removes any one of these characters; /\n\r\t/ would only match the exact sequence
        puts "author: "+m[1].gsub(/[\n\r\t]/,'')
        puts "title: "+m[3].gsub(/[\n\r\t]/,'')
        puts "date: "+m[2].gsub(/[\n\r\t]/,'')
        puts "journal: "+m[4].gsub(/[\n\r\t]/,'')
    }
    #puts page.raw_content
end
puts "\n\n\n=======\n\n\n======"
puts reader.pages.last

More examples (in response to comments): here and here. The entire paper: here.

To get these examples I ran out.puts page.text inside my each loop. Then I copied chunks of the text into Rubular and tested them against my original regex (above).

asked Sep 25 '15 04:09

Rilcon42

1 Answer

EDIT

Try this one. I modified it so that it also matches the entries from your comment.

^(?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^()]+?[?.!])\s*(?:(?:(?<jurnal>(?:(?!^[A-Z])[^.]+?)),\s*(?<issue>\d+)[^,.]*(?=,\s*\d+|.\s*Ret))|(?:In\s*(?<editors>[^()]+))\(Eds?\.\),\s*(?<book>[^().]+)|(?:[^():]+:[^().]+\.)|(?:Retrieved|Paper presented))

Rubular demo
Regex101 demo

The last part, (?:Retrieved|Paper presented), can be extended with other words that may occur directly after the title.
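One way to keep that last alternative easy to extend is to build it from a list of keywords. The following is only a minimal sketch: the constant names are made up for illustration, "Unpublished manuscript" is a hypothetical extra keyword, and the sample string is invented. Only the title part of the pattern is reused here, to keep the example short.

```ruby
# Keywords that may follow the title. The third entry is a hypothetical
# addition; the first two come from the pattern above.
TAIL_KEYWORDS = ['Retrieved', 'Paper presented', 'Unpublished manuscript'].freeze

# Regexp.union escapes each string and joins them with "|".
TAIL = Regexp.union(TAIL_KEYWORDS)

# Interpolating a Regexp into a literal embeds it as a non-capturing group,
# so it plays the same role as (?:Retrieved|Paper presented).
AFTER_TITLE = /(?<title>[^()]+?[?.!])\s*#{TAIL}/

# Made-up input, used only to exercise the pattern.
m = AFTER_TITLE.match('A sample title. Unpublished manuscript, Example University.')
puts m[:title]
```

Adding another keyword then only means adding a string to the list, without touching the rest of the expression.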

This regex consists of two main parts:

^(?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^()]+?[?.!])\s* is the shared part, which matches the authors, date and title:
- (?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+) - the author group. It matches from the beginning of a line and only if the first character is a capital letter, to avoid matching somewhere in the body text rather than in the references. The capital is followed by letters, whitespace, some punctuation, etc. You can always add symbols to the [A-Za-z\s&.,'’] character class if other characters can occur in this part of a reference.
- \((?<year>\d{4})\) - the year group captures four digits, but only if they are inside parentheses.
- (?<title>[^()]+?[?.!])\s* - the title group captures one or more characters of any kind except parentheses, followed by a character from the class [?.!]. I used [^()] because during tests I found that it prevents invalid multi-line matches. Just as important, this part is constrained by the alternatives that follow it. Without them it would give invalid results, so it is not meant to be matched on its own.

(?:(?:(?<jurnal>(?:(?!^[A-Z])[^.]+?)),\s*(?<issue>\d+)[^,.]*(?=,\s*\d+|.\s*Ret))|(?:In\s*(?<editors>[^()]+))\(Eds?\.\),\s*(?<book>[^().]+)|(?:[^():]+:[^().]+\.)|(?:Retrieved|Paper presented)) holds the alternatives for matching the rest of the content:
- (?<jurnal>(?:(?!^[A-Z])[^.]+?)),\s*(?<issue>\d+)[^,.]*(?=,\s*\d+|.\s*Ret) - this alternative matches the journal title and issue. The jurnal group must not begin at the start of a line with a capital letter (the (?!^[A-Z]) lookahead), which is where a new reference would start. It then matches any characters other than a period with a lazy quantifier (it matches only as much as is necessary to succeed) and must be followed by a comma. I used [^.] because some journal titles contain commas, so [^,], which was my first idea, would not work. Instead the match is bounded by the period, which always occurs at the end of a reference, and the reluctant quantifier lets the group stop as soon as the rest of the pattern can match. The issue group matches the digits; [^,.]* then consumes any further content up to the next comma or period, provided it is followed by a comma and digits (page numbers) or by a period and the word Retrieved.
- (?:In\s*(?<editors>[^()]+))\(Eds?\.\),\s*(?<book>[^().]+) - this part matches a reference to an edited book. The editors group matches anything except parentheses (up to the keyword Eds. or Ed.), and the book group matches the title that follows, which ends at the next parenthesis (page numbers) or period.
- (?:[^():]+:[^().]+\.) - this part matches references that only give the place of publication and the publisher. My previous approach, putting a ? on the whole alternatives part, was not effective, because it also matched in places where another alternative should have matched.
- (?:Retrieved|Paper presented) - this part matches references to an online source, a presentation, etc. It could be extended with other keywords.
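To see which groups each alternative fills in, the pattern can be run over a few lines of text in Ruby. This is a minimal sketch, not a tested extractor for real PDFs: the two reference entries are made up, and the helper name citations is an assumption introduced for the example. The pattern itself is copied unchanged, including the group name jurnal.

```ruby
# The pattern from the EDIT above, unchanged (including the group name "jurnal").
CITATION = /^(?<author>[A-Z](?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^()]+?[?.!])\s*(?:(?:(?<jurnal>(?:(?!^[A-Z])[^.]+?)),\s*(?<issue>\d+)[^,.]*(?=,\s*\d+|.\s*Ret))|(?:In\s*(?<editors>[^()]+))\(Eds?\.\),\s*(?<book>[^().]+)|(?:[^():]+:[^().]+\.)|(?:Retrieved|Paper presented))/

# Made-up reference entries, used only to exercise the pattern.
SAMPLE = [
  'Doe, J., & Roe, A. (2007). A sample title. Journal of Examples, 33, 271-290.',
  'Smith, K. (1999). A chapter title. In B. Jones & C. Lee (Eds.), Handbook of Samples (pp. 1-20).'
].join("\n")

# Hypothetical helper: collect every MatchData, restarting the search
# where the previous match ended.
def citations(text, pattern)
  found = []
  pos = 0
  while (m = pattern.match(text, pos))
    found << m
    pos = m.end(0)
  end
  found
end

citations(SAMPLE, CITATION).each do |m|
  # Groups that belong to another alternative are nil, so skip them.
  fields = m.names.map { |name| [name, m[name]] }.reject { |_, value| value.nil? }
  puts fields.map { |name, value| "#{name}: #{value.strip}" }.join(' | ')
end
```

The first entry goes through the journal alternative (jurnal and issue are set), the second through the edited-book alternative (editors and book are set). Some captures keep a trailing space, hence the strip.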

old attempts

If you only need the author, date, title and journal with issue, you can try:

^(?<author>(?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^?.!]+?[.?!])\s*(?!\s*Retrieved)(?:(?:(?<jurnal>(?:(?!^[A-Z])[^,])+?),\s*(?<issue>\d+)))

Rubular demo
Regex101 demo

However, if you are also interested in edited books, try:

^(?<author>(?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^?.!]+?[.?!])\s*(?:(?<retrieved>[Rr]etrieved.+)|(?:(?:(?<jurnal>(?:(?!^[A-Z])[^,])+?),\s*(?<issue>\d+)))|\s*In(?<editors>[^\(]+)\(Eds\.\),(?<book>[^.()]+))?

Rubular demo
Regex101 demo

Both regular expressions capture the relevant values into named groups: author, year, title, etc.
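In Ruby the named groups can be read straight from the MatchData and mapped onto the four fields asked for in the question. The following is a rough sketch under a few assumptions: the reference entry is made up, the constant and variable names are chosen for the example, and the pattern is the first one from the old attempts, copied unchanged.

```ruby
# The first pattern from "old attempts", unchanged.
JOURNAL_ONLY = /^(?<author>(?:(?!$)[A-Za-z\s&.,'’])+)\((?<year>\d{4})\)\.?\s*(?<title>[^?.!]+?[.?!])\s*(?!\s*Retrieved)(?:(?:(?<jurnal>(?:(?!^[A-Z])[^,])+?),\s*(?<issue>\d+)))/

# A made-up reference entry, used only for illustration.
entry = 'Doe, J., & Roe, A. (2007). A sample title. Journal of Examples, 33, 271-290.'

m = JOURNAL_ONLY.match(entry)

# Pair each group name with its capture; both lists are in group order.
fields = Hash[m.names.zip(m.captures)]

# The author capture keeps the space before "(", so strip every value.
fields.each { |name, value| fields[name] = value.strip }

puts "author: #{fields['author']}"
puts "title: #{fields['title']}"
puts "date: #{fields['year']}"
puts "journal: #{fields['jurnal']}, #{fields['issue']}"
```

Note that the key for the journal is jurnal, because that is how the group is spelled in the pattern.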

answered Sep 27 '22 22:09

m.cekiera
