
Parsing BibTeX citation format with Python

What is the best way in Python to parse these results? I have tried regex but can't get it to work. I am looking for a dictionary with title, author, etc. as keys.

@article{perry2000epidemiological,
  title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},
  author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
  journal={Journal of public health},
  volume={22},
  number={3},
  pages={427--434},
  year={2000},
  publisher={Oxford University Press}
}
gmoorevt asked Sep 11 '25 07:09



1 Answer

This is the BibTeX citation format. For a simple entry like this one, you could parse it like this:

>>> import re

>>> kv = re.compile(r'\b(?P<key>\w+)={(?P<value>[^}]+)}')

>>> citation = """
... @article{perry2000epidemiological,
...   title={An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study},
...   author={Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others},
...   journal={Journal of public health},
...   volume={22},
...   number={3},
...   pages={427--434},
...   year={2000},
...   publisher={Oxford University Press}
... }
... """

>>> dict(kv.findall(citation))
{'author': 'Perry, Sarah and Shaw, Christine and Assassa, Philip and Dallosso, Helen and Williams, Kate and Brittain, Katherine R and Mensah, Fiona and Smith, Nigel and Clarke, Michael and Jagger, Carol and others',
 'journal': 'Journal of public health',
 'number': '3',
 'pages': '427--434',
 'publisher': 'Oxford University Press',
 'title': 'An epidemiological study to establish the prevalence of urinary symptoms and felt need in the community: the Leicestershire MRC Incontinence Study',
 'volume': '22',
 'year': '2000'}

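The regex uses two named capturing groups (mainly to make it visually obvious what's what). findall() returns one (key, value) tuple per match, which is why the result can be passed straight to dict().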

  • "key" is one or more Unicode word characters, with a word boundary on the left and a literal equals sign on the right;
  • "value" is whatever sits between a pair of curly brackets. You can use [^}] conveniently as long as you don't expect "nested" curly brackets. In other words, a value is just one or more characters that aren't a closing curly bracket, wrapped in curly brackets.
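If nested curly brackets do show up (BibTeX titles often protect capitals with them, e.g. {B}ib{T}e{X}), the [^}] pattern stops at the first closing bracket. One possible workaround is to use the regex only to find where a field starts and then count brackets by hand. This is a minimal sketch, not a full BibTeX parser: parse_fields and FIELD_START are hypothetical names, the sample entry is made up, and it assumes every value is brace-delimited (no quoted values, bare numbers, or escaped brackets).

```python
import re

# Hypothetical helper: finds "key = {" and leaves the cursor just past the "{".
FIELD_START = re.compile(r'\b(\w+)\s*=\s*\{')

def parse_fields(entry):
    """Return a dict of field name -> value for one BibTeX entry string."""
    fields = {}
    pos = 0
    while True:
        m = FIELD_START.search(entry, pos)
        if m is None:
            break
        depth = 1              # already inside the field's opening bracket
        i = m.end()
        while i < len(entry) and depth > 0:
            if entry[i] == '{':
                depth += 1
            elif entry[i] == '}':
                depth -= 1
            i += 1
        if depth != 0:
            raise ValueError('unbalanced brackets in field %r' % m.group(1))
        # entry[m.end():i - 1] is the text between the outermost brackets
        fields[m.group(1).lower()] = entry[m.end():i - 1]
        pos = i                # resume the search after this value
    return fields

# Made-up sample entry with nested brackets in the title.
entry = '@book{k, title={The {B}ib{T}e{X} Book}, year = {2000}}'
print(parse_fields(entry))
# {'title': 'The {B}ib{T}e{X} Book', 'year': '2000'}
```

On the same sample, the findall approach above would cut the title off at 'The {B', so the extra bookkeeping only pays off when your entries actually contain nested brackets.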
Brad Solomon answered Sep 12 '25 21:09
