I am trying to parse the DBLP data set using lxml in Python. However, it raises this error:
lxml.etree.XMLSyntaxError: Entity 'uuml' not defined, line 54, column 43
DBLP provides a DTD file that defines these entities. How can I use that file to parse the DBLP XML document?
Here is my current code:
import sqlite3
import sys

from lxml import etree as ET

filename = sys.argv[1]
dtd_name = sys.argv[2]
db_name = sys.argv[3]
conn = sqlite3.connect(db_name)
# note: the DBLP DTD spells the master's thesis element 'mastersthesis'
dblp_record_types_for_publications = ('article', 'inproceedings', 'proceedings', 'book', 'incollection',
                                      'phdthesis', 'mastersthesis', 'www')
# read dtd (not used below; iterparse loads the DTD named in the DOCTYPE itself)
dtd = ET.DTD(dtd_name)  #pylint: disable=E1101
# get an iterable
context = ET.iterparse(filename, events=('start', 'end'), load_dtd=True,  #pylint: disable=E1101
                       resolve_entities=True)
# turn it into an iterator
context = iter(context)
# get the root element
event, root = next(context)
n_records_parsed = 0
for event, elem in context:
    if event == 'end' and elem.tag in dblp_record_types_for_publications:
        # if a record has several <year> or <title> children, the last one wins
        pub_year = None
        for year in elem.findall('year'):
            pub_year = year.text
        if pub_year is None:
            continue
        pub_title = None
        for title in elem.findall('title'):
            pub_title = title.text
        if pub_title is None:
            continue
        pub_authors = []
        for author in elem.findall('author'):
            if author.text is not None:
                pub_authors.append(author.text)
        # insert the publication and its authors into the sql tables;
        # '?' placeholders let sqlite3 do the quoting instead of manual escaping
        conn.execute("INSERT OR IGNORE INTO publications VALUES (?, ?)", (pub_title, pub_year))
        for author in pub_authors:
            conn.execute("INSERT OR IGNORE INTO authors VALUES (?)", (author,))
            conn.execute("INSERT INTO authored VALUES (?, ?)", (author, pub_title))
        # free the memory held by elements that have already been processed
        elem.clear()
        root.clear()
        n_records_parsed += 1
print("No. of records parsed: {}".format(n_records_parsed))
conn.commit()
conn.close()
As mzjn suggested in the comments, I put the DTD file in the same directory as the XML file and made sure its filename matches the one named in the XML document's doctype declaration (<!DOCTYPE dblp SYSTEM "dblp.dtd">). With that in place, the parser no longer raises syntax errors.
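To make the fix concrete, here is a minimal sketch of the same setup on a tiny scale. The two file contents below are hypothetical stand-ins for the real dblp.dtd and dblp.xml (the real DTD defines many more entities and all the element declarations), and parse_authors is just an illustrative helper name. The point is only that the SYSTEM identifier in the doctype is resolved relative to the XML file, so the DTD has to sit next to it:

```python
import os
import tempfile

from lxml import etree as ET

# Hypothetical miniature stand-ins for dblp.dtd and dblp.xml.
DTD_TEXT = '<!ENTITY uuml "&#252;">\n'
XML_TEXT = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
    '<dblp><article><author>J&uuml;rgen Example</author></article></dblp>\n'
)


def parse_authors(xml_path):
    """Collect author names, letting lxml load the DTD named in the DOCTYPE."""
    authors = []
    for _event, elem in ET.iterparse(xml_path, events=('end',), tag='author',
                                     load_dtd=True, resolve_entities=True):
        authors.append(elem.text)
    return authors


with tempfile.TemporaryDirectory() as tmp_dir:
    # The DTD must sit next to the XML file and carry the name used in the
    # DOCTYPE, because "dblp.dtd" is resolved relative to the XML document.
    with open(os.path.join(tmp_dir, 'dblp.dtd'), 'w', encoding='ascii') as dtd_file:
        dtd_file.write(DTD_TEXT)
    xml_path = os.path.join(tmp_dir, 'dblp.xml')
    with open(xml_path, 'w', encoding='ascii') as xml_file:
        xml_file.write(XML_TEXT)
    authors = parse_authors(xml_path)

print(authors)
```

If the DTD file is missing or has a different name than the doctype declares, the same call should fail with the "Entity 'uuml' not defined" error from the question, since lxml then has no entity definitions to work with.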