I am a complete Perl newb, but I am certain that learning Perl will be easier than figuring out how to parse XML in awk. I would like to parse the .sgm files from this dataset:
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
This is a collection of roughly 20,000 Reuters newswire articles from a decade ago, and it is a standard test set for certain types of text processing. To simplify my Perl testing, I took the first few hundred lines of the first file and saved them as test.sgm, planning to work with that until my script ran correctly. It starts out like this:
<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="5544" NEWID="1">
<DATE>26-FEB-1987 15:01:01.79</DATE>
<TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
C T
f0704reute
u f BC-BAHIA-COCOA-REVIEW 02-26 0105</UNKNOWN>
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE> SALVADOR, Feb 26 - </DATELINE><BODY>Showers continued throughout the week in
the Bahia cocoa zone, alleviating the drought since early
January and improving prospects for the coming temporao,...
I used a Perl script from http://www.xml.com/pub/a/2001/05/16/perlxml.html as an example, and ended up with this, extract.pl:
use XML::DOM;
my $file = $ARGV[0];
my $parser = XML::DOM::Parser->new();
my $doc = $parser->parsefile($file);
#print $doc->getElementsByTagName('DATE');
print "\n";
and I get this output:
> perl extract.pl test.sgm
reference to invalid character number at line 11, column 0, byte 343 at /usr/lib64/perl5/vendor_perl/5.8.5/x86_64-linux-thread-multi/XML/Parser.pm line 187
>
Google doesn't help (the top hit appears to be a page from someone hitting the same error I am) and my Perl hacker friend is still hung over from Blackhat in Vegas. Any ideas what I'm doing wrong, or how I can clean the file? I assume the badness is happening inside that UNKNOWN element, which I don't even need. I really just want to extract the text from every article. If you need more info please let me know.
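The numeric character reference the parser is complaining about (line 11 of the file, inside the UNKNOWN element) points at a control character, and that is not legal in a well-formed XML document. I refer you to section 4.1, Character and Entity References, of the XML recommendation: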
Characters referred to using character references MUST match the production for Char.
Now if we follow the link and look at the production for Char:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
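we see that there are some characters that can appear neither literally nor as a numeric character reference in a well-formed XML document. Everything below #x20 other than tab (#x9), line feed (#xA) and carriage return (#xD) falls into that group.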
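To make the production concrete, here is a minimal Perl sketch of a check against those ranges. The function name is_xml_char and the sample code points are my own choices for illustration, not something taken from the spec or from the parser.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the code point matches the XML 1.0
# Char production quoted above, false otherwise.
sub is_xml_char {
    my ($cp) = @_;
    return 1 if $cp == 0x9 || $cp == 0xA || $cp == 0xD;
    return 1 if $cp >= 0x20    && $cp <= 0xD7FF;
    return 1 if $cp >= 0xE000  && $cp <= 0xFFFD;
    return 1 if $cp >= 0x10000 && $cp <= 0x10FFFF;
    return 0;
}

# Illustrative code points: 5 and 22 are control characters,
# 9 is a tab and 65 is the letter "A".
for my $cp (5, 9, 22, 65) {
    printf "&#%d; is %s\n", $cp, is_xml_char($cp) ? "legal" : "not legal";
}
```

Any reference whose number fails this check gets rejected by a conforming parser, no matter which element it sits in.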
An oddity that; I've learned something about XML today :).
See this conversation on ASCII control characters in XML for a possible workaround.
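Since you say you don't need the UNKNOWN content anyway, one option is to drop the offending references before the parser ever sees them. This is only a sketch of that idea and not necessarily what the linked conversation recommends. The names strip_illegal_char_refs and is_xml_char, and the sample string, are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the code point matches the XML 1.0
# Char production.
sub is_xml_char {
    my ($cp) = @_;
    return 1 if $cp == 0x9 || $cp == 0xA || $cp == 0xD;
    return 1 if $cp >= 0x20    && $cp <= 0xD7FF;
    return 1 if $cp >= 0xE000  && $cp <= 0xFFFD;
    return 1 if $cp >= 0x10000 && $cp <= 0x10FFFF;
    return 0;
}

# Hypothetical helper: delete every decimal (&#N;) or hexadecimal
# (&#xN;) character reference whose code point is not a legal XML
# character, and leave the legal ones untouched.
sub strip_illegal_char_refs {
    my ($text) = @_;
    $text =~ s{(&#(?:x([0-9A-Fa-f]+)|([0-9]+));)}{
        my $cp = defined $2 ? hex($2) : $3;
        is_xml_char($cp) ? $1 : "";
    }ge;
    return $text;
}

# Made-up sample input, not a line copied from the dataset.
my $sample = '<UNKNOWN>&#5;&#5;C T&#9;ok</UNKNOWN>';
print strip_illegal_char_refs($sample), "\n";
```

You could read test.sgm into a string, run it through this, and hand the result to $parser->parse($clean) in place of parsefile. The file may still have other well-formedness problems, so treat this as a first step, not a guaranteed fix.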