
XML parsing in Ruby

I am using Ruby's REXML parser to parse an XML file. On a 64-bit AIX box with 64-bit Ruby, I get the following error:

REXML::ParseException: #<REXML::ParseException: #<RegexpError: Stack overflow in 
regexp matcher: 
/^<((?>(?:[\w:][\-\w\d.]*:)?[\w:][\-\w\d.]*))\s*((?>\s+(?:[\w:][\-\w\d.]*:)?[\w:][\-\w\d.]*\s*=\s*(["']).*?\3)*)\s*(\/)?>/mu>

The call that triggers it looks something like this:

REXML::Document.new(File.open(actual_file_name, "r"))

Does anyone know how to solve this issue?
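To narrow down which files trigger the failure, the call can be wrapped so that the file handle is always closed and the parse error is reported instead of aborting the script. This is only a minimal sketch: `load_xml` is a hypothetical helper name, and it does not fix the underlying regexp problem.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): parse one XML file and
# return the document, or nil if REXML raises a parse error.
def load_xml(path)
  # The block form of File.open closes the handle even if parsing fails.
  File.open(path, 'r') { |io| REXML::Document.new(io) }
rescue REXML::ParseException => e
  # Print only the first line of the (often very long) message.
  warn "REXML failed on #{path}: #{e.message.lines.first}"
  nil
end

# Usage: ruby this_script.rb some_file.xml
path = ARGV.first
if path
  doc = load_xml(path)
  puts "root element: #{doc.root.name}" if doc && doc.root
end
```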

Ricketyship asked Dec 03 '22 00:12



2 Answers

I've had several issues with REXML; it doesn't seem to be the most mature library. I usually use Nokogiri for XML parsing in Ruby, which should be faster and more stable than REXML. After installing it with sudo gem install nokogiri, you can use something like this to get a DOM instance:

doc = Nokogiri.XML(File.open(actual_file_name, 'rb'))
# => #<Nokogiri::XML::Document:0xf1de34 name="document" [...] >

The documentation on the official webpage is also much better than that of REXML, IMHO.

Niklas B. answered Dec 04 '22 13:12



I almost immediately found the answer.

The first thing I did was search the Ruby source code for the error being thrown. I found that regex.h was responsible for it.

In regex.h, the relevant definition looks something like this:

/* Maximum number of duplicates an interval can allow.  */
#ifndef RE_DUP_MAX
#define RE_DUP_MAX  ((1 << 15) - 1)
#endif

The problem here is RE_DUP_MAX. On the AIX box, the same constant is also defined under /usr/include. I searched for it and found it in:

/usr/include/NLregexp.h
/usr/include/sys/limits.h
/usr/include/unistd.h

I am not sure which of the three is being used (most probably NLregexp.h). In these headers, the value of RE_DUP_MAX is set to 255! So there is a cap on the number of repetitions in a regex!
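The header search can be repeated with a short script. This is a rough sketch under stated assumptions: `find_macro_definitions` is a hypothetical helper, the three paths are the ones listed above, and it only matches plain `#define` lines (it does not evaluate the preprocessor, so it cannot tell which definition actually wins).

```ruby
# Hypothetical helper: return { path => [values...] } for every
# "#define <macro> <value>" line found in the given header files.
def find_macro_definitions(macro, paths)
  pattern = /^\s*#\s*define\s+#{Regexp.escape(macro)}\s+(.+?)\s*$/
  paths.each_with_object({}) do |path, found|
    next unless File.file?(path) # skip headers that do not exist here

    File.foreach(path) do |line|
      # scrub replaces invalid bytes so matching cannot raise on odd encodings
      match = pattern.match(line.scrub)
      (found[path] ||= []) << match[1] if match
    end
  end
end

headers = %w[
  /usr/include/NLregexp.h
  /usr/include/sys/limits.h
  /usr/include/unistd.h
]

find_macro_definitions('RE_DUP_MAX', headers).each do |path, values|
  puts "#{path}: RE_DUP_MAX = #{values.join(', ')}"
end
```

On a machine without these headers the script simply prints nothing.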

In short, because regex.h defines RE_DUP_MAX only under #ifndef, the compilation picks up the system-defined value (255) rather than the one regex.h would otherwise define ((1 << 15) - 1)!
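A quick way to check whether a given Ruby build is affected is to try compiling intervals on either side of 255. This is a sketch that assumes the lower cap shows up as a RegexpError at compile time; `interval_compiles?` is a hypothetical helper, and the check says nothing about the stack overflow in the matcher itself.

```ruby
# Hypothetical helper: true if this Ruby build accepts /a{count}/.
def interval_compiles?(count)
  Regexp.new("a{#{count}}")
  true
rescue RegexpError
  false
end

# 255 is the AIX header value; (1 << 15) - 1 is the regex.h default.
[255, 256, (1 << 15) - 1].each do |count|
  status = interval_compiles?(count) ? 'ok' : 'rejected'
  puts "a{#{count}}: #{status}"
end
```

If 256 is rejected while 255 is accepted, the build was most likely compiled against the system-defined RE_DUP_MAX.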

This also answers a question I asked recently: Regex limit in ruby 64 bit aix compilation

I was not able to answer it immediately, as I needed a minimum of 100 reputation :D :D Cheers!

2 revs answered Dec 04 '22 12:12
