
Parsing big XML files using SAX parser (skip some lines/tags)

I am currently developing an app that retrieves data from the internet and parses it with SAX. I have used SAX before to parse simple XML files, such as the output of the Google Weather API. However, the websites I am interested in now take parsing to the next level: the pages are huge and messy. I only need to retrieve a few specific lines; the rest is of no use to me.
Is it possible to skip those useless lines/tags, or do I have to go through the document step by step?

Amine asked Aug 05 '10 04:08


4 Answers

I like commons-digester. It lets you register rules against particular tags, and a rule is executed only when its tag is encountered.

Digester is built on top of SAX, so it has all the SAX features plus the specificity needed to selectively process particular tags. It also maintains a stack: a new element is pushed when the corresponding tag is encountered and popped when that element ends.

I use it for parsing all my configuration files.

Check out digester at http://commons.apache.org/digester/
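
To make the "rule per tag plus a stack" idea concrete, here is a minimal sketch that uses only the JDK SAX parser. It is not the Digester API; the class `PathRuleHandler` and its methods `addRule` and `parse` are hypothetical names invented for this illustration. It keeps a stack of open element names and fires a callback only when the current path matches a registered rule.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration of path-based rules on top of SAX (NOT the Digester API).
class PathRuleHandler extends DefaultHandler {
    // Maps an element path such as "config/server/name" to the action to run for it.
    private final Map<String, Consumer<String>> rules = new HashMap<>();
    // Stack of currently open element names, root first.
    private final Deque<String> path = new ArrayDeque<>();
    // Text seen since the most recent start or end tag (enough for leaf elements).
    private final StringBuilder text = new StringBuilder();

    void addRule(String elementPath, Consumer<String> action) {
        rules.put(elementPath, action);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        path.addLast(qName);   // push when the tag opens
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Run the rule only if one was registered for the current path.
        Consumer<String> action = rules.get(String.join("/", path));
        if (action != null) {
            action.accept(text.toString().trim());
        }
        path.removeLast();     // pop when the element ends
        text.setLength(0);
    }

    // Convenience wrapper that parses an in-memory XML string with the given handler.
    static void parse(String xml, PathRuleHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }
}
```

With this sketch, registering `addRule("config/server/name", names::add)` would collect only the server names and silently pass over every other element. Digester itself offers a much richer rule set, so treat this only as a picture of the mechanism.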

raja kolluru answered Nov 18 '22 18:11


Yes, you can do it: just ignore the tags you are not interested in. Note, however, that the parser still has to read the entire document. In your DefaultHandler implementation:

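// These methods override org.xml.sax.helpers.DefaultHandler.
// Note: localName is only populated when the parser is namespace-aware;
// otherwise it is empty and you should compare against qName instead.
@Override
public void startElement(String uri, String localName,
     String qName, Attributes attributes) {
  if (localName.equals("myInterestingTag")) {
     // do your thing....
  }
}

@Override
public void endElement(String uri, String localName, String qName) {
  if (localName.equals("myInterestingTag")) {
     // do your thing....
  }
}

@Override
public void characters(char[] ch, int start, int length) {
  // if currently inside myInterestingTag, process the text here.
}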
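
For a complete, runnable version of the same idea, the sketch below collects the text of one tag and ignores everything else. The class `InterestingTagCollector` and its `collect` method are hypothetical names chosen for this example. It compares against `qName` because the default JDK `SAXParserFactory` is not namespace-aware.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text of every <tagName> element and ignores the rest.
class InterestingTagCollector extends DefaultHandler {
    private final String tagName;
    private final List<String> values = new ArrayList<>();
    // Non-null only while the parser is inside the interesting tag.
    private StringBuilder current;

    InterestingTagCollector(String tagName) {
        this.tagName = tagName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equals(tagName)) {
            current = new StringBuilder();   // start capturing text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // characters() may be called several times per element, so append.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(tagName) && current != null) {
            values.add(current.toString().trim());
            current = null;                  // stop capturing text
        }
    }

    // Parses an in-memory XML string and returns the text of each matching element.
    static List<String> collect(String xml, String tagName) {
        InterestingTagCollector handler = new InterestingTagCollector(tagName);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return handler.values;
    }
}
```

The parser still walks the whole document, as noted above; the handler simply does no work for the tags it does not care about.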

naikus answered Nov 18 '22 18:11


Yes, you can skip them. Just handle the tags you want in your handler, and only the values of those tags will be fetched.

Hare-Krishna answered Nov 18 '22 19:11


You can try XPath, which does the parsing for you behind the scenes. The downside is that the XML is parsed again on every call to the XPath evaluate method.
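
As a rough sketch of that trade-off using the standard `javax.xml.xpath` API: the first method re-parses the input on each call, while the second parses once into a DOM and reuses it for several expressions. The class `XPathLookup`, its method names, and the sample element names are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper comparing two ways of running XPath queries.
class XPathLookup {
    // Parses the XML on every call; fine for a single query.
    static String evaluateOnce(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    // Parses once into a DOM tree, then evaluates each expression against it.
    static String[] evaluateMany(String xml, String... expressions) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            String[] results = new String[expressions.length];
            for (int i = 0; i < expressions.length; i++) {
                results[i] = xpath.evaluate(expressions[i], doc);
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

Keep in mind that the second variant holds the whole document in memory as a DOM tree, which may not suit very large pages; a plain SAX handler avoids that cost.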

Georgy Bolyuba answered Nov 18 '22 17:11