
Scala and HTML parsing

How do you load an HTML document into Scala as a DOM? The scala.xml.XML singleton raised errors when it tried to load the xmlns tags.

import java.net.URL
import scala.xml.{Elem, XML}

object NetParse {

   // Fetches the document at sUrl and parses it with XML.load,
   // which expects well-formed XML.
   def netParse(sUrl: String): Elem = {
       val url = new URL(sUrl)
       val connect = url.openConnection

       XML.load(connect.getInputStream)
   }
}

I finally found a solution. It requires Scala 2.7.7 or higher to work (2.7.0 has a fatal bug): How-to-use-TagSoup-with-Scala-XML

Luigimax asked Nov 08 '09 09:11


3 Answers

This may help you: "Processing real world HTML as if it were XML in scala"

priyanka.sarkar answered Sep 22 '22 01:09


Try using scala.xml.parsing.XhtmlParser instead.
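
A minimal sketch of what that could look like, assuming the scala-xml classes are on the classpath (they ship as a separate module from Scala 2.11 onward). The object name, the parseXhtml helper and the sample markup are made up for illustration. XhtmlParser still expects well-formed markup, so this is unlikely to help with sloppy real-world HTML.

```scala
import scala.io.Source
import scala.xml.NodeSeq
import scala.xml.parsing.XhtmlParser

object XhtmlParseSketch {

  // Hypothetical helper: parses XHTML markup held in a string.
  // The input still has to be well-formed.
  def parseXhtml(markup: String): NodeSeq =
    XhtmlParser(Source.fromString(markup))

  // Made-up sample document carrying an xmlns declaration like the one in the question.
  val sample: String =
    """<html xmlns="http://www.w3.org/1999/xhtml">""" +
      "<head><title>Demo</title></head>" +
      "<body><p>Hello</p></body></html>"

  def main(args: Array[String]): Unit = {
    val doc = parseXhtml(sample)
    // Look up the <title> element anywhere in the tree and print its text.
    println((doc \\ "title").text)
  }
}
```

To read from a URL instead of a string, scala.io.Source.fromURL could supply the Source in the same way.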

Daniel C. Sobral answered Sep 22 '22 01:09


I have just tried this with Scala 2.8.1 and ended up using the work from:

http://www.hars.de/2009/01/html-as-xml-in-scala.html

The interesting bit that I needed was:

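// TagSoup's SAX parser accepts malformed, real-world HTML
val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
val parser = parserFactory.newSAXParser()
// The InputSource is built from a system ID, here the URL of the page to load
val source = new org.xml.sax.InputSource("http://www.scala-lang.org")
// NoBindingFactoryAdapter turns the SAX events into scala.xml nodes
val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
val root = adapter.loadXML(source, parser)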

Jesse Eichar answered Sep 22 '22 01:09