
ColdFusion - Simple HTML Parsing

We currently have some articles that get posted onto our site. They can appear with either of the following types of HTML:

<p>this is an article<br>
<img src="someimage">
</p>

<p>this is an article<br>
<img src="someimage">
</p>

<p>this is an article<br>
<img src="someimage">
</p>

<p>this is an article<br>
<img src="someimage">
</p>

or

<p><img src="someimage">
this is an article<br>
</p>
<p>this is an article<br>
<img src="someimage">
</p>
<p><img src="someimage">
this is an article<br>
</p>

Other HTML tags may sometimes appear inside these paragraphs. I can't get my head around how to scrape the page with ColdFusion to handle this.

Essentially, what I need to do is grab the first paragraph's text and image so that I can rearrange them.

Is this possible using ColdFusion 8? Could anyone point me in the right direction to learn how to do this?

user125264 asked Jun 02 '26 06:06


1 Answer

100% definitely possible!

Now, don't be put off by what I'm going to suggest; it's actually very easy to get going with this.

Download a library called jSoup. Its sole purpose is scraping content from the DOM of a web page:

http://jsoup.org/

Once the jSoup JAR is on ColdFusion's class path, you would use its Java class by doing something like:

<!--- Get the page. --->
<cfhttp method="get" url="http://example.com/" resolveurl="true" useragent="#cgi.http_user_agent#" result="myPage" timeout="10" charset="utf-8">
<cfhttpparam type="header" name="Accept-Encoding" value="*" />   
<cfhttpparam type="header" name="TE" value="deflate;q=0" />        
</cfhttp>

<!--- Load up jSoup and parse the document with it. --->
<cfset jsoup = createObject("java", "org.jsoup.Jsoup") />
<cfset document = jsoup.parse(myPage.filecontent) />

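<!--- Search the parsed document for the TITLE tag; select() returns a list of matches. --->
<cfset titles = document.select("title") />

<!--- Let's see what we got, skipping pages that have no TITLE tag. --->
<cfif titles.size() gt 0>
    <cfdump var="#titles.first().text()#" />
</cfif>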

This example is pretty simple, but it shows how easy jSoup is to work with. Scraping images and other elements follows the same pattern: select the elements you want, then read their text or attributes. The jSoup docs cover the details.
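For the case in the question (first paragraph's text and image), a minimal sketch might look like the following. The sample markup, the variable names (articleHtml, firstP, imageSrc) and the output layout are assumptions made for illustration, and the jSoup JAR is assumed to already be on the class path. It parses an HTML fragment directly instead of fetching a page, so it can be tried on its own.

```cfml
<!--- Sample markup standing in for an article body (hypothetical content). --->
<cfsavecontent variable="articleHtml">
<p>this is an article<br>
<img src="someimage">
</p>
<p><img src="otherimage">
second article<br>
</p>
</cfsavecontent>

<!--- Parse the fragment; assumes the jSoup JAR is on the ColdFusion class path. --->
<cfset jsoup = createObject("java", "org.jsoup.Jsoup") />
<cfset doc = jsoup.parseBodyFragment(articleHtml) />

<!--- select() returns a (possibly empty) list of matching elements. --->
<cfset paragraphs = doc.select("p") />

<cfif paragraphs.size() gt 0>
    <cfset firstP = paragraphs.first() />

    <!--- text() returns the paragraph's text with the tags stripped out. --->
    <cfset articleText = firstP.text() />

    <!--- Look for an image inside that same paragraph. --->
    <cfset images = firstP.select("img") />
    <cfset imageSrc = "" />
    <cfif images.size() gt 0>
        <cfset imageSrc = images.first().attr("src") />
    </cfif>

    <!--- Arrange the pieces however the layout needs; here the image comes first. --->
    <cfoutput>
        <cfif len(imageSrc)><img src="#htmlEditFormat(imageSrc)#"></cfif>
        <p>#htmlEditFormat(articleText)#</p>
    </cfoutput>
</cfif>
```

Checking size() before calling first() avoids working with a null result when nothing matches, and because the text and image are pulled out separately, it doesn't matter whether the image came before or after the text in the original paragraph.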

There are some good examples on this page, which covers the CSS-style selectors you can use:

http://jsoup.org/cookbook/extracting-data/selector-syntax
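As a rough illustration of those selectors, here are a few that seem relevant to the markup in the question. The inline HTML string and variable names are made up for the example, and again this assumes the jSoup JAR is on the class path.

```cfml
<cfset jsoup = createObject("java", "org.jsoup.Jsoup") />
<cfset doc = jsoup.parseBodyFragment('<p>this is an article<br><img src="someimage"></p><p>text only</p>') />

<!--- Descendant selector: every image that sits inside a paragraph. --->
<cfset imgsInParagraphs = doc.select("p img") />

<!--- :has() pseudo-selector: only paragraphs that contain an image. --->
<cfset paragraphsWithImage = doc.select("p:has(img)") />

<!--- Attribute selector: images that actually carry a src attribute. --->
<cfset imgsWithSrc = doc.select("img[src]") />

<cfoutput>
    #imgsInParagraphs.size()# image(s) inside paragraphs,
    #paragraphsWithImage.size()# paragraph(s) containing an image,
    #imgsWithSrc.size()# image(s) with a src attribute
</cfoutput>
```

Each select() call returns a list, so the same size() and first() handling applies whichever selector you pick.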

Try to avoid using Regex for this task - believe me, I've tried and it's an absolute can of worms!

Hope this helps. Mikey.

Michael Giovanni Pumo answered Jun 04 '26 03:06


