We currently have some articles that get posted to our site. They can appear with either of the following kinds of HTML:
<p>this is an article<br>
<img src="someimage">
</p>
<p>this is an article<br>
<img src="someimage">
</p>
<p>this is an article<br>
<img src="someimage">
</p>
<p>this is an article<br>
<img src="someimage">
</p>
or
<p><img src="someimage">
this is an article<br>
</p>
<p>this is an article<br>
<img src="someimage">
</p>
<p><img src="someimage">
this is an article<br>
</p>
Some other HTML tags may sometimes appear inside these paragraphs. I can't get my head around how to scrape the page using ColdFusion to achieve this.
Essentially, what I need to do is grab the first paragraph's text and image and be able to arrange them.
Is this possible using ColdFusion 8? Would anyone be able to point me in the right direction on how to learn this?
100% definitely possible!
Now, don't be put off by what I'm going to suggest, it's actually very easy to get going with this.
Download a library called jSoup. Its sole purpose is scraping content from the DOM of a web page:
http://jsoup.org/
You would then use this Java class by doing something like:
<!--- Get the page. --->
<cfhttp method="get" url="http://example.com/" resolveurl="true" useragent="#cgi.http_user_agent#" result="myPage" timeout="10" charset="utf-8">
<cfhttpparam type="header" name="Accept-Encoding" value="*" />
<cfhttpparam type="header" name="TE" value="deflate;q=0" />
</cfhttp>
<!--- Load up jSoup and parse the document with it. --->
<cfset jsoup = createObject("java", "org.jsoup.Jsoup") />
<cfset document = jsoup.parse(myPage.filecontent) />
<!--- Search the parsed document for the contents of the TITLE tag. --->
<cfset title = document.select("title").first() />
<!--- Let's see what we got. --->
<cfdump var="#title#" />
This example is pretty simple, but it shows just how easy jSoup is to work with. Scraping images and anything else is fairly easy once you check out the jSoup docs.
There are some good examples on this page, where you can use CSS style selectors:
http://jsoup.org/cookbook/extracting-data/selector-syntax
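To tie this back to the question, here is a rough sketch of grabbing the first paragraph's text and image with those selectors. It assumes the jSoup JAR is already on ColdFusion's classpath, and the sample markup and the variable names (articleHtml, firstParagraph, imageSrc and so on) are made up for illustration; in practice you would parse myPage.filecontent instead.

```cfm
<!--- Sample markup standing in for a fetched article page (hypothetical content). --->
<cfsavecontent variable="articleHtml">
<p><img src="someimage">
this is an article<br>
</p>
<p>this is a second article<br>
<img src="otherimage">
</p>
</cfsavecontent>

<!--- Assumes the jSoup JAR is on ColdFusion's classpath. --->
<cfset jsoup = createObject("java", "org.jsoup.Jsoup") />
<cfset doc = jsoup.parse(articleHtml) />

<!--- "p" matches every paragraph; first() returns the first match, or null if there is none. --->
<cfset firstParagraph = doc.select("p").first() />

<!--- A Java null leaves the CFML variable undefined, so check before using it. --->
<cfif structKeyExists(variables, "firstParagraph")>
    <!--- text() returns the paragraph's text with the tags stripped out. --->
    <cfset articleText = firstParagraph.text() />

    <!--- Look for an image inside that paragraph only. --->
    <cfset imageSrc = "" />
    <cfset firstImage = firstParagraph.select("img").first() />
    <cfif structKeyExists(variables, "firstImage")>
        <cfset imageSrc = firstImage.attr("src") />
    </cfif>

    <!--- Output the two pieces in whatever order the layout needs. --->
    <cfoutput>
        <cfif len(imageSrc)><img src="#htmlEditFormat(imageSrc)#"></cfif>
        <p>#htmlEditFormat(articleText)#</p>
    </cfoutput>
</cfif>
```

Because the text and the image source end up in separate variables, it no longer matters whether the image came before or after the text in the original paragraph.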
Try to avoid using Regex for this task - believe me, I've tried and it's an absolute can of worms!
Hope this helps. Mikey.