
Extracting webpage information based on a template in Java

Tags:

java

text-extraction

named-entity-extraction

Right now I use Jsoup to extract certain information (not all the text) from some third-party webpages, and I do it periodically. This works fine until the HTML of one of the webpages changes. Each such change forces a change in the existing Java code, which is tedious because these webpages change very frequently. It also requires a programmer to fix the Java code. Here is an example of the HTML I am interested in on a webpage:

<div>
<p><strong>Score:</strong>2.5/5</p>
<p><strong>Director:</strong> Bryan Singer</p>
</div>
<div>some other info which I dont need</div>

Here is what I want to do: save this webpage (an HTML file) locally and create a template out of it, like:

<div>
<p><strong>Score:</strong>{MOVIE_RATING}</p>
<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>
</div>
<div>some other info which I dont need</div>

Along with the actual URLs of the webpages, these HTML templates will be the input to the Java program, which will find the location of these predefined keywords (e.g. {MOVIE_RATING}, {MOVIE_DIRECTOR}) and extract the values from the actual webpages.

This way I wouldn't have to modify the Java program every time a webpage changes. I would just save the webpage's HTML and replace the data with these keywords, and the rest would be taken care of by the program. For example, in the future the actual HTML code may look like this:

<div>
<div><b>Rating:</b>**1/2</div>
<div><i>Director:</i>Singer, Bryan</div>
</div>

and the corresponding template will look like this:

<div>
<div><b>Rating:</b>{MOVIE_RATING}</div>
<div><i>Director:</i>{MOVIE_DIRECTOR}</div>
</div>

Also, creating these kinds of templates can be done by a non-programmer, i.e. anyone who can edit a file.

Now the question is: how can I achieve this in Java, and is there any existing or better approach to this problem?

Note: While googling I found some research papers, but most of them require some prior learning data, and accuracy is also a matter of concern.


asked Mar 04 '13 12:03

vikasing

1 Answer

"The approach you gave is pretty much similar to Gilbert's except for the regex part. I don't want to step into the ugly regex world; I am planning to use the template approach for many other areas apart from movie info, e.g. prices, product specs extraction etc."

1. The template you describe is not actually a "template" in the normal sense of the word: a set of static content that is dumped to the output with a bunch of dynamic content inserted within it. Instead, it is the "reverse" of a template - it is a parsing pattern that is slurped up and discarded, leaving the desired parameters to be found.
2. Because your web pages change regularly, you don't want to hard-code the content to be parsed too precisely, but want to "zoom in" on its essential features, making the minimum of assumptions, i.e. you want to commit to literally matching key text such as "Rating:" and treat interleaving markup such as "<b/>" in a much more flexible manner - ignoring it and allowing it to change without breaking.
3. When you combine (1) and (2), you can give the result any name you like, but IT IS parsing using regular expressions, i.e. the template approach IS the parsing approach using a regular expression - they are one and the same. The question is: what form should the regular expression take?

3A. If you hand-code the parsing in Java, then the obvious answer is that the regular expression format should just be the java.util.regex format. Anything else is a development burden, is "non-standard", and will be hard to maintain.

3B. If you want to use an HTML-aware parser, then jsoup is a good solution. The problem is that you need more text/regular-expression handling and flexibility than jsoup seems to provide. It seems too locked into specific HTML tags and structures, and so breaks when pages change.

3C. You can use a much more powerful grammar-controlled general text parser such as ANTLR - a Backus-Naur-inspired grammar controls the parsing, and generator code is inserted to process the parsed data. Here, the grammar expressions can be very powerful indeed, with complex rules for how text is ordered on the page and how text fields and values relate to each other. That power is beyond your requirements because you are not processing a language. And there's no escaping the fact that you still need to describe the ugly bits to skip, such as markup tags. And wrestling with ANTLR for the first time involves an educational investment before you get a productivity payback.

3D. Is there a Java tool that just uses a simple template-type approach to give a simple answer? Well, a Google search doesn't give too much hope (https://www.google.com/search?q=java+template+based+parser&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a). I believe that any attempt to create such a beast will degenerate into either basic regex parsing or more advanced grammar-controlled parsing, because the basic requirements for matching/ignoring/replacing text drive the solution in those directions. Anything else would be too simple to actually work. Sorry for the negative view - it just reflects the problem space.

My vote is for (3A) as the simplest, most powerful and flexible solution to your needs.
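
To make point (3) concrete, here is a minimal sketch of how a {PLACEHOLDER} template could be compiled into a java.util.regex pattern. The class name TemplateExtractor and its extract method are hypothetical, not an existing library. The sketch rests on some loud assumptions: label text such as "Score:" is matched literally, any run of tags or whitespace in the template is treated as "any tags or whitespace may appear here", and each value is plain text that ends at the next tag. Numbered groups are used because Java named groups cannot contain underscores.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: compiles a {PLACEHOLDER} template into a regex. */
final class TemplateExtractor {
    // A placeholder is an upper-case name in braces, e.g. {MOVIE_RATING}.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\{([A-Z][A-Z0-9_]*)\\}");
    // A run of tags and/or whitespace inside the template.
    private static final Pattern MARKUP = Pattern.compile("(?:<[^>]*>|\\s)+");
    // What such a run becomes in the regex: any tags/whitespace, or none.
    private static final String FLEXIBLE_GAP = "(?:<[^>]*>|\\s)*";

    private final List<String> names = new ArrayList<>();
    private final Pattern pattern;

    TemplateExtractor(String template) {
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            appendLiteral(regex, template.substring(last, m.start()));
            names.add(m.group(1));
            regex.append("([^<]*)"); // assumption: a value is text up to the next tag
            last = m.end();
        }
        appendLiteral(regex, template.substring(last));
        pattern = Pattern.compile(regex.toString());
    }

    // Key text is quoted literally; markup runs become flexible gaps.
    private static void appendLiteral(StringBuilder regex, String literal) {
        Matcher m = MARKUP.matcher(literal);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                regex.append(Pattern.quote(literal.substring(last, m.start())));
            }
            regex.append(FLEXIBLE_GAP);
            last = m.end();
        }
        if (last < literal.length()) {
            regex.append(Pattern.quote(literal.substring(last)));
        }
    }

    /** Returns placeholder name -> trimmed value, or an empty map if the page does not match. */
    Map<String, String> extract(String html) {
        Map<String, String> values = new LinkedHashMap<>();
        Matcher m = pattern.matcher(html);
        if (m.find()) {
            for (int i = 0; i < names.size(); i++) {
                values.put(names.get(i), m.group(i + 1).trim());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String template = "<div>\n<p><strong>Score:</strong>{MOVIE_RATING}</p>\n"
                + "<p><strong>Director:</strong>{MOVIE_DIRECTOR}</p>\n</div>";
        // Same labels, different tags: the template still matches.
        String page = "<section><div><b>Score:</b> 2.5/5 </div>\n"
                + "<div><i>Director:</i> Bryan Singer</div></section>";
        System.out.println(new TemplateExtractor(template).extract(page));
        // prints {MOVIE_RATING=2.5/5, MOVIE_DIRECTOR=Bryan Singer}
    }
}
```

With these assumptions a change of tags alone does not break the match, but a change of label text (say "Score:" becoming "Rating:") still needs a new template, and the result is a regular expression either way.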


answered Oct 03 '22 10:10

Glen Best
