What is the fastest way to scrape an HTML webpage in Android?

Tags: html, android, web-scraping

I need to extract information from an unstructured web page in Android. The information I want is embedded in a table that doesn't have an id.

    <table>
      <tr><td>Description</td><td></td><td>I want this field next to the description cell</td></tr>
    </table>

Should I use pattern matching, or use a BufferedReader to extract the information?

Or is there a faster way to get that information?

asked Jun 04 '10 02:06

unj2

2 Answers

I think in this case it makes no sense to look for a fast way to extract the information, as there is virtually no performance difference between the methods already suggested in the answers when you compare it to the time it takes to download the HTML.

So assuming that by fastest you mean most convenient, readable and maintainable code, I suggest you use a DocumentBuilder to parse the relevant HTML and extract data using XPathExpressions:

    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(new InputSource(new StringReader(html)));
    XPathExpression xpath = XPathFactory.newInstance()
        .newXPath().compile("//td[text()=\"Description\"]/following-sibling::td[2]");
    String result = (String) xpath.evaluate(doc, XPathConstants.STRING);

If you happen to retrieve invalid HTML, I recommend isolating the relevant portion (e.g. using substring(indexOf("<table")..)) and, if necessary, correcting the remaining HTML errors with String operations before parsing. If this gets too complex, however (i.e. very bad HTML), just go with the hacky pattern matching approach suggested in other answers.
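As a rough sketch of the "isolate, then parse" idea, the following cuts the first table out of a page and runs the XPath expression from above on that fragment only. The class name, the helper isolateTable and the sample page are made up for illustration, and the sketch assumes the table itself is well-formed even if the rest of the page is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathTableExample {

    // Hypothetical helper: cuts the first <table>...</table> out of the page,
    // so that only this fragment has to be well-formed XML.
    static String isolateTable(String html) {
        int start = html.indexOf("<table");
        int end = html.indexOf("</table>", start);
        if (start < 0 || end < 0) {
            throw new IllegalArgumentException("no complete <table> found");
        }
        return html.substring(start, end + "</table>".length());
    }

    // Returns the text of the second cell after the "Description" cell.
    static String extractDescription(String html) {
        String table = isolateTable(html);
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(table)));
            XPathExpression xpath = XPathFactory.newInstance().newXPath()
                    .compile("//td[text()=\"Description\"]/following-sibling::td[2]");
            return (String) xpath.evaluate(doc, XPathConstants.STRING);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the table fragment", e);
        }
    }

    public static void main(String[] args) {
        // Made-up page: the unclosed <br> outside the table would break
        // an XML parser if the whole page were parsed.
        String html = "<html><body><p>Intro<br>"
                + "<table><tr><td>Description</td><td></td>"
                + "<td>I want this field</td></tr></table>"
                + "</body></html>";
        System.out.println(extractDescription(html)); // prints: I want this field
    }
}
```

If the table carries its own markup errors (unclosed cells, bare ampersands), the parse step will still fail, which is the point where the String fixes or the pattern matching fallback come in.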

Remarks

XPath has been available since API Level 8 (Android 2.2). If you develop for lower API levels, you can use DOM methods and conditionals to navigate to the node you want to extract.
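For the DOM-only route, something along these lines might do. It is a minimal sketch, not a tested Android snippet: the class and method names are invented, and it assumes the input is a well-formed table fragment like the one in the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTableExample {

    // Concatenates the direct text children of a node.
    static String text(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                sb.append(c.getNodeValue());
            }
        }
        return sb.toString();
    }

    // Returns the text of the second <td> after the cell whose text equals
    // label, or null if there is no such cell.
    static String cellAfter(String tableXml, String label) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(tableXml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
        NodeList cells = doc.getElementsByTagName("td");
        for (int i = 0; i < cells.getLength(); i++) {
            if (!label.equals(text(cells.item(i)))) {
                continue;
            }
            // Walk the siblings and count <td> elements, skipping text nodes.
            int seen = 0;
            for (Node n = cells.item(i).getNextSibling(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.ELEMENT_NODE && "td".equals(n.getNodeName())) {
                    seen++;
                    if (seen == 2) {
                        return text(n);
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String table = "<table><tr><td>Description</td><td></td>"
                + "<td>I want this field</td></tr></table>";
        System.out.println(cellAfter(table, "Description"));
    }
}
```

This is more verbose than the XPath one-liner, but it relies only on basic node navigation.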

answered Oct 08 '22 18:10

Josef Pfleger

The fastest way will be parsing the specific information yourself. You seem to know the HTML structure precisely beforehand. The BufferedReader, String and StringBuilder methods should suffice. Here's a kickoff example which displays the first paragraph of your own question:

    public static void main(String... args) throws Exception {
        URL url = new URL("http://stackoverflow.com/questions/2971155");
        BufferedReader reader = null;
        StringBuilder builder = new StringBuilder();
        try {
            reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
            for (String line; (line = reader.readLine()) != null;) {
                builder.append(line.trim());
            }
        } finally {
            if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
        }

        String start = "<div class=\"post-text\"><p>";
        String end = "</p>";
        String part = builder.substring(builder.indexOf(start) + start.length());
        String question = part.substring(0, part.indexOf(end));
        System.out.println(question);
    }
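The same indexOf/substring technique could be applied to the table from the question roughly as follows. This is only a sketch: the class and method names are made up, and it assumes the cells are written as plain <td> tags without attributes, which may not hold for the real page.

```java
class ManualTableParser {

    // Returns the content of the n-th <td> after the first "<td>label</td>"
    // cell, or null if the label or the requested cell is not found.
    static String cellAfter(String html, String label, int n) {
        String labelCell = "<td>" + label + "</td>";
        int pos = html.indexOf(labelCell);
        if (pos < 0) {
            return null;
        }
        pos += labelCell.length();

        String cell = null;
        for (int i = 0; i < n; i++) {
            int open = html.indexOf("<td>", pos);
            int close = html.indexOf("</td>", open);
            if (open < 0 || close < 0) {
                return null;
            }
            cell = html.substring(open + "<td>".length(), close);
            pos = close + "</td>".length();
        }
        return cell;
    }

    public static void main(String[] args) {
        String html = "<table>  <tr><td>Description</td><td></td>"
                + "<td>I want this field next to the description cell</td></tr>  </table>";
        System.out.println(cellAfter(html, "Description", 2));
    }
}
```

Because nothing here builds a document tree, it stays cheap, but any change in the markup (an added attribute, different casing) silently breaks it.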

Parsing is in practically all cases faster than pattern matching. Pattern matching is easier, but there is a certain risk that it may yield unexpected results, especially when using complex regex patterns.
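To illustrate both the convenience and the risk, here is what a pattern matching attempt at the table from the question might look like. The pattern and the class name are my own guess at a reasonable regex, not something taken from the page in question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTableExample {

    // "Description" cell, then one cell without nested tags, then the
    // captured cell. Nested markup inside a cell is NOT handled.
    private static final Pattern DESCRIPTION = Pattern.compile(
            "<td>\\s*Description\\s*</td>\\s*<td>[^<]*</td>\\s*<td>([^<]*)</td>",
            Pattern.CASE_INSENSITIVE);

    // Returns the trimmed text of the captured cell, or null if nothing matched.
    static String extractDescription(String html) {
        Matcher m = DESCRIPTION.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String ok = "<tr><td>Description</td><td></td><td>I want this field</td></tr>";
        String nested = "<tr><td>Description</td><td></td><td><b>I want this field</b></td></tr>";
        System.out.println(extractDescription(ok));
        System.out.println(extractDescription(nested));
    }
}
```

The second call finds nothing because of the nested <b> tag, which is the kind of unexpected result meant above.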

You can also consider using a more flexible third-party HTML parser instead of writing one yourself. It will not be as fast as parsing yourself with information known beforehand. It will, however, be more concise and flexible. With decent HTML parsers the difference in speed is pretty negligible. I strongly recommend Jsoup for this. It supports jQuery-like CSS selectors. Extracting the first paragraph of your question would then be as easy as:

    public static void main(String... args) throws Exception {
        Document document = Jsoup.connect("http://stackoverflow.com/questions/2971155").get();
        String question = document.select("#question .post-text p").first().text();
        System.out.println(question);
    }

It's unclear what web page you're talking about, so I can't give a more detailed example of how you could select the specific information from the specific page using Jsoup. If you still can't figure it out on your own using Jsoup and CSS selectors, then feel free to post the URL in a comment and I'll suggest how to do it.

answered Oct 08 '22 18:10

BalusC
