I want to develop an HTTP client in Java for a college project that logs in to a site, extracts data from the HTML, and fills in and submits forms. I don't know which HTTP library to use:
Apache HttpClient - doesn't create a DOM model, but handles HTTP redirects and multi-threading.
HttpUnit - creates a DOM model and makes it easy to work with forms, fields, tables, etc., but I don't know how well it handles multi-threading and proxy settings.
Any advice?
It sounds like you are trying to create a web-scraping application. For this purpose, I recommend the HtmlUnit library.
It makes it easy to work with forms, proxies, and data embedded in web pages. Under the hood I believe it uses Apache's HttpClient to handle HTTP requests, but that layer is probably too low-level for you to worry about.
With this library you can control a web page in Java the same way you would control it in a web browser: clicking a button, typing text, selecting values.
Here are some examples from HtmlUnit's getting started page:
Submitting a form:
@Test
public void submittingForm() throws Exception {
    final WebClient webClient = new WebClient();

    // Get the first page
    final HtmlPage page1 = webClient.getPage("http://some_url");

    // Get the form that we are dealing with and within that form,
    // find the submit button and the field that we want to change.
    final HtmlForm form = page1.getFormByName("myform");
    final HtmlSubmitInput button = form.getInputByName("submitbutton");
    final HtmlTextInput textField = form.getInputByName("userid");

    // Change the value of the text field
    textField.setValueAttribute("root");

    // Now submit the form by clicking the button and get back the second page.
    final HtmlPage page2 = button.click();

    webClient.closeAllWindows();
}
Using a proxy server:
@Test
public void homePage_proxy() throws Exception {
    final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_2, "http://myproxyserver", myProxyPort);

    // Set proxy username and password
    final DefaultCredentialsProvider credentialsProvider = (DefaultCredentialsProvider) webClient.getCredentialsProvider();
    credentialsProvider.addProxyCredentials("username", "password");

    final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
    assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());

    webClient.closeAllWindows();
}
The WebClient class is single-threaded, so every thread that deals with a web page will need its own WebClient instance.
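To illustrate the one-client-per-thread pattern, here is a minimal sketch using only the JDK: it submits two tasks to a thread pool, and each task constructs its own client before fetching a page (with HtmlUnit you would create a fresh WebClient inside the task in exactly the same way). The embedded com.sun.net.httpserver server just stands in for the site being scraped, so the example runs without network access.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadClients {
    public static void main(String[] args) throws Exception {
        // A tiny local server standing in for the site being scraped.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", exchange -> {
            byte[] body = "<html><title>ok</title></html>".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        String url = "http://localhost:" + server.getAddress().getPort() + "/";

        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<String>> pages = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            pages.add(pool.submit(() -> {
                // One client per task; never share a non-thread-safe
                // client (like HtmlUnit's WebClient) across threads.
                HttpClient client = HttpClient.newHttpClient();
                HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
                return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            }));
        }
        for (Future<String> page : pages) {
            System.out.println(page.get().contains("ok"));
        }
        pool.shutdown();
        server.stop(0);
    }
}
```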
Unless you need to process JavaScript or CSS, you can also disable these when you create the client:
WebClient client = new WebClient();
client.setJavaScriptEnabled(false);
client.setCssEnabled(false);
HttpUnit is meant for testing purposes; I don't think it is well suited to being embedded inside your application.
When you want to consume HTTP resources (like web pages) I'd recommend Apache HttpClient. But you may find this framework too low-level for your use case, which is web-page scraping, so I'd recommend an integration framework like Apache Camel for this purpose. For example, the following route reads a web page (using Apache HttpClient), transforms the HTML into well-formed markup (using TagSoup), and transforms the result into an XML representation for further processing.
from("http://mycollege.edu/somepage.html").unmarshal().tidyMarkup().to("xslt:mystylesheet.xsl")
You can further process the resulting XML using XPath, or transform it into a POJO using JAXB, for example.
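As a sketch of the XPath step, the JDK's built-in javax.xml.xpath API is enough once the markup is well-formed. The XML string below is a made-up example of what the tidying step might produce; the XPath expression and the table contents are purely illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExtract {
    public static void main(String[] args) throws Exception {
        // Hypothetical well-formed output of the tidying step.
        String xml = "<html><body>"
                + "<table><tr><td>CS101</td><td>Intro</td></tr>"
                + "<tr><td>CS202</td><td>Data Structures</td></tr></table>"
                + "</body></html>";

        // Parse the cleaned-up markup into a DOM document.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Pull the first cell of the second table row.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String course = xpath.evaluate("//tr[2]/td[1]", doc);
        System.out.println(course);
    }
}
```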
HttpUnit is for unit testing. Unless you mean a "testing client", I don't think it's appropriate for building an application.
I wand to develop http client in Java
You realize, of course, that the Apache HTTP client is not your answer either. You sound like you want to create your first web app.
You'll need servlets and JSPs. Get Apache's Tomcat and learn enough JSP and JSTL to do what you need. Don't bother with frameworks, since this is your first.
When you have it running, then try a framework like Spring.