I need to transfer big files (at least 14MB) from the Cosmos instance of the FIWARE Lab to my backend.
I used the Spring RestTemplate as a client interface for the Hadoop WebHDFS REST API described here but I run into an IO Exception:
Exception in thread "main" org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://cosmos.lab.fiware.org:14000/webhdfs/v1/user/<user.name>/<path>?op=open&user.name=<user.name>":Truncated chunk ( expected size: 14744230; actual size: 11285103); nested exception is org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 14744230; actual size: 11285103)
at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:580)
at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:545)
at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:466)
This is the actual code that generates the Exception:
RestTemplate restTemplate = new RestTemplate();
restTemplate.setRequestFactory(new HttpComponentsClientHttpRequestFactory());
restTemplate.getMessageConverters().add(new ByteArrayHttpMessageConverter());
// headers is defined earlier (not shown)
HttpEntity<?> entity = new HttpEntity<>(headers);
UriComponentsBuilder builder =
    UriComponentsBuilder.fromHttpUrl(hdfs_path)
        .queryParam("op", "OPEN")
        .queryParam("user.name", user_name);
// The whole response body is buffered in memory as a byte[]
ResponseEntity<byte[]> response =
    restTemplate.exchange(builder.build().encode().toUri(), HttpMethod.GET, entity, byte[].class);
// try-with-resources closes the stream even if the write fails
try (FileOutputStream output = new FileOutputStream(new File(local_path))) {
    IOUtils.write(response.getBody(), output);
}
I think this is due to a transfer timeout on the Cosmos instance, so I tried to send a curl request on the path specifying the offset, buffer and length parameters, but they seem to be ignored: I got the whole file.
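For reference, this is roughly what ranged reads would look like on a server that honors them. It is only a sketch of the request shape: `offset` and `length` are the query parameter names from the WebHDFS REST API documentation, while the class name `WebHdfsRanges`, the method `chunkUrls` and the chunk size are hypothetical. On this Cosmos instance the parameters seemed to be ignored, so it may not help here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; class and method names are illustrative only.
final class WebHdfsRanges {

    // Builds one op=OPEN URL per chunk, using the offset and length query
    // parameters so that each request covers at most chunkSize bytes.
    static List<String> chunkUrls(String fileUrl, String userName,
                                  long fileSize, long chunkSize) {
        if (fileSize < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException(
                    "fileSize must be >= 0 and chunkSize must be > 0");
        }
        String user = URLEncoder.encode(userName, StandardCharsets.UTF_8);
        List<String> urls = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += chunkSize) {
            // The last chunk may be shorter than chunkSize.
            long length = Math.min(chunkSize, fileSize - offset);
            urls.add(fileUrl + "?op=OPEN"
                    + "&user.name=" + user
                    + "&offset=" + offset
                    + "&length=" + length);
        }
        return urls;
    }
}
```

Each URL could then be fetched in turn and the responses appended to the local file in order, so that no single request has to stay open for the whole transfer.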
Thanks in advance.
WebHDFS provides web services access to data stored in HDFS. At the same time, it retains the security the native Hadoop protocol offers and uses parallelism, for better throughput. To enable WebHDFS (REST API) in the name node and data nodes, you must set the value of dfs. webhdfs.
HttpFS is a server that provides a REST HTTP gateway supporting all HDFS file system operations (read and write), and it is interoperable with the WebHDFS REST HTTP API.
OK, I found a solution. I don't understand why, but the transfer succeeds if I use a Jetty HttpClient instead of RestTemplate (and therefore Apache HttpClient). This works now:
ContentExchange exchange = new ContentExchange(true) {
    // Accumulates the response body in memory as it arrives
    ByteArrayOutputStream bos = new ByteArrayOutputStream();

    @Override
    protected void onResponseContent(Buffer content) throws IOException {
        bos.write(content.asArray(), 0, content.length());
    }

    @Override
    protected void onResponseComplete() throws IOException {
        // Write the file only if the server answered 200 OK
        if (getResponseStatus() == HttpStatus.OK_200) {
            try (FileOutputStream output = new FileOutputStream(new File(<local_path>))) {
                IOUtils.write(bos.toByteArray(), output);
            }
        }
    }
};
UriComponentsBuilder builder = UriComponentsBuilder.fromHttpUrl(<hdfs_path>)
    .queryParam("op", "OPEN")
    .queryParam("user.name", <user_name>);
exchange.setURL(builder.build().encode().toUriString());
exchange.setMethod("GET");
exchange.setRequestHeader("X-Auth-Token", <token>);
HttpClient client = new HttpClient();
client.setConnectorType(HttpClient.CONNECTOR_SELECT_CHANNEL);
client.setMaxConnectionsPerAddress(200);
client.setThreadPool(new QueuedThreadPool(250));
client.start();
client.send(exchange);
// Block until the exchange has finished
exchange.waitForDone();
Is there any known bug in Apache HttpClient for chunked file transfers?
Was I doing something wrong in my RestTemplate request?
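One thing both snippets have in common is that they hold the whole response body in memory before writing it out. A minimal sketch of a streaming alternative, using only JDK classes, is below. It copies the body to disk in small buffers and fails loudly if fewer bytes arrive than the server announced. The names `StreamingDownload` and `copyToFile` are hypothetical, and this does not stop the server from closing the connection early; it only makes the truncation explicit and keeps memory use flat.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; class and method names are illustrative only.
final class StreamingDownload {

    // Copies body to target in 8 KiB buffers and returns the byte count.
    // Pass a negative expectedLength to skip the truncation check.
    static long copyToFile(InputStream body, Path target, long expectedLength)
            throws IOException {
        long total = 0;
        byte[] buffer = new byte[8192];
        try (OutputStream out = Files.newOutputStream(target)) {
            int n;
            while ((n = body.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
            }
        }
        if (expectedLength >= 0 && total != expectedLength) {
            throw new IOException("Truncated transfer: expected "
                    + expectedLength + " bytes, received " + total);
        }
        return total;
    }
}
```

With `HttpURLConnection`, the arguments could come from `getInputStream()` and `getContentLengthLong()`. With RestTemplate, the same method could be called from a `ResponseExtractor` passed to `RestTemplate.execute`, which exposes the body as an `InputStream` through `ClientHttpResponse.getBody()`.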
After a few tests I see that I have not solved my problem.
I found out that the Hadoop version installed on the Cosmos instance is quite old (Hadoop 0.20.2-cdh3u6), and I read that WebHDFS in that version does not support partial file transfer with the length parameter (introduced in v0.23.3).
These are the headers I received from the server when I sent a GET request using curl:
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: HEAD, POST, GET, OPTIONS, DELETE
Access-Control-Allow-Headers: origin, content-type, X-Auth-Token, Tenant-ID, Authorization
server: Apache-Coyote/1.1
set-cookie: hadoop.auth="u=<user>&p=<user>&t=simple&e=1448999699735&s=rhxMPyR1teP/bIJLfjOLWvW2pIQ="; Version=1; Path=/
Content-Type: application/octet-stream; charset=utf-8
content-length: 172934567
date: Tue, 01 Dec 2015 09:54:59 GMT
connection: close
As you can see, the Connection header is set to close. In practice, the connection is usually closed whenever the GET request lasts more than 120 seconds, even if the file transfer has not completed.
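To put numbers on that: if the 120 second cutoff is a hard limit (an assumption based on what I observed, not something documented), the 172934567 bytes announced in content-length would need a sustained rate of about 1.44 MB/s (roughly 11.5 Mbit/s) to finish in one request. A small sketch of the arithmetic, with the hypothetical names `TransferBudget` and `minBytesPerSecond`:

```java
// Hypothetical helper; class and method names are illustrative only.
final class TransferBudget {

    // Smallest whole number of bytes per second needed to move
    // contentLength bytes within timeoutSeconds (ceiling division).
    static long minBytesPerSecond(long contentLength, long timeoutSeconds) {
        if (contentLength < 0 || timeoutSeconds <= 0) {
            throw new IllegalArgumentException(
                    "contentLength must be >= 0 and timeoutSeconds must be > 0");
        }
        return (contentLength + timeoutSeconds - 1) / timeoutSeconds;
    }
}
```

Calling `TransferBudget.minBytesPerSecond(172934567L, 120)` gives 1441122 bytes per second, so any link slower than that would hit the cutoff before the file is complete.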
In conclusion, I can say that Cosmos is totally useless if it doesn't support large file transfer.
Please correct me if I'm wrong, or if you know a workaround.