
How can I read and transfer chunks of a file with Hadoop WebHDFS?

I need to transfer big files (at least 14MB) from the Cosmos instance of the FIWARE Lab to my backend.

I used the Spring RestTemplate as a client interface for the Hadoop WebHDFS REST API described here, but I ran into an IOException:

Exception in thread "main" org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://cosmos.lab.fiware.org:14000/webhdfs/v1/user/<user.name>/<path>?op=open&user.name=<user.name>":Truncated chunk ( expected size: 14744230; actual size: 11285103); nested exception is org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 14744230; actual size: 11285103)
    at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:580)
    at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:545)
    at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:466)

This is the actual code that generates the Exception:

RestTemplate restTemplate = new RestTemplate();
restTemplate.setRequestFactory(new HttpComponentsClientHttpRequestFactory());
restTemplate.getMessageConverters().add(new ByteArrayHttpMessageConverter());

// headers holds the request headers (among them the Cosmos X-Auth-Token)
HttpEntity<?> entity = new HttpEntity<>(headers);

UriComponentsBuilder builder =
    UriComponentsBuilder.fromHttpUrl(hdfs_path)
        .queryParam("op", "OPEN")
        .queryParam("user.name", user_name);

ResponseEntity<byte[]> response =
    restTemplate
        .exchange(builder.build().encode().toUri(), HttpMethod.GET, entity, byte[].class);

// the whole response body is buffered in memory and then written to disk
FileOutputStream output = new FileOutputStream(new File(local_path));
IOUtils.write(response.getBody(), output);
output.close();

I think this is due to a transfer timeout on the Cosmos instance, so I tried sending a curl request to the same path with the offset, buffer and length parameters specified, but they seem to be ignored: I got the whole file back.
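
For reference, a ranged OPEN request with the documented offset, length and buffersize query parameters would look roughly like this when built on top of the code above (just a sketch; the chunk size is arbitrary):

// Sketch: same GET as above, but asking only for a slice of the file.
// offset, length and buffersize are documented WebHDFS OPEN parameters,
// although on this Cosmos instance they appear to be ignored.
UriComponentsBuilder rangedBuilder =
    UriComponentsBuilder.fromHttpUrl(hdfs_path)
        .queryParam("op", "OPEN")
        .queryParam("user.name", user_name)
        .queryParam("offset", 0)               // start byte
        .queryParam("length", 1024 * 1024)     // ask for ~1 MB
        .queryParam("buffersize", 4096);

ResponseEntity<byte[]> chunk =
    restTemplate.exchange(rangedBuilder.build().encode().toUri(),
                          HttpMethod.GET, entity, byte[].class);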

Thanks in advance.

Asked Nov 28 '15 by Andrea Sassi



1 Answer

OK, I found a solution. I don't understand why, but the transfer succeeds if I use a Jetty HttpClient instead of the RestTemplate (and thus Apache HttpClient). This works now:

// Jetty 8 client API, plus Commons IO and Spring's UriComponentsBuilder
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.commons.io.IOUtils;
import org.eclipse.jetty.client.ContentExchange;
import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.http.HttpStatus;
import org.eclipse.jetty.io.Buffer;
import org.eclipse.jetty.util.thread.QueuedThreadPool;
import org.springframework.web.util.UriComponentsBuilder;

// Collect the response body in memory and write it to disk once the exchange completes
ContentExchange exchange = new ContentExchange(true) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();

    @Override
    protected void onResponseContent(Buffer content) throws IOException {
        bos.write(content.asArray(), 0, content.length());
    }

    @Override
    protected void onResponseComplete() throws IOException {
        if (getResponseStatus() == HttpStatus.OK_200) {
            FileOutputStream output = new FileOutputStream(new File(<local_path>));
            IOUtils.write(bos.toByteArray(), output);
            output.close();
        }
    }
};

UriComponentsBuilder builder = UriComponentsBuilder.fromHttpUrl(<hdfs_path>)
        .queryParam("op", "OPEN")
        .queryParam("user.name", <user_name>);

exchange.setURL(builder.build().encode().toUriString());
exchange.setMethod("GET");
exchange.setRequestHeader("X-Auth-Token", <token>);

// HttpClient with the non-blocking (select channel) connector and its own thread pool
HttpClient client = new HttpClient();
client.setConnectorType(HttpClient.CONNECTOR_SELECT_CHANNEL);
client.setMaxConnectionsPerAddress(200);
client.setThreadPool(new QueuedThreadPool(250));
client.start();
client.send(exchange);
exchange.waitForDone();

Is there any known bug in the Apache HttpClient regarding chunked file transfers?

Was I doing something wrong in my RestTemplate request?
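
For comparison, a streaming variant of the original RestTemplate call (a longer read timeout on the request factory, and the body copied straight to disk instead of being buffered as a byte[]) would look roughly like this. It is just a sketch: token and local_path are placeholders, it needs org.springframework.web.client.RequestCallback/ResponseExtractor and org.springframework.util.StreamUtils, and there is no guarantee it avoids the truncation.

// Sketch only: stream the WebHDFS response to disk with a longer read timeout,
// instead of buffering the whole file as a byte[]
HttpComponentsClientHttpRequestFactory factory = new HttpComponentsClientHttpRequestFactory();
factory.setConnectTimeout(10000);    // ms
factory.setReadTimeout(300000);      // ms, well above the default

RestTemplate streamingTemplate = new RestTemplate(factory);

RequestCallback requestCallback = request ->
    request.getHeaders().set("X-Auth-Token", token);    // token is a placeholder

ResponseExtractor<Void> responseExtractor = response -> {
    // copy the response InputStream to the file as it arrives
    try (FileOutputStream out = new FileOutputStream(new File(local_path))) {
        StreamUtils.copy(response.getBody(), out);
    }
    return null;
};

URI uri = UriComponentsBuilder.fromHttpUrl(hdfs_path)
        .queryParam("op", "OPEN")
        .queryParam("user.name", user_name)
        .build().encode().toUri();

streamingTemplate.execute(uri, HttpMethod.GET, requestCallback, responseExtractor);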

UPDATE: I still don't have a solution

After a few tests I see that my problem is not solved. I found out that the Hadoop version installed on the Cosmos instance is quite old, Hadoop 0.20.2-cdh3u6, and I read that WebHDFS doesn't support partial file transfers with the length parameter (which was introduced in v0.23.3). These are the headers I received from the server when I sent a GET request using curl:

Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: HEAD, POST, GET, OPTIONS, DELETE
Access-Control-Allow-Headers: origin, content-type, X-Auth-Token, Tenant-ID, Authorization
server: Apache-Coyote/1.1
set-cookie: hadoop.auth="u=<user>&p=<user>&t=simple&e=1448999699735&s=rhxMPyR1teP/bIJLfjOLWvW2pIQ="; Version=1; Path=/
Content-Type: application/octet-stream; charset=utf-8
content-length: 172934567
date: Tue, 01 Dec 2015 09:54:59 GMT
connection: close

As you can see, the Connection header is set to close. In fact, the connection usually gets closed whenever the GET request lasts more than 120 seconds, even though the file transfer has not been completed.
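
If the platform honoured offset and length (i.e. Hadoop 0.23.3 or later), the obvious workaround would be to pull the file in chunks, so that no single request comes near the 120-second limit. A hypothetical sketch, reusing the RestTemplate, entity and placeholder variables from the question (it cannot work against the Hadoop 0.20.2-cdh3u6 running on Cosmos, since the parameters are ignored):

// Hypothetical chunked download against a WebHDFS that honours offset/length.
// fileSize could be taken from the content-length header above or from a
// GETFILESTATUS call; CHUNK is sized so each request finishes well within
// the 120-second window.
long CHUNK = 8L * 1024 * 1024;      // 8 MB per request
long fileSize = 172934567L;         // e.g. the content-length shown above

try (FileOutputStream out = new FileOutputStream(new File(local_path))) {
    for (long offset = 0; offset < fileSize; offset += CHUNK) {
        URI uri = UriComponentsBuilder.fromHttpUrl(hdfs_path)
                .queryParam("op", "OPEN")
                .queryParam("user.name", user_name)
                .queryParam("offset", offset)
                .queryParam("length", Math.min(CHUNK, fileSize - offset))
                .build().encode().toUri();

        ResponseEntity<byte[]> part =
                restTemplate.exchange(uri, HttpMethod.GET, entity, byte[].class);
        out.write(part.getBody());
    }
}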

In conclusion, I can say that Cosmos is totally useless if it doesn't support large file transfer.

Please correct me if I'm wrong, or if you know a workaround.

Answered Oct 08 '22 by Andrea Sassi