
Writing data to Hadoop

Tags: hadoop, hdfs

I need to write data into Hadoop (HDFS) from external sources like a Windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. In my browsing of the code I didn't see an API for doing this. I am hoping someone can show me that I am wrong and that there is an easy way to code external clients against HDFS.

asked Oct 07 '09 by Steve Severance




2 Answers

There is an API in Java. You can use it by including the Hadoop code in your project. The JavaDoc is quite helpful in general, but of course you have to know what you are looking for *g*: http://hadoop.apache.org/common/docs/

For your particular problem, have a look at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html (this applies to the latest release; consult the JavaDocs for other versions!)

A typical call would be: FileSystem.get(new JobConf()).create(new Path("however.file")); which returns a stream you can handle with regular Java I/O.
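To make that concrete, here is a minimal sketch of an external client writing a file through the FileSystem API. This is my own illustration, not code from the answer: the NameNode URI, target path, and payload are placeholder assumptions, and it presumes the Hadoop client jars (plus either the cluster's config files or an explicit fs.default.name) are on the classpath.

import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; point this at your cluster.
        conf.set("fs.default.name", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        // create() returns a stream you can use with regular Java I/O.
        OutputStream out = fs.create(new Path("/user/steve/however.file"));
        try {
            out.write("hello hdfs".getBytes("UTF-8"));
        } finally {
            out.close();
            fs.close();
        }
    }
}

The answer uses new JobConf(), which also works since JobConf extends Configuration; a plain Configuration is enough when you only need filesystem access.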

answered Sep 20 '22 by Peter Wippermann


For the problem of loading the data I needed to put into HDFS, I chose to turn the problem around.

Instead of uploading the files to HDFS from the server where they resided, I wrote a Java Map/Reduce job where the mapper read each file from the file server (in this case via HTTPS), then wrote it directly to HDFS (via the Java API).

The list of files is read from the input. I then have an external script that populates a file with the list of files to fetch, uploads that file into HDFS (using hadoop dfs -put), then starts the map/reduce job with a decent number of mappers.

This gives me excellent transfer performance, since multiple files are read/written at the same time.
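As a rough illustration of that approach (my own sketch, not Erik's actual code): a map-only job where each input line is a URL; the mapper fetches it over HTTPS and streams the bytes straight into HDFS. The class name, target directory, and file naming are assumptions.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each input record is one URL to fetch; the mapper copies it into HDFS.
public class FetchToHdfsMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String url = value.toString().trim();
        if (url.isEmpty()) {
            return;
        }

        // Assumed target layout: /ingest/<file name taken from the URL>.
        String name = url.substring(url.lastIndexOf('/') + 1);
        Path target = new Path("/ingest/" + name);

        FileSystem fs = FileSystem.get(context.getConfiguration());
        InputStream in = new URL(url).openStream();
        OutputStream out = fs.create(target);
        try {
            byte[] buf = new byte[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            out.close();
            in.close();
        }

        context.write(value, NullWritable.get());
    }
}

In the driver you would set the number of reduce tasks to 0 and control how many mappers run by how the URL list is split (for example with NLineInputFormat), which is what gives the parallel transfers described above.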

Maybe not the answer you were looking for, but hopefully helpful anyway :-).

answered Sep 16 '22 by Erik Forsberg