I am looking for options to write data directly into HDFS using Python, without first storing the data on the local node and then using copyFromLocal.
I would like to treat an HDFS file like a local file and call a write method with a line as the argument, something like the following:
hdfs_file = hdfs.create("file_tmp")
hdfs_file.write("Hello world\n")
Does anything exist that supports the use case described above?
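One option is the HdfsCLI package (pip install hdfs), which exposes a file-like writer over WebHDFS. A minimal sketch, assuming WebHDFS is enabled on the cluster; the namenode URL and user are placeholders:

from hdfs import InsecureClient

# Connect over WebHDFS; the URL and user below are placeholders for your cluster.
client = InsecureClient("http://namenode:50070", user="hdfs")

# client.write returns a context manager wrapping a file-like writer,
# so lines can be written directly to HDFS without a local copy.
with client.write("file_tmp", encoding="utf-8") as writer:
    writer.write("Hello world\n")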
Hadoop Streaming is a utility in the Hadoop ecosystem that lets users run a MapReduce job with executable scripts as the mapper and reducer. It is often confused with real-time streaming, but it is simply a utility that runs executable scripts within the MapReduce framework.
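For illustration, a streaming job invocation typically looks something like the following; the jar path, input/output paths, and script names are placeholders that vary by installation:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py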
To copy a file from the local file system to HDFS, use hadoop fs -put or hdfs dfs -put. With the put command, specify the local file path you want to copy from, followed by the HDFS path you want to copy to. If the file already exists on HDFS, you will get an error message saying “File already exists”.
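For example (both paths here are placeholders):

hdfs dfs -put /home/user/sample.txt /user/hadoop/sample.txt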
Step 1: Make a directory in HDFS where you want to copy the file.
Step 2: Use the copyFromLocal command to copy the file into the HDFS /Hadoop_File directory.
Step 3: Check whether the file was copied successfully by listing that directory. The three commands are sketched below.
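Assuming a local file named sample.txt (a placeholder name), the steps might look like this:

hdfs dfs -mkdir /Hadoop_File
hdfs dfs -copyFromLocal sample.txt /Hadoop_File
hdfs dfs -ls /Hadoop_File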
I'm not sure about a Python HDFS library, but you can always stream via the hadoop fs -put command, using '-' as the source filename to denote copying from stdin:
hadoop fs -put - /path/to/file/in/hdfs.txt
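From Python, you can drive the same command by piping into its stdin. This is a minimal sketch; the destination path is the placeholder from the command above:

import subprocess

# Launch 'hadoop fs -put - <dest>'; the '-' tells put to read from stdin.
proc = subprocess.Popen(
    ["hadoop", "fs", "-put", "-", "/path/to/file/in/hdfs.txt"],
    stdin=subprocess.PIPE,
)

# Whatever is written to the pipe streams straight into HDFS,
# with no intermediate copy on the local disk.
proc.stdin.write(b"Hello world\n")
proc.stdin.close()
proc.wait()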