
How to save a file in hadoop with python

Tags:

python

hadoop

Question:

I am starting to learn Hadoop, and I need to save a lot of files into it using Python. I cannot seem to figure out what I am doing wrong. Can anyone help me with this?

Below is my code. I think the hdfs_path is correct, as I didn't change it in the settings while installing. The pythonfile.txt is on my desktop (as is the Python code I'm running through the command line).

Code:

import hadoopy
import os
hdfs_path = 'hdfs://localhost:9000/python'

def main():
    hadoopy.writetb(hdfs_path, [('pythonfile.txt',open('pythonfile.txt').read())])

main()

Output: when I run the above code, all I get is the following at /python itself.

iMac-van-Brian:desktop Brian$ $HADOOP_HOME/bin/hadoop dfs -ls /python

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

14/10/28 11:30:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   1 Brian supergroup        236 2014-10-28 11:30 /python
asked Oct 28 '14 by user3671459

2 Answers

This is a pretty typical task for the subprocess module. The solution looks like this:

put = Popen(["hadoop", "fs", "-put", <path/to/file>, <path/to/hdfs/file>], stdin=PIPE, bufsize=-1)
put.communicate()
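If you want the script to fail loudly when the upload doesn't succeed, you can also check the exit code; a minimal sketch (the local and HDFS paths below are made-up placeholders, and the hadoop CLI is assumed to be on the PATH):

from subprocess import PIPE, Popen

# hypothetical paths -- replace with your own local file and HDFS destination
local_path = "saved_file.csv"
hdfs_path = "/user/someuser/saved_file.csv"

put = Popen(["hadoop", "fs", "-put", local_path, hdfs_path], stdin=PIPE, bufsize=-1)
put.communicate()

# Popen does not raise on failure, so check the exit code explicitly
if put.returncode != 0:
    raise RuntimeError("hadoop fs -put exited with code {}".format(put.returncode))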

Full Example

Let's assume you're on a server and have an authenticated connection to HDFS (e.g., you've already obtained a Kerberos ticket from your keytab).

You just created a CSV from a pandas.DataFrame and want to put it into HDFS.

You can then upload the file to hdfs as follows:

import os
import pandas as pd
from subprocess import PIPE, Popen

# define path to saved file
file_name = "saved_file.csv"

# create a pandas.DataFrame
sales = {'account': ['Jones LLC', 'Alpha Co', 'Blue Inc'], 'Jan': [150, 200, 50]}
df = pd.DataFrame.from_dict(sales)

# save your pandas.DataFrame to csv (this could be anything, not necessarily a pandas.DataFrame)
df.to_csv(file_name)

# create path to your username on hdfs
hdfs_path = os.path.join(os.sep, 'user', '<your-user-name>', file_name)

# put csv into hdfs
put = Popen(["hadoop", "fs", "-put", file_name, hdfs_path], stdin=PIPE, bufsize=-1)
put.communicate()

The CSV file will then exist at /user/<your-user-name>/saved_file.csv.
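If you want to confirm the upload from the same script, one optional check (reusing Popen and hdfs_path from the example above) is to list the destination:

# optional: list the uploaded file to confirm it landed where expected
ls = Popen(["hadoop", "fs", "-ls", hdfs_path], stdin=PIPE, bufsize=-1)
ls.communicate()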

Note: if you created this file from a Python script called in Hadoop, the intermediate CSV file may be stored on some random nodes. Since this file is (presumably) no longer needed, it's best practice to remove it so as not to pollute the nodes every time the script is called. You can simply add os.remove(file_name) as the last line of the above script to solve this issue.
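For example, if you want the local CSV removed even when the upload fails, one possible variant (reusing the names defined in the example above) is to wrap the put in try/finally:

# clean up the intermediate local csv whether or not the upload succeeds
try:
    put = Popen(["hadoop", "fs", "-put", file_name, hdfs_path], stdin=PIPE, bufsize=-1)
    put.communicate()
finally:
    os.remove(file_name)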

answered by Jared Wilber

I have a feeling that you're writing into a file called /python, while you intend it to be the directory in which the file is stored.

What does

hdfs dfs -cat /python

show you?

If it shows the file contents, all you need to do is edit your hdfs_path to include the file name (you should delete /python first with -rm); a sketch of that corrected call appears after the pydoop example below. Otherwise, use pydoop (pip install pydoop) and do this:

import pydoop.hdfs as hdfs

from_path = '/tmp/infile.txt'
to_path = 'hdfs://localhost:9000/python/outfile.txt'
hdfs.put(from_path, to_path)
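If you'd rather stick with hadoopy, the same fix applies there: point the path at a file inside /python rather than at /python itself. A sketch, assuming the setup from the question (localhost:9000, pythonfile.txt in the working directory), after removing the existing /python file with hdfs dfs -rm /python:

import hadoopy

# write to a file inside the /python directory instead of to /python itself
hdfs_path = 'hdfs://localhost:9000/python/pythonfile.txt'
hadoopy.writetb(hdfs_path, [('pythonfile.txt', open('pythonfile.txt').read())])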
answered by Legato