Python write to hdfs file

What is the best way to create/write/update a file in remote HDFS from local python script?

I am able to list files and directories but writing seems to be a problem.

I have looked at the hdfs and snakebite libraries, but neither gives a clean way to do this.

asked Dec 21 '17 by nishant

People also ask

Can pandas write to HDFS?

The use case is simple. We need to write the contents of a Pandas DataFrame to Hadoop's distributed filesystem, known as HDFS. We can call this work an HDFS Writer Micro-service, for example. In our case we can make it a tiny bit more complex (and realistic) by adding a Kerberos security requirement.

How do I use Pyspark to write in HDFS?

You can try using the underlying Java classes available through the SparkSession (tested in Spark 3.1, but this should also work for Spark 2). The dataStream.write() method takes bytes, so it can be used to write arbitrary binary data, or you can find methods that take other types here.
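For the common case of a DataFrame, you don't need the Java classes at all: Spark's own writers go through the cluster's configured Hadoop filesystem. A minimal sketch, assuming a running Spark cluster; the namenode host/port and output path are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-writer").getOrCreate()
df = spark.createDataFrame(
    [("foo", 1), ("bar", 2)],
    ["name", "weight"],
)

# Spark writes through the configured Hadoop filesystem, so a plain path
# (or an explicit hdfs:// URI, as here) lands on HDFS.
df.write.mode("overwrite").parquet("hdfs://namenode:8020/user/ann/records")
```

Reading it back is symmetric: `spark.read.parquet("hdfs://namenode:8020/user/ann/records")`.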


2 Answers

Try the hdfs library, it's really good. You can use write(): https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write

Example:

To create a connection:

from hdfs import InsecureClient
client = InsecureClient('http://host:port', user='ann')

from json import dump, dumps
records = [
  {'name': 'foo', 'weight': 1},
  {'name': 'bar', 'weight': 2},
]

# As a context manager:
with client.write('data/records.jsonl', encoding='utf-8') as writer:
  dump(records, writer)

# Or, passing in a serialized string directly (overwrite=True because the
# file was already created above):
client.write('data/records.jsonl', data=dumps(records), encoding='utf-8', overwrite=True)
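Note that dump(records, writer) writes the whole list as a single JSON array, despite the .jsonl extension. If you want true JSON Lines (one object per line), serialize each record yourself before handing the string to write(); a minimal sketch of that serialization step:

```python
from json import dumps, loads

records = [
    {'name': 'foo', 'weight': 1},
    {'name': 'bar', 'weight': 2},
]

# One JSON object per line -- the JSON Lines format.
jsonl = '\n'.join(dumps(record) for record in records)

# The string can then be handed to the client as-is, e.g.:
# client.write('data/records.jsonl', data=jsonl, encoding='utf-8', overwrite=True)
```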

For CSV you can do

import pandas as pd
df = pd.read_csv("file.csv")
with client.write('path/output.csv', encoding='utf-8') as writer:
  df.to_csv(writer)
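The same client also supports reading, so you can round-trip the CSV back into pandas. The serialization is shown below against an in-memory buffer so the snippet is self-contained; against a real cluster you would swap the buffer for client.read(...) as sketched in the comments:

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['foo', 'bar'], 'weight': [1, 2]})

# Serialize to CSV text -- the same bytes df.to_csv(writer) would send to HDFS.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# Parse it back into a DataFrame, as you would from HDFS with:
# with client.read('path/output.csv', encoding='utf-8') as reader:
#     df2 = pd.read_csv(reader)
df2 = pd.read_csv(buf)
```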
answered Oct 11 '22 by Andy_101


What's wrong with the other answers

They use WebHDFS, which is not enabled by default, and is insecure without Kerberos or Apache Knox.

This is what the upload function of that hdfs library you linked to uses.

Native (more secure) ways to write to HDFS using Python

You can use pyspark.

Example - How to write pyspark dataframe to HDFS and then how to read it back into dataframe?


snakebite has been mentioned, but it doesn't write files


pyarrow has a FileSystem.open() function that should be able to write to HDFS as well, though I've not tried it.

answered Oct 11 '22 by OneCricketeer