Get a list of file names from HDFS using python

Tags: python, hadoop

Hadoop noob here.

I've searched for tutorials on getting started with Hadoop and Python without much success. I don't need to do any work with mappers and reducers yet; this is more of an access issue.

As part of a Hadoop cluster, there are a bunch of .dat files on HDFS.

In order to access those files from my client (local computer) using Python:

What do I need to have on my computer?

How do I query for filenames on HDFS?

Any links would be helpful too.

Asked Sep 03 '15 by Raaj

People also ask

How do I list files in HDFS?

Use the hdfs dfs -ls command, passing the path of the directory you want to list.

How do I list all files in HDFS and size?

You can use the hadoop fs -ls command to list files in a directory along with their details. The fifth column of the output contains the file size in bytes.

Which command is used to list the contents in the HDFS?

ls: lists the files and directories under a given path in HDFS, similar to the Unix ls command. For a recursive listing of directories and files, use ls -R (the older -lsr form is deprecated).
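
As a quick illustration of that recursive listing from Python, here is a minimal, untested sketch using the standard subprocess module; /somedirectory is a placeholder path and the hdfs client is assumed to be on your PATH:

from subprocess import run

# Run "hdfs dfs -ls -R" and collect the listed paths
result = run(['hdfs', 'dfs', '-ls', '-R', '/somedirectory'],
             capture_output=True, text=True, check=True)

# Each listing line ends with the full path; skip blanks and the "Found N items" header
paths = [line.rsplit(None, 1)[-1]
         for line in result.stdout.splitlines()
         if line and not line.startswith('Found')]
print(paths)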


2 Answers

As far as I've been able to tell, there is no out-of-the-box solution for this, and most answers I've found resort to calling the hdfs command. I'm running on Linux and have the same challenge. I've found the sh package useful: it handles running OS commands for you and manages stdin/stdout/stderr.

See here for more info on it: https://amoffat.github.io/sh/

Not the neatest solution, but it gets the job done in one line (ish) using a readily available package.

Here's my cut-down code to grab an HDFS directory listing. It lists files and folders alike, so you may need to modify it if you want to differentiate between them (see the expanded sketch after the breakdown below).

import sh
hdfsdir = '/somedirectory'
filelist = [ line.rsplit(None,1)[-1] for line in sh.hdfs('dfs','-ls',hdfsdir).split('\n') if len(line.rsplit(None,1))][1:]

My output - In this case these are all directories:

[u'/somedirectory/transaction_basket_fct/date_id=2015-01-01',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-02',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-03',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-04',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-05',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-06',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-07',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-08']

Let's break it down:

To run the hdfs dfs -ls /somedirectory command we can use the sh package like this:

import sh
sh.hdfs('dfs','-ls',hdfsdir)

sh allows you to call OS commands seamlessly, as if they were functions on the module; you pass command arguments as function parameters. Really neat.

For me this returns something like:

Found 366 items
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-01
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-02
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-03
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-04
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-05

Split the output into lines on the newline characters using .split('\n').

Obtain the last 'word' of each line using line.rsplit(None,1)[-1].

To avoid issues with empty lines, keep a line only if len(line.rsplit(None,1)) is non-zero.

Finally, remove the first element of the list (the "Found 366 items" header) using [1:].
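
If the one-liner feels too dense, here is a more explicit, untested sketch of the same steps using the same sh package; it also shows one way to separate directories from plain files by looking at the permissions column (hdfsdir is a placeholder):

import sh

hdfsdir = '/somedirectory'  # placeholder HDFS path

files, dirs = [], []
for line in sh.hdfs('dfs', '-ls', hdfsdir).split('\n'):
    parts = line.rsplit(None, 1)
    if not parts or line.startswith('Found'):
        continue                  # skip blank lines and the "Found N items" header
    path = parts[-1]              # the last 'word' on the line is the full path
    if line.startswith('d'):
        dirs.append(path)         # permissions column starts with 'd' for directories
    else:
        files.append(path)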

Answered Nov 03 '22 by JGC


What do I need to have on my computer?

You need Hadoop installed and running, and of course, Python.

How do I query for filenames on HDFS?

You can try something like the following. I haven't tested this code, so don't rely on it blindly.

from subprocess import Popen, PIPE

# Run an hdfs command; swap in 'hdfs dfs -ls <dir>' if you want a listing of filenames
process = Popen('hdfs dfs -cat filename.dat', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()

# Check the return code and stderr before trusting the output
if process.returncode == 0:
    # everything is OK, do whatever you need with std_out
    print(std_out)
else:
    # handle the error case
    print(std_err)

You can also look at Pydoop, which is a Python API for Hadoop.
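
If you go the Pydoop route, the listing can be done directly in Python. A rough sketch, assuming Pydoop is installed and that its hdfs.ls helper returns the paths under a directory (/somedirectory is a placeholder):

import pydoop.hdfs as hdfs

# List the entries under an HDFS directory and print their full paths
for path in hdfs.ls('/somedirectory'):
    print(path)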

Although my example includes shell=True, you can try running without it, since shell=True is a security risk. See: Why you shouldn't use shell=True?
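
As a sketch of the shell=False alternative, pass the command as a list of arguments instead of a single string (the path is a placeholder):

from subprocess import Popen, PIPE

# No shell is involved when the command is given as an argument list
process = Popen(['hdfs', 'dfs', '-ls', '/somedirectory'], stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
print(std_out)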

Answered Nov 03 '22 by sam