Get a list of file names from HDFS using python

Tags: python, hadoop

Hadoop noob here.

I've searched for tutorials on getting started with Hadoop and Python without much success. I don't need to do any work with mappers and reducers yet; this is more of an access issue.

As part of a Hadoop cluster, there are a bunch of .dat files on HDFS.

In order to access those files from my client (local computer) using Python:

What do I need to have on my computer?

How do I query for filenames on HDFS?

Any links would be helpful too.

Asked Sep 03 '15 by Raaj

People also ask

How do I list files in HDFS?

Use the hdfs dfs -ls command, passing the path of the directory you want to list.

How do I list all files in HDFS and size?

You can use the hadoop fs -ls command to list files in a directory along with their details. The fifth column of the output contains the file size in bytes.

Which command is used to list the contents in the HDFS?

ls: lists the files and directories under a given path in HDFS, similar to the Unix ls command. For a recursive listing of directories and files, use ls -R (the older -lsr form is deprecated).
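
As a quick illustration of that recursive listing from Python, here is a minimal, untested sketch using the standard subprocess module; /somedirectory is a placeholder path and the hdfs client is assumed to be on your PATH:

from subprocess import run

# Run "hdfs dfs -ls -R" and collect the listed paths
result = run(['hdfs', 'dfs', '-ls', '-R', '/somedirectory'],
             capture_output=True, text=True, check=True)

# Each listing line ends with the full path; skip blanks and the "Found N items" header
paths = [line.rsplit(None, 1)[-1]
         for line in result.stdout.splitlines()
         if line and not line.startswith('Found')]
print(paths)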


2 Answers

As far as I've been able to tell, there is no out-of-the-box solution for this, and most answers I've found resort to calling the hdfs command. I'm running on Linux and have the same challenge. I've found the sh package useful: it handles running OS commands for you and manages stdin/stdout/stderr.

See here for more info on it: https://amoffat.github.io/sh/

Not the neatest solution, but it gets the job done in one line (ish) using a readily available package.

Here's my cut-down code to grab an HDFS directory listing. It lists files and folders alike, so you may need to modify it if you want to differentiate between them (see the expanded sketch after the breakdown below).

import sh
hdfsdir = '/somedirectory'
filelist = [ line.rsplit(None,1)[-1] for line in sh.hdfs('dfs','-ls',hdfsdir).split('\n') if len(line.rsplit(None,1))][1:]

My output - In this case these are all directories:

[u'/somedirectory/transaction_basket_fct/date_id=2015-01-01',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-02',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-03',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-04',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-05',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-06',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-07',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-08']

Let's break it down:

To run the hdfs dfs -ls /somedirectory command we can use the sh package like this:

import sh
sh.hdfs('dfs','-ls',hdfsdir)

sh allows you to call OS commands seamlessly, as if they were functions on the module; you pass command arguments as function parameters. Really neat.

For me this returns something like:

Found 366 items
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-01
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-02
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-03
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-04
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-05

Split the output into lines on the newline characters using .split('\n').

Obtain the last 'word' of each line using line.rsplit(None,1)[-1].

To avoid issues with empty lines, keep a line only if len(line.rsplit(None,1)) is non-zero.

Finally, remove the first element of the list (the "Found 366 items" header) using [1:].
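
If the one-liner feels too dense, here is a more explicit, untested sketch of the same steps using the same sh package; it also shows one way to separate directories from plain files by looking at the permissions column (hdfsdir is a placeholder):

import sh

hdfsdir = '/somedirectory'  # placeholder HDFS path

files, dirs = [], []
for line in sh.hdfs('dfs', '-ls', hdfsdir).split('\n'):
    parts = line.rsplit(None, 1)
    if not parts or line.startswith('Found'):
        continue                  # skip blank lines and the "Found N items" header
    path = parts[-1]              # the last 'word' on the line is the full path
    if line.startswith('d'):
        dirs.append(path)         # permissions column starts with 'd' for directories
    else:
        files.append(path)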

Answered Nov 03 '22 by JGC


What do I need to have on my computer?

You need Hadoop installed and running, and of course, Python.

How do I query for filenames on HDFS?

You can try something like the following. I haven't tested this code, so don't rely on it blindly.

from subprocess import Popen, PIPE

# Run an hdfs command; swap in 'hdfs dfs -ls <dir>' if you want a listing of filenames
process = Popen('hdfs dfs -cat filename.dat', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()

# Check the return code and stderr before trusting the output
if process.returncode == 0:
    # everything is OK, do whatever you need with std_out
    print(std_out)
else:
    # handle the error case
    print(std_err)

You can also look at Pydoop, which is a Python API for Hadoop.
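
If you go the Pydoop route, the listing can be done directly in Python. A rough sketch, assuming Pydoop is installed and that its hdfs.ls helper returns the paths under a directory (/somedirectory is a placeholder):

import pydoop.hdfs as hdfs

# List the entries under an HDFS directory and print their full paths
for path in hdfs.ls('/somedirectory'):
    print(path)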

Although my example includes shell=True, you can try running without it, since shell=True is a security risk. See: Why you shouldn't use shell=True?
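
As a sketch of the shell=False alternative, pass the command as a list of arguments instead of a single string (the path is a placeholder):

from subprocess import Popen, PIPE

# No shell is involved when the command is given as an argument list
process = Popen(['hdfs', 'dfs', '-ls', '/somedirectory'], stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
print(std_out)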

Answered Nov 03 '22 by sam