How to get a list of files from hdfs (hadoop) directory using python script?
I have tried the following line:
dir = sc.textFile("hdfs://127.0.0.1:1900/directory").collect()
The directory contains a list of files "file1, file2, file3 .... fileN". Using the line above I only get the contents of those files, but what I need is the list of file names.
Can anyone please help me figure this out?
Thanks in advance.
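As a side note, sc.textFile(...).collect() returns file contents, not file names. If you want to stay inside PySpark, one option is to go through the JVM gateway to Hadoop's FileSystem API. This is only a rough sketch, assuming a live SparkContext named sc and the namenode address from the question; _jvm and _jsc are internal attributes, so treat it accordingly.

# Rough sketch: list file names in an HDFS directory via Spark's JVM gateway.
# Assumes `sc` is a live SparkContext; adjust the namenode address as needed.
hadoop = sc._jvm.org.apache.hadoop
path = hadoop.fs.Path("hdfs://127.0.0.1:1900/directory")
fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
for status in fs.listStatus(path):
    print(status.getPath().getName())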
Use subprocess
import subprocess

# List the directory with hdfs dfs -ls and keep only the path column ($8)
p = subprocess.Popen("hdfs dfs -ls <HDFS Location> | awk '{print $8}'",
                     shell=True,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
for line in p.stdout.readlines():
    print(line)
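On Python 3 the same idea reads a little cleaner with subprocess.check_output. This is just a sketch of the same command; <HDFS Location> is still a placeholder you need to replace, and text=True assumes Python 3.7+.

import subprocess

# Same hdfs/awk pipeline as above, Python 3 style; replace <HDFS Location>.
out = subprocess.check_output("hdfs dfs -ls <HDFS Location> | awk '{print $8}'",
                              shell=True, text=True)
for name in out.splitlines():
    print(name)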
EDIT: Answer without Python. The first command also lists all sub-directories recursively. The final redirect to output.txt can be omitted or changed to suit your needs.
hdfs dfs -ls -R <HDFS LOCATION> | awk '{print $8}' > output.txt
hdfs dfs -ls <HDFS LOCATION> | awk '{print $8}' > output.txt
EDIT: Corrected a missing quote in the awk command.
import subprocess

path = "/data"
args = "hdfs dfs -ls " + path + " | awk '{print $8}'"
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
s_output, s_err = proc.communicate()
all_dart_dirs = s_output.split()  # list of files and sub-directories in 'path'
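A small follow-up, not part of the original answer: on Python 3, communicate() returns bytes, so decode them before use; splitting each path on '/' keeps just the file names the question asked for.

# Follow-up sketch: decode the byte strings (Python 3) and keep only the
# file name portion of each path.
file_paths = [p.decode("utf-8") for p in all_dart_dirs]
file_names = [p.rsplit("/", 1)[-1] for p in file_paths]
print(file_names)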