
Can't access directory from HDFS inside a Python script

I have the following Python script (I managed to run it locally):

#!/usr/bin/env python3

import folderstats

df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)

df.to_csv(r'hdfs://quickstart.cloudera:8020/user/cloudera/files.csv', sep=',', index=True)

I have the directory "files" in that location. I checked this through the command line and even with HUE, and it's there.

(myproject) [cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera
Found 1 items
drwxrwxrwx   - cloudera cloudera          0 2019-06-01 13:30 /user/cloudera/files

The problem is that the directory can't be accessed.

I tried to run it from my local terminal with python3 script.py, and even as a super-user with sudo -u hdfs python3 script.py, but the output says:

Traceback (most recent call last):
  File "script.py", line 5, in <module>
    df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 88, in folderstats
    verbose=verbose)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 32, in _recursive_folderstats
    for f in os.listdir(folderpath):
FileNotFoundError: [Errno 2] No such file or directory: 'hdfs://quickstart.cloudera:8020/user/cloudera/files'

Can you please help me clarify this issue?

Thank you!

TheRichUncle asked Apr 01 '26

1 Answer

Python runs on a single machine and, by default, only sees that machine's local Linux (or Windows) filesystem (FS).

Hadoop's HDFS is a distributed file system spread across many machines (nodes).
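The traceback in the question makes this concrete: folderstats walks directories with os.listdir(), which only understands local paths, so an hdfs:// URI is treated as a (non-existent) local directory name. A minimal demonstration:

```python
import os

# os.listdir() only resolves paths on the local filesystem; it does not
# speak the HDFS protocol, so the URI below is interpreted as a literal
# directory name on the local disk and is not found.
uri = "hdfs://quickstart.cloudera:8020/user/cloudera/files"
try:
    os.listdir(uri)
except FileNotFoundError as e:
    print(f"FileNotFoundError: {e}")  # the same error folderstats raised
```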

There may be some custom class out there that reads HDFS data from a single machine, but I am not aware of any, and doing so defeats the purpose of distributed computing.

You could either copy your data from HDFS (source) to the local filesystem (target) where Python lives, e.g.:

hadoop fs -get hdfs://quickstart.cloudera:8020/user/cloudera/files /home/user/<target_directory_name>

or use something like Spark, Hive, or Impala to process/query the data where it sits.

If the data volume is quite small, copying the files from HDFS to the local FS and running the Python script there should be efficient enough, especially on something like the Cloudera Quickstart VM.
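As a sketch of the copy-then-process option (the helper names here are mine, not from the question; it assumes the hadoop CLI is on PATH and folderstats is installed):

```python
import shutil
import subprocess

def build_get_command(hdfs_path, local_dir):
    """Return the `hadoop fs -get` invocation as an argument list."""
    return ["hadoop", "fs", "-get", hdfs_path, local_dir]

def copy_and_stat(hdfs_path, local_dir, csv_out):
    """Copy an HDFS directory to the local FS, then stat it locally."""
    subprocess.run(build_get_command(hdfs_path, local_dir), check=True)
    import folderstats  # third-party; imported lazily
    df = folderstats.folderstats(local_dir, hash_name="md5",
                                 ignore_hidden=True)
    df.to_csv(csv_out, sep=",", index=True)  # CSV written locally too

# Only attempt the copy where a Hadoop client actually exists.
if shutil.which("hadoop"):
    copy_and_stat(
        "hdfs://quickstart.cloudera:8020/user/cloudera/files",
        "/home/cloudera/files_local",
        "/home/cloudera/files.csv",
    )
```

If the CSV is also needed back in HDFS, a final `hadoop fs -put` of the output file completes the round trip.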

thePurplePython answered Apr 02 '26

