List the files of a directory and subdirectories recursively in Databricks (DBFS)

Using Python/dbutils, how can I display the files of the current directory and its subdirectories recursively in the Databricks file system (DBFS)?

Asked Sep 18 '20 by Kiran A

People also ask

How do I check Databricks files?

You can access the file system using magic commands such as %fs or %sh. You can also use the Databricks file system utility (dbutils.fs). Azure Databricks uses a FUSE mount to provide local access to files stored in the cloud.
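As a locally runnable sketch of the FUSE-mount idea: once cloud storage is visible at a local path (on Databricks, under `/dbfs/`), any standard Python file API can walk it recursively. The directory tree below is a made-up stand-in for illustration; inside a notebook you would point `Path` at a real `/dbfs/...` path instead.

```python
import tempfile
from pathlib import Path

# Build a small throwaway tree to stand in for a FUSE-mounted DBFS path
# (hypothetical stand-in for something like /dbfs/databricks-datasets).
root = Path(tempfile.mkdtemp())
(root / "sub" / "deeper").mkdir(parents=True)
(root / "a.csv").write_text("x")
(root / "sub" / "b.csv").write_text("y")
(root / "sub" / "deeper" / "c.csv").write_text("z")

# Path.rglob("*") walks the tree recursively -- the same call works on
# /dbfs/... paths in a Databricks notebook, thanks to the FUSE mount.
files = sorted(p.relative_to(root).as_posix()
               for p in root.rglob("*") if p.is_file())
print(files)  # ['a.csv', 'sub/b.csv', 'sub/deeper/c.csv']
```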

Where are files stored in Databricks?

When you use certain features, Azure Databricks puts files in the following folders under FileStore: /FileStore/jars - contains libraries that you upload. If you delete files in this folder, libraries that reference these files in your workspace may no longer work.


1 Answer

A surprising thing about dbutils.fs.ls (and the %fs magic command) is that it doesn't seem to support any recursive switch. However, since the ls function returns a list of FileInfo objects, it's quite trivial to iterate over them recursively to get the whole contents, e.g.:

def get_dir_content(ls_path):
  # List the immediate children of ls_path
  dir_paths = dbutils.fs.ls(ls_path)
  # Recurse into each subdirectory (the guard avoids re-listing ls_path itself)
  subdir_paths = [get_dir_content(p.path) for p in dir_paths if p.isDir() and p.path != ls_path]
  # Flatten the list of lists produced by the recursive calls
  flat_subdir_paths = [p for subdir in subdir_paths for p in subdir]
  # Return this level's paths followed by everything found below it
  return [p.path for p in dir_paths] + flat_subdir_paths


paths = get_dir_content('/databricks-datasets/COVID/CORD-19/2020-03-13')
for p in paths:
  print(p)
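Since dbutils only exists inside a Databricks runtime, here is a self-contained sketch of the same recursion run against a stand-in for FileInfo and dbutils.fs.ls. The FakeFileInfo class and the fake directory tree are invented purely for illustration; in a notebook you would pass the real dbutils.fs.ls instead.

```python
from collections import namedtuple

# Stand-in for Databricks' FileInfo (hypothetical mock for local testing).
# Real DBFS directory paths end with "/", which is what isDir() checks here.
class FakeFileInfo(namedtuple("FakeFileInfo", ["path", "name"])):
    def isDir(self):
        return self.path.endswith("/")

# Fake filesystem: maps a directory path to its children, mimicking dbutils.fs.ls.
FAKE_FS = {
    "/data/": [FakeFileInfo("/data/sub/", "sub/"),
               FakeFileInfo("/data/a.csv", "a.csv")],
    "/data/sub/": [FakeFileInfo("/data/sub/b.csv", "b.csv")],
}

def fake_ls(path):
    return FAKE_FS[path]

def get_dir_content(ls_path, ls=fake_ls):
    # Same recursion as the answer above, with dbutils.fs.ls swapped for `ls`
    dir_paths = ls(ls_path)
    subdir_paths = [get_dir_content(p.path, ls)
                    for p in dir_paths if p.isDir() and p.path != ls_path]
    flat_subdir_paths = [p for subdir in subdir_paths for p in subdir]
    return [p.path for p in dir_paths] + flat_subdir_paths

print(get_dir_content("/data/"))
# ['/data/sub/', '/data/a.csv', '/data/sub/b.csv']
```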
Answered Sep 17 '22 by Daniel