List All Files in a Folder Sitting in a Data Lake

I'm trying to get an inventory of all files in a folder, which has a few sub-folders, all of which sit in a data lake. Here is the code that I'm testing.

import sys, os
import pandas as pd

mylist = []
root = "/mnt/rawdata/parent/"
path = os.path.join(root, "targetdirectory") 

for path, subdirs, files in os.walk(path):
    for name in files:
        mylist.append(os.path.join(path, name))


df = pd.DataFrame(mylist)
print(df)

I also tried the sample code from this link:

Python list directory, subdirectory, and files

I'm working in Azure Databricks. I'm open to using Scala to do the job. So far, nothing has worked for me; each time I get an empty dataframe. I believe this is pretty close, but I must be missing something small. Thoughts?

asked Nov 07 '19 by ASH


2 Answers

Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. If you are using the local file APIs, you have to reference the Databricks File System through its local mount. Azure Databricks configures each cluster node with a FUSE mount, /dbfs, that allows processes running on cluster nodes to read and write to the underlying distributed storage layer with local file APIs (see also the documentation).

So /dbfs has to be included in the path:

root = "/dbfs/mnt/rawdata/parent/"

That is different from working with the Databricks Filesystem Utility (DBUtils). The file system utilities access the Databricks File System, making it easier to use Azure Databricks as a file system:

dbutils.fs.ls("/mnt/rawdata/parent/")
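
Note that dbutils.fs.ls lists only a single directory level, but its result can be turned into the inventory dataframe the question asks for (a small sketch; the column names are just for illustration):

import pandas as pd

# Each entry returned by dbutils.fs.ls is a FileInfo with path, name and size attributes
entries = dbutils.fs.ls("/mnt/rawdata/parent/")
df = pd.DataFrame([(e.path, e.name, e.size) for e in entries],
                  columns=["path", "name", "size"])
print(df)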

For larger data lakes I can recommend the Scala example in the Knowledge Base. Its advantage is that it runs the listing for all child leaves in a distributed way, so it also works for bigger directories.
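
For a moderately sized tree, the same idea can also be sketched from the driver in Python by recursing over dbutils.fs.ls (a minimal, non-distributed sketch; the helper name deep_ls is made up here, and this is not the distributed Knowledge Base version, so it will be slow on very large directories):

def deep_ls(path):
    # Directory entries returned by dbutils.fs.ls have names ending in "/"
    files = []
    for entry in dbutils.fs.ls(path):
        if entry.name.endswith("/"):
            files.extend(deep_ls(entry.path))
        else:
            files.append(entry.path)
    return files

print(deep_ls("/mnt/rawdata/parent/"))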

answered Sep 23 '22 by Hauke Mallow


I got this to work.

# BlockBlobService comes from the older azure-storage SDK
from azure.storage.blob import BlockBlobService

blob_service = BlockBlobService(account_name='your_account_name', account_key='your_account_key')

# Page through all blobs in the 'rawdata' container using the continuation marker
blobs = []
marker = None
while True:
    batch = blob_service.list_blobs('rawdata', marker=marker)
    blobs.extend(batch)
    if not batch.next_marker:
        break
    marker = batch.next_marker

for blob in blobs:
    print(blob.name)

The only prerequisite is that the azure.storage package is installed on the cluster. In the Clusters window, click 'Install New' -> 'PyPI' and set the package to 'azure.storage'. Finally, click 'Install'.
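
To get back to the dataframe inventory the question asks for, the blob names collected above can be loaded into pandas (a small sketch that builds on the blobs list from the code above):

import pandas as pd

# Build the inventory dataframe from the blob names collected above
df = pd.DataFrame([blob.name for blob in blobs], columns=["name"])
print(df)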

answered Sep 26 '22 by ASH