List All Files in a Folder Sitting in a Data Lake

I'm trying to get an inventory of all files in a folder, which has a few sub-folders, all of which sit in a data lake. Here is the code that I'm testing.

import sys, os
import pandas as pd

mylist = []
root = "/mnt/rawdata/parent/"
path = os.path.join(root, "targetdirectory") 

for path, subdirs, files in os.walk(path):
    for name in files:
        mylist.append(os.path.join(path, name))


df = pd.DataFrame(mylist)
print(df)

I also tried the sample code from this link:

Python list directory, subdirectory, and files

I'm working in Azure Databricks. I'm open to using Scala to do the job. So far, nothing has worked for me; each time I get an empty dataframe. I believe this is pretty close, but I must be missing something small. Thoughts?

asked Nov 07 '19 by ASH


2 Answers

Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. If you are using the local file APIs, you have to reference the Databricks File System through its local mount. Azure Databricks configures each cluster node with a FUSE mount, /dbfs, that allows processes running on cluster nodes to read and write to the underlying distributed storage layer with local file APIs (see also the documentation).

So /dbfs has to be included in the path:

root = "/dbfs/mnt/rawdata/parent/"

That is different from working with the Databricks Filesystem Utility (DBUtils). The file system utilities access the Databricks File System, making it easier to use Azure Databricks as a file system:

dbutils.fs.ls("/mnt/rawdata/parent/")
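
Note that dbutils.fs.ls lists only a single directory level, but its result can be turned into the inventory dataframe the question asks for (a small sketch; the column names are just for illustration):

import pandas as pd

# Each entry returned by dbutils.fs.ls is a FileInfo with path, name and size attributes
entries = dbutils.fs.ls("/mnt/rawdata/parent/")
df = pd.DataFrame([(e.path, e.name, e.size) for e in entries],
                  columns=["path", "name", "size"])
print(df)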

For larger data lakes I can recommend the Scala example in the Knowledge Base. Its advantage is that it runs the listing for all child leaves in a distributed way, so it also works for bigger directories.
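
For a moderately sized tree, the same idea can also be sketched from the driver in Python by recursing over dbutils.fs.ls (a minimal, non-distributed sketch; the helper name deep_ls is made up here, and this is not the distributed Knowledge Base version, so it will be slow on very large directories):

def deep_ls(path):
    # Directory entries returned by dbutils.fs.ls have names ending in "/"
    files = []
    for entry in dbutils.fs.ls(path):
        if entry.name.endswith("/"):
            files.extend(deep_ls(entry.path))
        else:
            files.append(entry.path)
    return files

print(deep_ls("/mnt/rawdata/parent/"))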

answered Sep 23 '22 by Hauke Mallow


I got this to work.

# BlockBlobService comes from the older azure-storage SDK
from azure.storage.blob import BlockBlobService

blob_service = BlockBlobService(account_name='your_account_name', account_key='your_account_key')

# Page through all blobs in the 'rawdata' container using the continuation marker
blobs = []
marker = None
while True:
    batch = blob_service.list_blobs('rawdata', marker=marker)
    blobs.extend(batch)
    if not batch.next_marker:
        break
    marker = batch.next_marker

for blob in blobs:
    print(blob.name)

The only prerequisite is that the azure.storage package is installed on the cluster. In the Clusters window, click 'Install New' -> 'PyPI' and set the package to 'azure.storage'. Finally, click 'Install'.
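
To get back to the dataframe inventory the question asks for, the blob names collected above can be loaded into pandas (a small sketch that builds on the blobs list from the code above):

import pandas as pd

# Build the inventory dataframe from the blob names collected above
df = pd.DataFrame([blob.name for blob in blobs], columns=["name"])
print(df)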

answered Sep 26 '22 by ASH