I'm trying to get an inventory of all files in a folder, which has a few sub-folders, all of which sit in a data lake. Here is the code that I'm testing.
import sys, os
import pandas as pd

mylist = []
root = "/mnt/rawdata/parent/"
path = os.path.join(root, "targetdirectory")

for path, subdirs, files in os.walk(path):
    for name in files:
        mylist.append(os.path.join(path, name))

df = pd.DataFrame(mylist)
print(df)
I also tried the sample code from this link:
Python list directory, subdirectory, and files
I'm working in Azure Databricks. I'm open to using Scala to do the job. So far, nothing has worked for me. Each time, I keep getting an empty dataframe. I believe this is pretty close, but I must be missing something small. Thoughts?
Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. If you are using the local file API, you have to reference the Databricks filesystem. Azure Databricks configures each cluster node with a FUSE mount /dbfs that allows processes running on cluster nodes to read and write to the underlying distributed storage layer with local file APIs (see also the documentation).
So /dbfs has to be included in the path:
root = "/dbfs/mnt/rawdata/parent/"
That is different from working with the Databricks Filesystem Utility (DBUtils). The file system utilities access the Databricks File System, making it easier to use Azure Databricks as a file system:
dbutils.fs.ls("/mnt/rawdata/parent/")
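Note that dbutils.fs.ls only lists a single level, so for the nested folders in the question you would have to recurse yourself. A small sketch, assuming the isDir() and path members of the returned FileInfo objects:

def deep_ls(path):
    # collect every file path below `path` by recursing into sub-directories
    files = []
    for entry in dbutils.fs.ls(path):
        if entry.isDir():
            files.extend(deep_ls(entry.path))
        else:
            files.append(entry.path)
    return files

all_files = deep_ls("/mnt/rawdata/parent/")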
For larger data lakes, I can recommend the Scala example in the Knowledge Base. Its advantage is that it runs the listing for all child leaves in a distributed way, so it also works for bigger directories.
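I don't have that code at hand here, but a rough Python approximation of the idea (not the Knowledge Base code itself) is to list the first level on the driver and fan the per-directory walk out to the workers through the /dbfs FUSE mount mentioned above:

import os

root = "dbfs:/mnt/rawdata/parent/"
entries = dbutils.fs.ls(root)

# files sitting directly in the root
all_files = [e.path for e in entries if not e.isDir()]

def walk_local(dir_path):
    # workers see the same FUSE mount, so rewrite dbfs:/... to /dbfs/... and walk it
    local = dir_path.replace("dbfs:", "/dbfs")
    found = []
    for r, _, names in os.walk(local):
        found.extend(os.path.join(r, name) for name in names)
    return found

# walk each first-level sub-directory in parallel on the cluster
sub_dirs = [e.path for e in entries if e.isDir()]
all_files += sc.parallelize(sub_dirs).flatMap(walk_local).collect()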
I got this to work.
from azure.storage.blob import BlockBlobService

blob_service = BlockBlobService(account_name='your_account_name', account_key='your_account_key')

blobs = []
marker = None
while True:
    # page through the 'rawdata' container using the continuation marker
    batch = blob_service.list_blobs('rawdata', marker=marker)
    blobs.extend(batch)
    if not batch.next_marker:
        break
    marker = batch.next_marker

for blob in blobs:
    print(blob.name)
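And if you want the same pandas DataFrame the question builds, the blob names can go straight into one:

import pandas as pd

# inventory as a DataFrame, like in the question
df = pd.DataFrame([blob.name for blob in blobs], columns=["path"])
print(df)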
The only prerequisite is that you need to install the azure.storage package. So, in the Clusters window, click 'Install New' -> PyPI -> package = 'azure.storage'. Finally, click 'Install'.