I'm running some large jobs in Databricks, which for now include inventorying the data lake. I'm trying to print all blob names within a prefix (sub-folder). There are a lot of files in these sub-folders; I get about 280 rows of file names printed, but then I see this:
*** WARNING: skipped 494256 bytes of output ***
Then another ~280 rows are printed.
I'm guessing there is a control to change this, right? I certainly hope so. This is designed to work with BIG data, not ~280 records. I understand that huge result sets can easily crash a browser, but come on, this is basically nothing.
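For context, the listing code looks roughly like this (a minimal sketch; the container path is illustrative, and dbutils is available implicitly in Databricks notebooks):

# List everything under a prefix and print each blob name.
files = dbutils.fs.ls("abfss://container@account.dfs.core.windows.net/some/prefix/")
for f in files:
    print(f.name)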
Note: Using the GUI, you can download full results (max 1 million rows).
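In a notebook, that means rendering the listing with display() instead of printing it, so the results grid's download option is available. A minimal sketch (the path and column names are illustrative; spark and dbutils are predefined in Databricks notebooks):

# Turn the file listing into a DataFrame and render it in the results grid,
# which offers a download-full-results option (up to 1 million rows).
files = dbutils.fs.ls("dbfs:/mnt/datalake/some/prefix/")
df = spark.createDataFrame([(f.path, f.name, f.size) for f in files], ["path", "name", "size"])
display(df)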
To download full results (more than 1 million rows), first save the file to DBFS and then copy it to your local machine using the Databricks CLI, as follows:
dbfs cp "dbfs:/FileStore/tables/AA.csv" "A:\AzureAnalytics"
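The notebook-side step that produces that AA.csv could look like the following (a sketch, assuming the listing DataFrame df from above; coalesce(1) makes Spark emit a single part file, and the output folder name is illustrative):

# Write the listing as one CSV part file, then copy it to a stable path for the CLI.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("dbfs:/FileStore/tables/AA_out")
part = [f.path for f in dbutils.fs.ls("dbfs:/FileStore/tables/AA_out") if f.name.startswith("part-")][0]
dbutils.fs.cp(part, "dbfs:/FileStore/tables/AA.csv")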
Reference: Databricks file system
The DBFS command-line interface (CLI) uses the DBFS API to expose an easy-to-use command-line interface to DBFS. With this client, you can interact with DBFS using commands similar to those you would use on a Unix command line. For example:
# List files in DBFS
dbfs ls
# Put local file ./apple.txt to dbfs:/apple.txt
dbfs cp ./apple.txt dbfs:/apple.txt
# Get dbfs:/apple.txt and save to local file ./apple.txt
dbfs cp dbfs:/apple.txt ./apple.txt
# Recursively put local dir ./banana to dbfs:/banana
dbfs cp -r ./banana dbfs:/banana
Reference: Installing and configuring Azure Databricks CLI
Hope this helps.