I've uploaded a *.zip file to my Azure Databricks FileStore:
Now I would like to unzip it and store the contents in FileStore under dbfs:/FileStore/tables/rfc_model.
I know it should be easy, but I get confused working in the Databricks notebooks ...
Thank you for your help!
UPDATE:
I've tried these commands, with no success:
%sh
unzip /FileStore/tables/rfc_model.zip
and
%sh
unzip dbfs:/FileStore/tables/rfc_model.zip
UPDATE:
I've copied the code created by @Sim into my Databricks notebook, but this error appears:
Any idea how to fix this?
After you download a zip file to a temp directory, you can use the Databricks %sh magic command to run unzip on it. For the sample file used in the notebooks, a tail step removes a comment line from the unzipped file.
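A rough sketch of that flow from a Scala cell (the URL and file names are hypothetical, and it assumes wget and unzip are available on the driver, which is typically the case; a %sh cell achieves the same thing):
import scala.sys.process._

// download a zip to a driver-local temp directory (hypothetical URL and file names)
"wget -q -P /tmp https://example.com/sample.csv.zip".!!
"unzip -o /tmp/sample.csv.zip -d /tmp".!!
// drop the leading comment line from the unzipped file, as the docs' tail step does
Seq("bash", "-c", "tail -n +2 /tmp/sample.csv > /tmp/sample_clean.csv").!!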
When you use certain features, Azure Databricks puts files in the following folders under FileStore: /FileStore/jars - contains libraries that you upload. If you delete files in this folder, libraries that reference these files in your workspace may no longer work.
When you use %sh you are executing shell commands on the driver node using its local filesystem. However, /FileStore/ is not in the local filesystem, which is why you are experiencing the problem. You can see that by trying:
%sh ls /FileStore
# ls: cannot access '/FileStore': No such file or directory
vs.
dbutils.fs.ls("/FileStore")
// resX: Seq[com.databricks.backend.daemon.dbutils.FileInfo] = WrappedArray(...)
You have to either use an unzip utility that can work with the Databricks file system, or copy the zip from the file store to the driver disk, unzip it there, and then copy the result back to /FileStore.
You can address the local file system using file:/..., e.g.,
dbutils.fs.cp("/FileStore/file.zip", "file:/tmp/file.zip")
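Putting those pieces together, here is a minimal sketch of the round trip, assuming the zip from the question sits at dbfs:/FileStore/tables/rfc_model.zip and using scala.sys.process to call unzip on the driver (the /tmp paths are just examples, and the unzip binary is assumed to be available, as it usually is on Databricks):
import scala.sys.process._

// 1. copy the zip from DBFS to the driver's local disk
dbutils.fs.cp("/FileStore/tables/rfc_model.zip", "file:/tmp/rfc_model.zip")
// 2. unzip locally on the driver
"unzip -o /tmp/rfc_model.zip -d /tmp/rfc_model".!!
// 3. copy the unzipped folder back into DBFS
dbutils.fs.cp("file:/tmp/rfc_model", "/FileStore/tables/rfc_model", recurse = true)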
Hope this helps.
Side note 1: Databricks file system management is not super intuitive, especially when it comes to the file store. For example, in theory, the Databricks file system (DBFS) is mounted locally as /dbfs/. However, /dbfs/FileStore does not address the file store, while dbfs:/FileStore does. You are not alone. :)
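If you want to check what your own workspace does, a quick, non-authoritative comparison of the two views could look like this (behavior varies by workspace and runtime):
import scala.sys.process._

// DBFS API view of the file store
dbutils.fs.ls("dbfs:/FileStore").foreach(f => println(f.path))
// local FUSE view on the driver; a non-zero exit code means the path is not visible there
val exitCode = Seq("ls", "/dbfs/FileStore").!
println(s"ls /dbfs/FileStore exit code: $exitCode")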
Side note 2: if you need to do this for many files, you can distribute the work to the cluster workers by creating a Dataset[String] with the file paths and then running ds.map { name => ... }.collect(). The collect action will force execution. In the body of the map function you will have to use shell APIs instead of %sh.
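A rough sketch of that idea, assuming the zip files are visible to the workers through the /dbfs FUSE mount (which, per side note 1, may not hold on every workspace) and that unzip is installed on the workers; the file names are hypothetical:
import scala.sys.process._
import spark.implicits._

// hypothetical list of zip files to process
val zipPaths = Seq(
  "/dbfs/FileStore/tables/model_a.zip",
  "/dbfs/FileStore/tables/model_b.zip"
).toDS()

val results = zipPaths.map { path =>
  // shell API instead of %sh, because this body runs on the workers
  Seq("unzip", "-o", path, "-d", path.stripSuffix(".zip")).!!
}.collect()  // collect forces execution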
Side note 3: a while back I used the following Scala utility to unzip on Databricks. Can't verify it still works but it could give you some ideas.
import java.io.{FileInputStream, FileOutputStream}
import java.util.zip.ZipInputStream

def unzipFile(zipPath: String, outPath: String): Unit = {
  val fis = new FileInputStream(zipPath)
  val zis = new ZipInputStream(fis)
  val filePattern = """(.*/)?(.*)""".r
  println("Unzipping...")
  // iterate over the zip entries: directories are created through dbutils.fs,
  // files are written through the local filesystem view of outPath
  Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { file =>
    // @todo need a consistent path handling abstraction
    // to address DBFS mounting idiosyncrasies
    // group(1) is null for entries without a directory component
    val dirPart = Option(filePattern.findAllMatchIn(file.getName).next().group(1)).getOrElse("")
    val dirToCreate = outPath.replaceAll("/dbfs", "") + dirPart
    dbutils.fs.mkdirs(dirToCreate)
    val filename = outPath + file.getName
    if (!filename.endsWith("/")) {
      println(s"FILE: ${file.getName} to $filename")
      val fout = new FileOutputStream(filename)
      val buffer = new Array[Byte](1024)
      Stream.continually(zis.read(buffer)).takeWhile(_ != -1).foreach(fout.write(buffer, 0, _))
      fout.close()
    }
  }
  zis.close()
}
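For reference, a hypothetical invocation could look like this (the paths are only examples; note that outPath is expected to be a driver-local /dbfs/... mount path, which side note 1 warns may not cover the file store on every workspace):
// copy the zip to the driver's local disk first, then unzip into the DBFS mount
dbutils.fs.cp("/FileStore/tables/rfc_model.zip", "file:/tmp/rfc_model.zip")
unzipFile("/tmp/rfc_model.zip", "/dbfs/FileStore/tables/rfc_model/")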
This works:
%sh
unzip /dbfs/FileStore/tables/rfc_model.zip
The results then need to be copied to the destination in DBFS if needed.
%sh
cp -r rfc_model /dbfs/FileStore/tables