databricks spark sql copy into not loading data

Running Azure Databricks on Runtime 8.4.

I have a CSV file containing data from the AdventureWorks sample database. I have placed it in an Azure DLS Gen2 storage blob container.

I have mounted the container using dbutils.fs.mount, with configs set to use OAuth and a Databricks secret scope. The mount succeeds and I can list the CSV files in it.

I have created a database, and then created a table.

create database if not exists work_db;

CREATE OR REPLACE TABLE dim_account (
  `AccountKey` STRING,
  `ParentAccountKey` STRING,
  `AccountCodeAlternateKey` STRING,
  `ParentAccountCodeAlternateKey` STRING,
  `AccountDescription` STRING,
  `AccountType` STRING,
  `Operator` STRING,
  `CustomMembers` STRING,
  `ValueType` STRING,
  `CustomMemberOptions` STRING,
  `corrupt` STRING
);

I have run all kinds of variations of the following (trying different format options):

use work_db;

truncate table dim_account;

copy into dim_account
  from (
    select AccountKey, ParentAccountKey, AccountCodeAlternateKey, ParentAccountCodeAlternateKey,
    AccountDescription, AccountType, Operator, CustomMembers, ValueType, CustomMemberOptions
    from 'dbfs:/mnt/csv_source'
  )
  FILEFORMAT = csv
  FILES = ('DimAccount.csv')
  FORMAT_OPTIONS('header'='true','columnNameOfCorruptRecord'='corrupt')
;

select * from dim_account;

I believe there was a point where it pulled data from the CSV file, but it now does not. I get the following output (without the select):

num_affected_rows  num_inserted_rows
0                  0

But, if I do something like the following (also tried a number of variations):

%python

dataSchema = "AccountKey STRING, ParentAccountKey STRING, AccountCodeAlternateKey STRING, ParentAccountCodeAlternateKey STRING, AccountDescription STRING, AccountType STRING, Operator STRING, CustomMembers STRING, ValueType STRING, CustomMemberOptions STRING, corrupt STRING"

diamonds = spark.read.csv('/mnt/csv_source/DimAccount.csv', \
    header=True, schema=dataSchema, enforceSchema=True, columnNameOfCorruptRecord='corrupt').cache()

diamonds.createOrReplaceTempView('dim_account_error_on_load')

I have no problem retrieving data.

I really don't know what's going on with COPY INTO (I'm new to all this). Work would prefer to use SQL (or at most something no more complex than spark.sql()), so I'm trying to make COPY INTO work. COPY INTO can also operate on multiple files with a given file structure, which we are interested in doing, but for now I'm just trying to get one file to work, let alone multiple.
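For reference, once a single file loads, the multiple-file case should only need a glob. A hedged sketch, assuming Databricks' documented `PATTERN` option for COPY INTO (the `Dim*.csv` glob is an illustrative assumption, not from the original post):

```sql
-- Sketch: load every matching file in the mounted directory instead of
-- naming one file with FILES; PATTERN takes a glob relative to the source.
COPY INTO dim_account
  FROM (
    SELECT AccountKey, ParentAccountKey, AccountCodeAlternateKey, ParentAccountCodeAlternateKey,
           AccountDescription, AccountType, Operator, CustomMembers, ValueType, CustomMemberOptions
    FROM 'dbfs:/mnt/csv_source'
  )
  FILEFORMAT = CSV
  PATTERN = 'Dim*.csv'
  FORMAT_OPTIONS ('header' = 'true', 'columnNameOfCorruptRecord' = 'corrupt');
```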

asked Nov 16 '25 by kindaran


1 Answer

I got an answer from work. The issue is very simple: COPY INTO tracks the files it has already processed. By default, if you attempt to process the same file again (at least by name), it won't load the data. There is an option to force the load of such a file.
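Concretely, the option referred to above is the `force` copy option, a sketch of which (assuming Databricks' documented `COPY_OPTIONS` syntax) would look like:

```sql
-- COPY INTO is idempotent: previously loaded files are skipped, which is
-- why re-running it reported 0 inserted rows. Setting 'force' = 'true'
-- re-loads the file even though it was ingested before.
COPY INTO dim_account
  FROM (
    SELECT AccountKey, ParentAccountKey, AccountCodeAlternateKey, ParentAccountCodeAlternateKey,
           AccountDescription, AccountType, Operator, CustomMembers, ValueType, CustomMemberOptions
    FROM 'dbfs:/mnt/csv_source'
  )
  FILEFORMAT = CSV
  FILES = ('DimAccount.csv')
  FORMAT_OPTIONS ('header' = 'true', 'columnNameOfCorruptRecord' = 'corrupt')
  COPY_OPTIONS ('force' = 'true');
```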

Sigh... it's hard being a noob.

answered Nov 18 '25 by kindaran


