
Create SQL table from parquet files

I am using R to handle large datasets (the largest data frame is about 30,000,000 x 120). These are stored in Azure Data Lake Storage as parquet files, and we need to query them daily and load them into a local SQL database. Parquet files can be read without loading the data into memory, which is handy. However, creating SQL tables from parquet files is more challenging, as I'd prefer not to load the data into memory.
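For example, arrow (loaded below) can scan a parquet source lazily, so only the rows I ask for are materialised. A sketch, where the path is a placeholder:

# Lazy read with arrow: open_dataset() only reads the parquet metadata;
# rows are pulled into memory when collect() runs.
library(arrow)
library(dplyr)
ds <- open_dataset("path/to/parquet")
ds %>% filter(X1 > 0) %>% collect()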

Here is the code I used. Unfortunately, this is not a perfect reprex, as the SQL database needs to exist for this to work.

# load packages
library(tidyverse)
library(arrow)
library(sparklyr)
library(DBI)

# Create test data
test <- data.frame(matrix(rnorm(20), nrow=10))

# Save as parquet file (keep the path so it can be reused below)
parquet_path <- tempfile(fileext = ".parquet")
write_parquet(test, parquet_path)


# Load main table
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
test <- spark_read_parquet(sc, name = "test_main", path = parquet_path, memory = FALSE, overwrite = TRUE)

# Save into SQL table
DBI::dbWriteTable(conn = connection,
                  name = DBI::Id(schema = "schema", table = "table"), 
                  value = test)
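For reference, the connection object above is assumed to be an open DBI connection to the target SQL Server. A sketch using the odbc package, where server, database, and credentials are placeholders:

# Sketch of the assumed connection object; all values below are placeholders.
library(odbc)
connection <- DBI::dbConnect(
  odbc::odbc(),
  Driver   = "ODBC Driver 17 for SQL Server",
  Server   = "my-server.database.windows.net",
  Database = "my-database",
  UID      = "my-user",
  PWD      = "my-password"
)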

Is it possible to write a SQL table without loading parquet files into memory?

asked Sep 17 '25 by obruzzi


1 Answer

I don't have much experience with T-SQL bulk import and export, but that is likely where you'll find your answer.

library(arrow)
library(DBI)
test <- data.frame(matrix(rnorm(20), nrow=10))
f <- tempfile(fileext = '.parquet')
write_parquet(test, f)

# Upload into an existing table using bulk insert
dbExecute(connection, paste0(
  "BULK INSERT [database].[schema].[table]
   FROM '", gsub('\\\\', '/', f), "'
   WITH (FORMAT = 'PARQUET');"
))

Here I use T-SQL's own BULK INSERT command.
Disclaimer: I have not yet used this command in T-SQL, so it may be riddled with errors. For example, I can't see a place to specify snappy compression in the documentation, although it could be specified if one instead defines a custom file format with CREATE EXTERNAL FILE FORMAT.
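A minimal sketch of such a file format definition (the name ParquetSnappy is mine, and this assumes an environment that supports CREATE EXTERNAL FILE FORMAT, e.g. PolyBase or Azure Synapse):

# Sketch: define a reusable parquet file format with snappy compression.
dbExecute(connection, "
  CREATE EXTERNAL FILE FORMAT ParquetSnappy
  WITH (
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
  );
")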

Now, the above only inserts into an existing table. For your specific case, where you'd like to create a new table from the file, you are likely looking for OPENROWSET combined with CREATE TABLE ... AS [select statement].

# Build the SQL column definitions from a named vector of SQL types (one entry per column)
column_definition <- paste(names(column_defs), column_defs, collapse = ', ')

dbExecute(connection, paste0("
  CREATE TABLE MySqlTable
  AS
  SELECT *
  FROM OPENROWSET(
    BULK '", f, "',
    FORMAT = 'PARQUET'
  ) WITH (
    ", column_definition, "
  ) AS [rows];
"))

where column_defs would be a named list or vector giving the SQL data-type definition for each column. A (more or less) complete translation from R data types to SQL data types is available in the T-SQL documentation (note that two very necessary translations, Date and POSIXlt, are not covered). Once again, disclaimer: my time in T-SQL never got as far as BULK INSERT or similar.
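One way to build column_defs without writing it by hand is to lean on DBI's dbDataType(), which asks the driver for its preferred SQL type for each column (a sketch; it will not handle the Date/POSIXlt gaps noted above):

# Sketch: derive SQL types for each column from the data frame itself.
# dbDataType() returns a named character vector, e.g. c(X1 = "FLOAT", X2 = "FLOAT"),
# which plugs straight into the column_definition line above.
column_defs <- DBI::dbDataType(connection, test)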

answered Sep 19 '25 by Oliver