I have around four *.sql self-contained dumps ( about 20GB each) which I need to convert to datasets in Apache Spark.
I have tried installing and making a local database using InnoDB and importing the dump but that seems too slow ( spent around 10 hours with that )
I directly read the file into spark using
import org.apache.spark.sql.SparkSession
var sparkSession = SparkSession.builder().appName("sparkSession").getOrCreate()
var myQueryFile = sc.textFile("C:/Users/some_db.sql")
//Convert this to indexed dataframe so you can parse multiple line create / data statements.
//This will also show you the structure of the sql dump for your usecase.
var myQueryFileDF = myQueryFile.toDF.withColumn("index",monotonically_increasing_id()).withColumnRenamed("value","text")
// Identify all tables and data in the sql dump along with their indexes
var tableStructures = myQueryFileDF.filter(col("text").contains("CREATE TABLE"))
var tableStructureEnds = myQueryFileDF.filter(col("text").contains(") ENGINE"))
println(" If there is a count mismatch between these values choose different substring "+ tableStructures.count()+ " " + tableStructureEnds.count())
var tableData = myQueryFileDF.filter(col("text").contains("INSERT INTO "))
The problem is that the dump contains multiple tables as well each of which needs to become a dataset. For which I need to understand if we can do it for even one table. Is there any .sql parser written for scala spark ?
Is there a faster way of going about it? Can I read it directly into hive from .sql self-contained file?
UPDATE 1: I am writing the parser for this based on Input given by Ajay
UPDATE 2: Changing everything to dataset based code to use SQL parser as suggested
Is there any .sql parser written for scala spark ?
Yes, there is one and you seem to be using it already. That's Spark SQL itself! Surprised?
The SQL parser interface (ParserInterface) can create relational entities from the textual representation of a SQL statement. That's almost your case, isn't it?
Please note that ParserInterface deals with a single SQL statement at a time so you'd have to somehow parse the entire dumps and find the table definitions and rows.
The ParserInterface is available as sqlParser of a SessionState.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface
Spark SQL comes with several methods that offer an entry point to the interface, e.g. SparkSession.sql, Dataset.selectExpr or simply expr standard function. You may also use the SQL parser directly.
shameless plug You may want to read about ParserInterface — SQL Parser Contract in the Mastering Spark SQL book.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With