
How to generate datasets dynamically based on schema?

I have multiple schemas like the one below, each with different column names and data types. I want to generate test/simulated data as a DataFrame in Scala for each schema and save it to a Parquet file.

Below is an example schema (from a sample JSON) for which I want to generate dummy values dynamically.

val schema1 = StructType(
  List(
    StructField("a", DoubleType, true),
    StructField("aa", StringType, true),
    StructField("p", LongType, true),
    StructField("pp", StringType, true)
  )
)

I need an RDD/DataFrame like this, with 1000 rows each, whose columns match the schema above.

val data = Seq(
  Row(1d, "happy", 1L, "Iam"),
  Row(2d, "sad", 2L, "Iam"),
  Row(3d, "glad", 3L, "Iam")
)

There are about 200 such datasets for which I need to generate data dynamically, so writing a separate program for each schema is practically impossible for me.

Please help me with your ideas or an implementation, as I am new to Spark.

Is it possible to generate data dynamically based on schemas with different column types?

user3190018 asked Nov 30 '18 07:11



1 Answer

Using @JacekLaskowski's advice, you could generate dynamic data using ScalaCheck generators (Gen) based on the fields/types you are expecting.

It could look like this:

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SaveMode}
import org.scalacheck._

import scala.collection.JavaConverters._

val dynamicValues: Map[(String, DataType), Gen[Any]] = Map(
  ("a", DoubleType) -> Gen.choose(0.0, 100.0),
  ("aa", StringType) -> Gen.oneOf("happy", "sad", "glad"),
  ("p", LongType) -> Gen.choose(0L, 10L),
  ("pp", StringType) -> Gen.oneOf("Iam", "You're")
)

val schemas = Map(
  "schema1" -> StructType(
    List(
      StructField("a", DoubleType, true),
      StructField("aa", StringType, true),
      StructField("p", LongType, true),
      StructField("pp", StringType, true)
    )),
  "schema2" -> StructType(
    List(
      StructField("a", DoubleType, true),
      StructField("pp", StringType, true),
      StructField("p", LongType, true)
    )
  )
)

val numRecords = 1000

schemas.foreach {
  case (name, schema) =>
    // create a data frame
    spark.createDataFrame(
      // of #numRecords records
      (0 until numRecords).map { _ =>
        // each of them a row
        Row.fromSeq(schema.fields.map(field => {
          // with fields based on the schema's fieldname & type else null
          dynamicValues.get((field.name, field.dataType)).flatMap(_.sample).orNull
        }))
      }.asJava, schema)
      // store to parquet
      .write.mode(SaveMode.Overwrite).parquet(name)
}
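
With around 200 schemas, listing every (name, type) pair in dynamicValues may get tedious. One possible variation, sketched below, is to pick a dummy value from the column's data type alone. This is only a sketch under stated assumptions: the helpers randomValue, randomRow and randomRows are hypothetical names that do not appear in the code above, it uses scala.util.Random instead of ScalaCheck, and it covers only a handful of primitive types (anything else becomes null, so those columns must be nullable).

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import scala.util.Random

// Hypothetical helper: choose a dummy value from the data type alone,
// ignoring the column name. Unsupported types fall back to null.
def randomValue(dataType: DataType, rnd: Random): Any = dataType match {
  case DoubleType  => rnd.nextDouble() * 100.0
  case LongType    => rnd.nextInt(1000).toLong
  case IntegerType => rnd.nextInt(1000)
  case BooleanType => rnd.nextBoolean()
  case StringType  => rnd.alphanumeric.take(8).mkString
  case _           => null // assumption: such columns are declared nullable
}

// Hypothetical helper: build one Row with a value per field, in schema order.
def randomRow(schema: StructType, rnd: Random): Row =
  Row.fromSeq(schema.fields.toSeq.map(f => randomValue(f.dataType, rnd)))

// Hypothetical helper: build n rows; a fixed seed keeps test data reproducible.
def randomRows(schema: StructType, n: Int, seed: Long = 42L): Seq[Row] = {
  val rnd = new Random(seed)
  Seq.fill(n)(randomRow(schema, rnd))
}
```

Inside the schemas.foreach loop, the rows could then be passed to spark.createDataFrame(randomRows(schema, numRecords).asJava, schema) in place of the map over dynamicValues. The two approaches can also be combined: look up the (name, type) pair first and fall back to the type-based value only when no generator is registered.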
Tom Lous answered Oct 20 '22 00:10
