
jsontostructs to Row in spark structured streaming

I'm using Spark 2.2 and I'm trying to read JSON messages from Kafka, transform them to a DataFrame, and have them as a Row:

spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "topic")
    .load()
    .select(col("value").cast(DataTypes.StringType).as("col"))
    .writeStream()
    .format("console")
    .start();

With this I get:

+--------------------+
|                 col|
+--------------------+
|{"myField":"somet...|
+--------------------+

I wanted something more like this:

+--------------------+
|             myField|
+--------------------+
|"something"         |
+--------------------+

I tried to use the from_json function with a struct schema:

DataTypes.createStructType(
    new StructField[] {
            DataTypes.createStructField("myField", DataTypes.StringType, true)
    }
)
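
applied along these lines (a sketch, since the question doesn't show the exact call; here df stands for a DataFrame holding the string col column from above, and schema for the struct):

import static org.apache.spark.sql.functions.*;

// Sketch: from_json parses the string column against the schema, but without
// selecting into the struct, the result is one struct column, jsontostructs(col).
df.select(from_json(col("col"), schema));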

but I only got:

+--------------------+
|  jsontostructs(col)|
+--------------------+
|[something]         |
+--------------------+

Then I tried to use explode, but I only got an exception saying:

cannot resolve 'explode(`col`)' due to data type mismatch: 
input to function explode should be array or map type, not 
StructType(StructField(...

Any idea how to make this work?

asked Oct 12 '17 by Martin Brisiak

People also ask

How do you handle late data in structured streaming?

Watermarking is a feature in Spark Structured Streaming that is used to handle data that arrives late. Spark Structured Streaming can maintain the state of the data that arrives, store it in memory, and update it accurately by aggregating it with the data that arrived late.
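
As a minimal sketch in the question's Java API (the events dataset and eventTime column here are assumptions, not from this thread):

import static org.apache.spark.sql.functions.*;

// Sketch: accept events up to 10 minutes late before finalizing each window.
Dataset<Row> counts = events
    .withWatermark("eventTime", "10 minutes")
    .groupBy(window(col("eventTime"), "5 minutes"))
    .count();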

What is the difference between Spark streaming and structured streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. In the end, all the APIs are optimized by the Spark Catalyst optimizer and translated into RDDs for execution under the hood.

How does Spark handle duplicates in streaming?

Spark doesn't have a distinct method that takes the columns to run the distinct on; however, Spark provides another signature of the dropDuplicates() function which takes multiple columns to eliminate duplicates. Note that calling dropDuplicates() on a DataFrame returns a new DataFrame with duplicate rows removed.
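
For example, a hedged sketch against a hypothetical df with userId and eventId columns:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch: deduplicate using only the listed columns as the key;
// dropDuplicates() returns a new DataFrame, the input is untouched.
Dataset<Row> deduped = df.dropDuplicates("userId", "eventId");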


1 Answer

You're almost there, you just have to select the right thing. from_json returns a struct column matching the schema. If the schema (JSON representation) looks like this:

{"type":"struct","fields":[{"name":"myField","type":"string","nullable":false,"metadata":{}}]}

you'll get a nested object equivalent to:

root
 |-- jsontostructs(col): struct (nullable = true)
 |    |-- myField: string (nullable = false)
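
In the question's Java API, the same schema can be built like this (equivalent to the JSON representation above):

import org.apache.spark.sql.types.*;

StructType schema = DataTypes.createStructType(
    new StructField[] {
        // nullable = false, matching the JSON representation above
        DataTypes.createStructField("myField", DataTypes.StringType, false)
    }
);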

You can use the getField (or getItem) method to select a specific field:

df.select(from_json(col("col"), schema).getField("myField").alias("myField"));

or .* to select all top level fields in the struct:

df.select(from_json(col("col"), schema).alias("tmp")).select("tmp.*");

although for a single string column, get_json_object should be more than enough:

df.select(get_json_object(col("col"), "$.myField"));

answered Sep 19 '22 by zero323