Reading a JSON file in PySpark

Tags: apache-spark, pyspark, spark-streaming

I'm new to PySpark. Below is the format of the JSON messages I receive from Kafka.

{
    "header": {
        "platform": "atm",
        "version": "2.0"
    },
    "details": [
        {
            "abc": "3",
            "def": "4"
        },
        {
            "abc": "5",
            "def": "6"
        },
        {
            "abc": "7",
            "def": "8"
        }
    ]
}

How can I read the values of all "abc" and "def" fields in details and add them to a new list like this: [(1,2),(3,4),(5,6),(7,8)]? The new list will be used to create a Spark DataFrame. How can I do this in PySpark? I tried the code below.

parsed = messages.map(lambda (k,v): json.loads(v))
list = []
summed = parsed.map(lambda detail:list.append((String(['mcc']), String(['mid']), String(['dsrc']))))
output = summed.collect()
print output

It produces the error 'too many values to unpack'.

The error is raised at the statement summed.collect(). The full output is below:

16/09/12 12:46:10 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/09/12 12:46:10 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/09/12 12:46:10 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/09/12 12:46:10 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "", line 1, in
ValueError: too many values to unpack

asked Sep 10 '16 21:09

anusha

1 Answer

import pyspark
from pyspark import SparkConf, SparkContext

# You can configure the SparkContext before creating it
conf = SparkConf()
conf.set('spark.local.dir', '/remote/data/match/spark')
conf.set('spark.sql.shuffle.partitions', '2100')
SparkContext.setSystemProperty('spark.executor.memory', '10g')
SparkContext.setSystemProperty('spark.driver.memory', '10g')

# Create the SparkContext, then an SQLContext on top of it
sc = SparkContext(appName='mm_exp', conf=conf)
sqlContext = pyspark.SQLContext(sc)

# Read the JSON file into a DataFrame; the path must be a string.
# By default the reader expects one JSON object per line.
data = sqlContext.read.json('file.json')

I think an important part of the read sequence was missed: you have to initialize a SparkContext before you can read the file.

When you start a SparkContext, it also spins up a web UI, on port 4040 by default. The web UI can be accessed at http://localhost:4040. It is a useful place to check the progress of all calculations.
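Reading the file is only half of what the question asks; the other half is turning each message into (abc, def) tuples. Here is a minimal sketch of that step in plain Python, so it can be tried without a cluster. The function name extract_pairs and the constant SAMPLE_MESSAGE are hypothetical names introduced for illustration, and the sample is the message from the question with a comma added after the "header" object so that it parses as valid JSON.

```python
import json

# Sample message based on the question; a comma has been added after the
# "header" object, since the text as posted is not valid JSON without it.
SAMPLE_MESSAGE = """
{
  "header": {"platform": "atm", "version": "2.0"},
  "details": [
    {"abc": "3", "def": "4"},
    {"abc": "5", "def": "6"},
    {"abc": "7", "def": "8"}
  ]
}
"""


def extract_pairs(message):
    """Return one (abc, def) tuple per entry in the message's "details" list.

    extract_pairs is a hypothetical helper, not part of PySpark.
    """
    record = json.loads(message)
    # A message without a "details" key yields an empty list.
    return [(d["abc"], d["def"]) for d in record.get("details", [])]


pairs = extract_pairs(SAMPLE_MESSAGE)
print(pairs)  # [('3', '4'), ('5', '6'), ('7', '8')]
```

With Spark, the same function could be applied to each message, for example rdd.flatMap(lambda kv: extract_pairs(kv[1])), assuming rdd holds (key, value) pairs as the question's messages appears to. Indexing with kv[1] also avoids the Python 2-only lambda (k, v) syntax. The resulting RDD of tuples can then be passed to sqlContext.createDataFrame together with column names such as ['abc', 'def'].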

answered Oct 09 '22 21:10

Gfranco008
