 

Could not read data from JSON using PySpark

I am new to PySpark. Can anyone help me read JSON data using PySpark? Here is what we have done:

(1) main.py

import os.path
from pyspark.sql import SparkSession

def fileNameInput(filename,spark):

    try:
        if(os.path.isfile(filename)):
            loadFileIntoHdfs(filename,spark)
        else:
            print("File does not exists")
    except OSError:
        print("Error while finding file")


def loadFileIntoHdfs(fileName,spark):
    df = spark.read.json(fileName)
    df.show()


if __name__ == '__main__':

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    file_name = input("Enter file location : ")
    fileNameInput(file_name,spark)

When I run the above code, it throws this error message:

 File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o41.showString.
: org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column

Thanks in advance

Prashant Patel asked Mar 22 '18 11:03

People also ask

How do I read JSON data in PySpark?

JSON is read using the spark.read.json("path") function. To read records that are spread across multiple lines, set the multiLine option to true; by default the multiLine option is false.
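For example, a minimal PySpark sketch of both modes (the file paths here are placeholders, not taken from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-json-example").getOrCreate()

# Default mode: every line of the file must be one complete JSON record.
df_single = spark.read.json("/tmp/records.json")

# multiLine=True lets Spark parse pretty-printed JSON that spans several lines.
df_multi = spark.read.json("/tmp/records_pretty.json", multiLine=True)
df_multi.printSchema()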

How do I read a JSON file in Spark?

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset&lt;Row&gt;. This conversion can be done using SparkSession.read().json() on either a Dataset&lt;String&gt; or a JSON file.

How do I read a JSON file in Databricks?

You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel. In multi-line mode, a file is loaded as a whole entity and cannot be split. For further information, see JSON Files.

Which is the correct code to read an employee.json file in Spark?

This conversion can be done using SQLContext.read.json() on either an RDD of String or a JSON file. Spark SQL provides an option for querying JSON data along with auto-capturing of JSON schemas for both reading and writing data.
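In PySpark the same idea can be sketched by passing an RDD of JSON strings to spark.read.json (the records below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-from-rdd-example").getOrCreate()
sc = spark.sparkContext

# Each RDD element is one complete JSON record.
json_rdd = sc.parallelize([
    '{"firstName": "John", "lastName": "Doe"}',
    '{"firstName": "Anna", "lastName": "Smith"}',
])

# Spark infers the schema from the JSON strings and builds a DataFrame.
df = spark.read.json(json_rdd)
df.show()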


1 Answer

Your JSON works in my PySpark. I get a similar error when the record text spans multiple lines. Please ensure that each record fits on one line. Alternatively, tell the reader to support multi-line records:

spark.read.json(filename, multiLine=True)

What works:

{ "employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }

That outputs:

spark.read.json('/home/ernest/Desktop/brokenjson.json').printSchema()
root
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)

When I try some input like this:

{
  "employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }

Then I get the corrupt record in schema:

root
 |-- _corrupt_record: string (nullable = true)

But when read with the multiLine option, the latter input works too.
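Applied to the question, a minimal sketch of how loadFileIntoHdfs could pass that flag (assuming the rest of main.py stays as posted):

def loadFileIntoHdfs(fileName, spark):
    # multiLine=True lets Spark parse records that span several lines,
    # which avoids the _corrupt_record-only schema from the error above.
    df = spark.read.json(fileName, multiLine=True)
    df.show()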

ernest_k answered Jan 07 '23 03:01