I am new to PySpark. Can anyone help me read JSON data using PySpark? Here is what we have done:
(1) main.py
import os.path
from pyspark.sql import SparkSession

def fileNameInput(filename, spark):
    try:
        # Only attempt the load if the path points at an existing file
        if os.path.isfile(filename):
            loadFileIntoHdfs(filename, spark)
        else:
            print("File does not exist")
    except OSError:
        print("Error while finding file")

def loadFileIntoHdfs(fileName, spark):
    df = spark.read.json(fileName)
    df.show()

if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    file_name = input("Enter file location : ")
    fileNameInput(file_name, spark)
When I run the above code, it throws this error message:
File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/spark/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o41.showString.
: org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
Thanks in advance
JSON is read using the spark.read.json("path") function. Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame; this works on a JSON file, a Dataset<String>, or (with the older SQLContext.read.json API) an RDD of strings, and the schema is captured automatically for both reading and writing. You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel. In multi-line mode, a file is loaded as a whole entity and cannot be split; to read such files, set the multiLine option to true (it defaults to false). For further information, see the JSON Files documentation.
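Here is a minimal sketch of both modes (the file paths are hypothetical, chosen only for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-read-modes").getOrCreate()

# Single-line mode (the default): each line must hold one complete JSON
# record, so Spark can split the file and parse partitions in parallel.
df_single = spark.read.json("/tmp/one_record_per_line.json")

# Multi-line mode: records may span several lines (e.g. pretty-printed
# JSON); each file is read as a single unit and cannot be split.
df_multi = spark.read.json("/tmp/pretty_printed.json", multiLine=True)

df_multi.printSchema()
df_multi.show()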
Your JSON works in my PySpark. I can get a similar error when the record text goes across multiple lines. Please ensure that each record fits on one line. Alternatively, tell the reader to support multi-line records:
spark.read.json(filename, multiLine=True)
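The same flag can equivalently be set through the reader's option interface:

spark.read.option("multiLine", "true").json(filename)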
What works:
{ "employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }
Printing its schema:

spark.read.json('/home/ernest/Desktop/brokenjson.json').printSchema()
root
|-- employees: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- firstName: string (nullable = true)
| | |-- lastName: string (nullable = true)
When I try some input like this:
{
"employees": [{ "firstName": "John", "lastName": "Doe" }, { "firstName": "Anna", "lastName": "Smith" }, { "firstName": "Peter", "lastName": "Jones" } ] }
Then I get only the _corrupt_record column in the schema:
root
|-- _corrupt_record: string (nullable = true)
But when read with the multiLine option enabled, the latter input works too.
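For example, reading the second (multi-line) variant with the option enabled:

spark.read.json('/home/ernest/Desktop/brokenjson.json', multiLine=True).printSchema()

This should print the employees schema shown above instead of _corrupt_record.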