 

Pyspark: How to convert a spark dataframe to json and save it as json file?

I am trying to convert my pyspark sql dataframe to json and then save as a file.

df_final = df_final.union(join_df)

df_final contains the value as such:

[image: sample rows of df_final, e.g. Variable=Col1, Min=20, Max=30; Variable=Col2, Min=25, Max=40]

I tried something like this, but it created invalid JSON.

df_final.coalesce(1).write.format('json').save(data_output_file+"createjson.json", overwrite=True)

{"Variable":"Col1","Min":"20","Max":"30"}
{"Variable":"Col2","Min":"25","Max":"40"}

My expected file should have data as below:

[
{"Variable":"Col1",
"Min":"20",
"Max":"30"},
{"Variable":"Col2",
"Min":"25",
"Max":"40"}]
Shankar Panda asked Nov 22 '18


People also ask

How do I convert to JSON in PySpark?

The to_json() function in PySpark converts a column of MapType or StructType to a JSON string. The json_tuple() function extracts fields from a JSON string and returns them as new columns.

How do I export a DataFrame in PySpark?

Use the write property of a PySpark DataFrame to get a DataFrameWriter, which can export the DataFrame to formats such as CSV or JSON. It writes the DataFrame to the file path you specify on disk; by default the CSV writer does not write a header row of column names.


3 Answers

In PySpark you can store your dataframe into a JSON file directly; there is no need to convert the dataframe to JSON first.

df_final.coalesce(1).write.format('json').save('/path/file_name.json')

If you still want to convert your dataframe to JSON strings, you can use df_final.toJSON().
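Combining the two ideas, toJSON() produces each row as a separate JSON string, which can then be parsed and dumped as one array (a minimal sketch; since no Spark session is assumed here, the literal strings below stand in for the result of df_final.toJSON().collect()):

```python
import json

# Stand-ins for df_final.toJSON().collect(): each element is one row
# of the dataframe serialized as a JSON string.
json_rows = [
    '{"Variable":"Col1","Min":"20","Max":"30"}',
    '{"Variable":"Col2","Min":"25","Max":"40"}',
]

# Parse each row, then dump the whole list as a single JSON array,
# which matches the expected output in the question.
records = [json.loads(r) for r in json_rows]
with open("createjson.json", "w") as outfile:
    json.dump(records, outfile)
```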

Sahil Desai answered Sep 27 '22


A solution can be to collect the rows and then write them out with json.dump:

import json

collected_rows = df_final.collect()
# collect() returns Row objects; convert each one to a plain dict
# so json.dump can serialize the list as a single JSON array.
data = [row.asDict() for row in collected_rows]
with open(data_output_file + 'createjson.json', 'w') as outfile:
    json.dump(data, outfile)
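An alternative along the same lines is to let pandas emit the array directly via to_json(orient="records") (a sketch, assuming the data fits in driver memory; the literal DataFrame below stands in for the result of df_final.toPandas()):

```python
import pandas as pd

# Stand-in for df_final.toPandas(): a pandas DataFrame with the same columns.
pdf = pd.DataFrame([
    {"Variable": "Col1", "Min": "20", "Max": "30"},
    {"Variable": "Col2", "Min": "25", "Max": "40"},
])

# orient="records" serializes the frame as one JSON array of row objects.
json_text = pdf.to_json(orient="records")
```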
OmG answered Sep 27 '22


If you want to use Spark to process the result as JSON files, then the output schema you already have in HDFS is right.

And I assume you ran into trouble reading that data from a normal Python script using:

import json

with open('data.json') as f:
  data = json.load(f)

You should instead read the data line by line:

data = []
with open("data.json",'r') as datafile:
  for line in datafile:
    data.append(json.loads(line))

and then you can use pandas to create a dataframe:

import pandas as pd

df = pd.DataFrame(data)
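As a shortcut, pandas can also read JSON Lines directly with read_json(lines=True), replacing the manual loop (a minimal sketch; the small file written below stands in for Spark's actual output):

```python
import pandas as pd

# Write a small JSON Lines file shaped like Spark's json writer output:
# one JSON object per line.
with open("data.json", "w") as f:
    f.write('{"Variable":"Col1","Min":"20","Max":"30"}\n')
    f.write('{"Variable":"Col2","Min":"25","Max":"40"}\n')

# lines=True tells pandas that each line is a separate JSON record.
df = pd.read_json("data.json", lines=True)
```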
chilun answered Sep 27 '22