I am using the script below to convert a JSON file to Parquet format with the pandas library. However, the following error occurs: AttributeError: 'DataFrame' object has no attribute 'schema'. I am still new to Python.
Here's the original json file I'm using: [ { "a": "01", "b": "teste01" }, { "a": "02", "b": "teste02" } ]
What am I doing wrong?
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.read_json('C:/python/json_teste')
pq = pa.parquet.write_table(df, 'C:/python/parquet_teste')
Error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-1b4ced833098> in <module>
----> 1 pq = pa.parquet.write_table(df, 'C:/python/parquet_teste')
C:\Anaconda\lib\site-packages\pyarrow\parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, write_statistics, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, data_page_size, flavor, filesystem, **kwargs)
1256 try:
1257 with ParquetWriter(
-> 1258 where, table.schema,
1259 filesystem=filesystem,
1260 version=version,
C:\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
-> 5067 return object.__getattribute__(self, name)
5068
5069 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'schema'
Printed file:
#print
print(df)
   a        b
0  1  teste01
1  2  teste02
#following columns
df.columns
Index(['a', 'b'], dtype='object')
#following types
df.dtypes
a     int64
b    object
dtype: object
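The error happens because pyarrow.parquet.write_table expects a pyarrow Table, not a pandas DataFrame: internally it reads table.schema, and a DataFrame has no schema attribute, hence the AttributeError. (Note also that the original script rebinds the name pq to write_table's return value, shadowing the pyarrow.parquet import.) A minimal sketch of the fix, converting the DataFrame first, using the paths from the question:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_json('C:/python/json_teste')
# Convert the DataFrame into a pyarrow Table, which does carry a schema
table = pa.Table.from_pandas(df)
pq.write_table(table, 'C:/python/parquet_teste')
Note that read_json inferred column a as int64, dropping the leading zeros; if you need them kept as strings, pass dtype={'a': str} to read_json.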
You can use Coiled, the cloud-based Dask platform, to convert large JSON datasets into a tabular DataFrame stored as Parquet in a cloud object store.
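A rough sketch of that workflow with plain Dask (the file pattern and bucket path below are hypothetical; on Coiled you would attach a cloud cluster first):
import dask.dataframe as dd

# Read many newline-delimited JSON files in parallel (hypothetical pattern)
ddf = dd.read_json("records-*.json", orient="records", lines=True)
# Write the result out as a directory of Parquet part files (hypothetical bucket)
ddf.to_parquet("s3://my-bucket/converted-parquet/")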
It is quite common today to convert incoming JSON data into Parquet format to improve the performance of analytical queries. When JSON data has an arbitrary schema, i.e., when different records can contain different key-value pairs, it is common to parse such JSON payloads into a map column in Parquet, as sketched below.
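A minimal sketch of that pattern with pyarrow (the records and the payload column name here are made up for illustration):
import pyarrow as pa
import pyarrow.parquet as pq

# Records with differing keys, stored together as one map<string, string> column
records = [{"a": "01", "b": "teste01"}, {"b": "teste02", "c": "extra"}]
pairs = [list(r.items()) for r in records]  # each record as (key, value) tuples
arr = pa.array(pairs, type=pa.map_(pa.string(), pa.string()))
table = pa.table({"payload": arr})
pq.write_table(table, "payload_map.parquet")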
Parquet vs. JSON: JSON stores data as key-value pairs, while Parquet stores data by column. So JSON is the usual choice for storing configuration, whereas the Parquet format is useful when storing data in tabular form for analytics.
You can achieve what you are looking for with PySpark as follows:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("JsonToParquetPysparkExample") \
    .getOrCreate()

# multiLine=True is needed because the file holds one JSON array, not JSON lines
json_df = spark.read.json("C://python/test.json", multiLine=True)
json_df.printSchema()
json_df.write.parquet("C://python/output.parquet")
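One thing to be aware of: write.parquet produces a directory containing one or more part files plus metadata, not a single Parquet file, which is Spark's normal output layout.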
Here's how to convert a JSON file to Apache Parquet format, using Pandas in Python. This is an easy method with a well-known library you may already be familiar with.
First, make sure to install pandas and pyarrow. If you're using Python with Anaconda:
conda install pandas
conda install pyarrow
Then, here is the code:
import pandas as pd
data = pd.read_json(FILEPATH_TO_JSON_FILE)
data.to_parquet(PATH_WHERE_TO_SAVE_PARQUET_FILE)
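To confirm the round trip, you can read the file back as an optional sanity check (reading Parquet with pandas also relies on pyarrow or fastparquet):
# Optional: read the Parquet file back to verify the conversion
data_check = pd.read_parquet(PATH_WHERE_TO_SAVE_PARQUET_FILE)
print(data_check)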
I hope this helps, please let me know if I can clarify anything.