 

How to convert a JSON result to Parquet in Python?

I am using the script below to convert a JSON file to Parquet format, with the pandas library performing the conversion. However, the following error occurs: AttributeError: 'DataFrame' object has no attribute 'schema'. I am still new to Python.

Here's the original JSON file I'm using:

[
  { "a": "01", "b": "teste01" },
  { "a": "02", "b": "teste02" }
]

What am I doing wrong?

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_json('C:/python/json_teste')

pq = pa.parquet.write_table(df, 'C:/python/parquet_teste')

Error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-1b4ced833098> in <module>
----> 1 pq = pa.parquet.write_table(df, 'C:/python/parquet_teste')

C:\Anaconda\lib\site-packages\pyarrow\parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, write_statistics, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, data_page_size, flavor, filesystem, **kwargs)
   1256     try:
   1257         with ParquetWriter(
-> 1258                 where, table.schema,
   1259                 filesystem=filesystem,
   1260                 version=version,

C:\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5066                 return self[name]
-> 5067             return object.__getattribute__(self, name)
   5068 
   5069     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'schema'

Printing the file:

#print 
print(df)
   a        b
0  1  teste01
1  2  teste02

#following columns
df.columns
Index(['a', 'b'], dtype='object')

#following types
df.dtypes
a     int64
b    object
dtype: object
asked Dec 02 '19 by Mateus Silvestre

People also ask

Can you convert JSON to Parquet?

You can use Coiled, the cloud-based Dask platform, to easily convert large JSON data into a tabular DataFrame stored as Parquet in a cloud object-store.

Does Parquet support JSON?

It is quite common today to convert incoming JSON data into Parquet format to improve the performance of analytical queries. When JSON data has an arbitrary schema i.e. different records can contain different key-value pairs, it is common to parse such JSON payloads into a map column in Parquet.

Is Parquet same as JSON?

Parquet vs JSON: JSON stores data in key-value format, whereas Parquet stores data by column. So when we need to store configuration, we typically use the JSON file format, while the Parquet file format is useful when we store data in tabular form.


2 Answers

You can achieve what you are looking for with PySpark as follows:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("JsonToParquetPysparkExample") \
    .getOrCreate()

json_df = spark.read.json("C://python/test.json", multiLine=True)
json_df.printSchema()
json_df.write.parquet("C://python/output.parquet")
answered Nov 03 '22 by Felix K Jose


Here's how to convert a JSON file to Apache Parquet format, using Pandas in Python. This is an easy method with a well-known library you may already be familiar with.

Firstly, make sure to install pandas and pyarrow. If you're using Python with Anaconda:

conda install pandas
conda install pyarrow

Then, here is the code:

import pandas as pd
data = pd.read_json(FILEPATH_TO_JSON_FILE)
data.to_parquet(PATH_WHERE_TO_SAVE_PARQUET_FILE)

I hope this helps, please let me know if I can clarify anything.

answered Nov 03 '22 by Shane Halloran