I am using the script below to convert a JSON file to Parquet format with the pandas library. However, the following error occurs: AttributeError: 'DataFrame' object has no attribute 'schema'. I am still new to Python.
Here's the original json file I'm using: [ { "a": "01", "b": "teste01" }, { "a": "02", "b": "teste02" } ]
What am I doing wrong?
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
df = pd.read_json('C:/python/json_teste')
pq = pa.parquet.write_table(df, 'C:/python/parquet_teste')
Error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-1b4ced833098> in <module>
----> 1 pq = pa.parquet.write_table(df, 'C:/python/parquet_teste')
C:\Anaconda\lib\site-packages\pyarrow\parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, write_statistics, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, data_page_size, flavor, filesystem, **kwargs)
1256 try:
1257 with ParquetWriter(
-> 1258 where, table.schema,
1259 filesystem=filesystem,
1260 version=version,
C:\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
-> 5067 return object.__getattribute__(self, name)
5068
5069 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'schema'
Printed file:
#print
print(df)
   a        b
0  1  teste01
1  2  teste02
#following columns
df.columns
Index(['a', 'b'], dtype='object')
#following types
df.dtypes
a     int64
b    object
dtype: object
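The error happens because pyarrow.parquet.write_table expects a pyarrow Table, not a pandas DataFrame: internally it reads table.schema, and a DataFrame has no schema attribute, hence the AttributeError. (Note also that the original script rebinds the name pq to write_table's return value, shadowing the pyarrow.parquet import.) A minimal sketch of the fix, converting the DataFrame first, using the paths from the question:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_json('C:/python/json_teste')
# Convert the DataFrame into a pyarrow Table, which does carry a schema
table = pa.Table.from_pandas(df)
pq.write_table(table, 'C:/python/parquet_teste')
Note that read_json inferred column a as int64, dropping the leading zeros; if you need them kept as strings, pass dtype={'a': str} to read_json.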
You can use Coiled, the cloud-based Dask platform, to convert large JSON datasets into a tabular DataFrame stored as Parquet in a cloud object store.
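A rough sketch of that workflow with plain Dask (the file pattern and bucket path below are hypothetical; on Coiled you would attach a cloud cluster first):
import dask.dataframe as dd

# Read many newline-delimited JSON files in parallel (hypothetical pattern)
ddf = dd.read_json("records-*.json", orient="records", lines=True)
# Write the result out as a directory of Parquet part files (hypothetical bucket)
ddf.to_parquet("s3://my-bucket/converted-parquet/")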
It is quite common today to convert incoming JSON data into Parquet format to improve the performance of analytical queries. When JSON data has an arbitrary schema, i.e., when different records can contain different key-value pairs, it is common to parse such JSON payloads into a map column in Parquet, as sketched below.
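A minimal sketch of that pattern with pyarrow (the records and the payload column name here are made up for illustration):
import pyarrow as pa
import pyarrow.parquet as pq

# Records with differing keys, stored together as one map<string, string> column
records = [{"a": "01", "b": "teste01"}, {"b": "teste02", "c": "extra"}]
pairs = [list(r.items()) for r in records]  # each record as (key, value) tuples
arr = pa.array(pairs, type=pa.map_(pa.string(), pa.string()))
table = pa.table({"payload": arr})
pq.write_table(table, "payload_map.parquet")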
Parquet vs. JSON: JSON stores data as key-value pairs, while Parquet stores data by column. So JSON is the usual choice for storing configuration, whereas the Parquet format is useful when storing data in tabular form for analytics.
You can achieve what you are looking for with PySpark as follows:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("JsonToParquetPysparkExample") \
    .getOrCreate()

# multiLine=True is needed because the file holds one JSON array, not JSON lines
json_df = spark.read.json("C://python/test.json", multiLine=True)
json_df.printSchema()
json_df.write.parquet("C://python/output.parquet")
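One thing to be aware of: write.parquet produces a directory containing one or more part files plus metadata, not a single Parquet file, which is Spark's normal output layout.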
Here's how to convert a JSON file to Apache Parquet format, using Pandas in Python. This is an easy method with a well-known library you may already be familiar with.
First, make sure to install pandas and pyarrow. If you're using Python with Anaconda:
conda install pandas
conda install pyarrow
Then, here is the code:
import pandas as pd
data = pd.read_json(FILEPATH_TO_JSON_FILE)
data.to_parquet(PATH_WHERE_TO_SAVE_PARQUET_FILE)
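To confirm the round trip, you can read the file back as an optional sanity check (reading Parquet with pandas also relies on pyarrow or fastparquet):
# Optional: read the Parquet file back to verify the conversion
data_check = pd.read_parquet(PATH_WHERE_TO_SAVE_PARQUET_FILE)
print(data_check)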
I hope this helps, please let me know if I can clarify anything.