I couldn't find any plain-English explanations regarding Apache Parquet files, covering questions such as: What are they? Do I need Hadoop or HDFS to view, create, or store them? How can I create Parquet files? How can I view their contents?
Any help regarding these questions is appreciated.
What is Apache Parquet?
Apache Parquet is a binary file format that stores data in a columnar fashion. Data inside a Parquet file is similar to an RDBMS-style table with columns and rows. But instead of accessing the data one row at a time, you typically access it one column at a time.
Apache Parquet is one of the modern big data storage formats. It has several advantages, some of which are:
- Columnar storage: a query that needs only a few columns can skip reading the rest of the file.
- Efficient compression and encoding: values of the same type are stored next to each other, so they compress very well.
- Broad ecosystem support, well beyond the Java/Hadoop world.
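For instance, here is a minimal sketch of what column-at-a-time access looks like (assuming pandas with a Parquet engine installed, which the EDIT below sets up; the file and column names are placeholders):

import pandas as pd

# Only the requested column is read from disk; the other columns are skipped entirely
df = pd.read_parquet('myfile.parquet', columns=['field1'])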
Do I need Hadoop or HDFS?
No. Parquet files can be stored in any file system, not just HDFS. As mentioned above, it is a file format. So it's just like any other file, with a name and a .parquet extension. What will usually happen in big data environments, though, is that one dataset will be split (or partitioned) into multiple Parquet files for even more efficiency (see the sketch just below).
All Apache big data products support Parquet files by default, which is why it might seem like Parquet can only exist in the Apache ecosystem.
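As a sketch of what such partitioning can look like from Python (assuming pandas with pyarrow, which the EDIT below sets up; all names here are made up for illustration):

import pandas as pd

df = pd.DataFrame({'year': [2023, 2023, 2024], 'value': [1, 2, 3]})

# Writes a directory tree with one sub-folder per distinct 'year',
# each containing the parquet file(s) for that partition
df.to_parquet('my_dataset', partition_cols=['year'])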
How can I create or read Parquet files?
As mentioned, all current Apache big data products such as Hadoop, Hive, and Spark support Parquet files by default. So it's possible to leverage these systems to generate or read Parquet data, but this is far from practical. Imagine that in order to read or create a CSV file you had to install and configure Hadoop/HDFS + Hive. Luckily there are other solutions.
To create your own Parquet files: there are libraries for most major languages (for example, the official Apache Parquet Java library, or pandas in Python as shown in the EDIT below).
To view Parquet file contents: options include simple Windows desktop applications for viewing and querying Parquet files, as well as an Excel add-in that can read, write, and update Parquet data directly from Microsoft Excel.
Are there other methods?
Possibly. But not many exist, and they mostly aren't well documented. This is due to Parquet being a very complicated file format (I could not even find a formal definition). The ones I've listed are the only ones I'm aware of as I'm writing this response.
This is possible now through Apache Arrow, which helps to simplify communication/transfer between different data formats; see my answer here or the official docs in case of Python. Basically, this allows you to quickly read/write Parquet files in a pandas DataFrame-like fashion, giving you the benefit of using notebooks to view and handle such files as if they were regular CSV files.
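For instance, a minimal sketch using the pyarrow package directly (the filename is a placeholder):

import pyarrow.parquet as pq

# Read the whole file into an Arrow Table, then hand it to pandas
table = pq.read_table('myfile.parquet')
df = table.to_pandas()
print(df.head())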
EDIT:
As an example, given a recent version of pandas, make sure pyarrow is installed (e.g. via pip):
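pip install pyarrow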
Then you can simply use pandas to manipulate parquet files:
import pandas as pd
# read
df = pd.read_parquet('myfile.parquet')
# write
df.to_parquet('my_newfile.parquet')
df.head()
In addition to @sal's extensive answer, there is one further question I encountered in this context: how can I access the content of a Parquet file with SQL?
As we are still in the Windows context here, I don't know of many ways to do that. The best results were achieved by using Spark as the SQL engine, with Python as the interface to Spark. However, I assume that the Zeppelin environment works as well; I just haven't tried it out myself yet.
There is a very well done guide by Michael Garlanyk to walk one through the installation of the Spark/Python combination.
Once set up, I'm able to interact with parquets through:
from os import walk
from pyspark.sql import SparkSession

# A SparkSession provides both the reader and spark.sql used below
spark = SparkSession.builder.getOrCreate()

parquetdir = r'C:\PATH\TO\YOUR\PARQUET\FILES'

# Get all parquet files in the top level of the directory.
# There might be easier ways to access single parquets, but I had nested dirs
dirpath, dirnames, filenames = next(walk(parquetdir), (None, [], []))

# For each parquet file, i.e. table in our database, Spark creates a tempview with
# the table name equal to the parquet filename (minus the '.parquet' suffix)
print('New tables available: \n')
for parquet in filenames:
    print(parquet[:-8])
    spark.read.parquet(parquetdir + '\\' + parquet).createOrReplaceTempView(parquet[:-8])
Once your parquets are loaded this way, you can interact with them through the PySpark SQL API, e.g. via:
my_test_query = spark.sql("""
    select
        field1,
        field2
    from parquetfilename1
    where
        field1 = 'something'
""")
my_test_query.show()
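If you prefer to continue in pandas, the result of such a query can be pulled across as well; note that toPandas() collects all result rows into the driver's memory, so it only suits small results:

# Convert the (small) Spark result into a regular pandas DataFrame
pdf = my_test_query.toPandas()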