Documentation for Apache's Parquet Java API?

Tags:

parquet

I would like to use Apache's parquet-mr project to read and write Parquet files programmatically in Java. I can't find any documentation on how to use this API, aside from going through the source code and seeing how it is used. Does any such documentation exist?

asked May 02 '17 17:05

Jason Evans

1 Answer

I wrote a blog article about reading Parquet files (http://www.jofre.de/?p=1459) and came up with the following solution, which can even read INT96 fields.

You need the following Maven dependencies:

<dependencies>
  <dependency>
    <groupId>org.apache.parquet</groupId>
    <artifactId>parquet-hadoop</artifactId>
    <version>1.9.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.0</version>
  </dependency>
</dependencies>

The code is essentially:

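import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

public class Main {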

    private static Path path = new Path("file:\\C:\\Users\\file.snappy.parquet");

    private static void printGroup(Group g) {

        int fieldCount = g.getType().getFieldCount();
        for (int field = 0; field < fieldCount; field++) {
            int valueCount = g.getFieldRepetitionCount(field);

            Type fieldType = g.getType().getType(field);
            String fieldName = fieldType.getName();

            for (int index = 0; index < valueCount; index++) {
                if (fieldType.isPrimitive()) {
                    System.out.println(fieldName + " " + g.getValueToString(field, index));
                }
            }
        }

    }

    public static void main(String[] args) throws IllegalArgumentException {

        Configuration conf = new Configuration();

        try {
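            // Read the file footer, which holds the schema and row group metadata.
            ParquetMetadata readFooter = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
            MessageType schema = readFooter.getFileMetaData().getSchema();
            // Open the file for reading row groups one at a time.
            ParquetFileReader r = new ParquetFileReader(conf, path, readFooter);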

            PageReadStore pages = null;
            try {
                while (null != (pages = r.readNextRowGroup())) {
                    final long rows = pages.getRowCount();
                    System.out.println("Number of rows: " + rows);

                    final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
                    final RecordReader<Group> recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
                    for (int i = 0; i < rows; i++) {
                        final Group g = recordReader.read();
                        printGroup(g);

                        // TODO Compare to System.out.println(g);
                    }
                }
            } finally {
                r.close();
            }
        } catch (IOException e) {
            System.out.println("Error reading parquet file.");
            e.printStackTrace();
        }

    }
}
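
The code above prints INT96 fields through getValueToString, which shows the raw bytes rather than a date. If the INT96 column holds a timestamp written in the layout commonly used by Hive, Impala and Spark (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian), the 12 bytes can be decoded with plain JDK classes. This is only a sketch under that assumption; the class name Int96Timestamps and the method toInstant are hypothetical helpers and are not part of parquet-mr.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Hypothetical helper, not part of parquet-mr.
final class Int96Timestamps {

    // Julian day number of the Unix epoch (1970-01-01).
    private static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;
    private static final long SECONDS_PER_DAY = 86_400L;

    private Int96Timestamps() {
    }

    // Assumes the layout: bytes 0-7 = nanoseconds of the day,
    // bytes 8-11 = Julian day number, both little-endian.
    static Instant toInstant(byte[] int96) {
        if (int96 == null || int96.length != 12) {
            throw new IllegalArgumentException("An INT96 value must be exactly 12 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        long julianDay = buf.getInt();
        long epochDay = julianDay - JULIAN_DAY_OF_EPOCH;
        // ofEpochSecond normalizes a nanosecond adjustment larger than one second.
        return Instant.ofEpochSecond(epochDay * SECONDS_PER_DAY, nanosOfDay);
    }

    public static void main(String[] args) {
        // Build a sample value: Julian day 2440589 (1970-01-02), one second into the day.
        byte[] sample = ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(1_000_000_000L)
                .putInt(2_440_589)
                .array();
        System.out.println(toInstant(sample));
    }
}
```

Inside printGroup, the raw bytes for such a field should be obtainable with g.getInt96(field, index).getBytes() and could then be passed to a helper like this one. Whether a given file actually follows this layout depends on the tool that wrote it.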

answered Nov 15 '22 15:11

padmalcom
