Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to show user schema in a Parquet file using DuckDB?

I am trying to use DuckDB to show the user-created schema that I have written into a Parquet file. I can demonstrate in Python (using the code example at Get schema of parquet file in Python) that the schema is as I desire, but cannot seem to find a way in DuckDB to get this information.

Neither of the following queries reports the user-created schema

select * from parquet_schema('FileWithMetadata.parquet') 
select * from parquet_metadata('FileWithMetadata.parquet')

update:

Here is the code snippet that creates the metadata in the Parquet file where a is a Pandas dataframe of daily basin flows for a number of different simulations:

table = pa.Table.from_pandas(a)
my_schema = pa.schema([
    pa.field("Flow", "float", True, metadata={"data":"flow in mm per day"}),
    pa.field("DayIndex", "int64", False, metadata={"data":"index of days"}),
    pa.field("BasinIndex", "string", True, metadata={"data":"flow in mm per day"}),
    pa.field("Simulation", "int64", True, metadata={"data":"simulation number"})
    ],

    metadata={"info":"long format basin flows"})

t2 = table.cast(my_schema)
pq.write_table(t2, 'SHALongWithMetadata1.parquet')

and the code to read it back is:

import pyarrow.parquet as pq
pfile = pq.read_table("C:\Projects\CSDP\PythonCSDP\Parquet\SHALongWithMetadata1.parquet")
print("Column names: {}".format(pfile.column_names))
print("Schema: {}".format(pfile.schema))

and this yields as output :

Column names: ['Flow', 'DayIndex', 'BasinIndex', 'Simulation']
Schema: Flow: float
  -- field metadata --
  data: 'flow in mm per day'
DayIndex: int64 not null
  -- field metadata --
  data: 'index of days'
BasinIndex: string
  -- field metadata --
  data: 'flow in mm per day'
Simulation: int64
  -- field metadata --
  data: 'simulation number'
-- schema metadata --
info: 'long format basin flows'
like image 233
rbmales Avatar asked Sep 11 '25 03:09

rbmales


1 Answers

Use the DESCRIBE keyword:

DESCRIBE
SELECT *
FROM 'FileWithMetadata.parquet';

DESCRIBE TABLE
'FileWithMetadata.parquet';

Table function parquet_metadata repeats the same information for each row group in the Parquet file, so use SELECT DISTINCT to report only one set of unique column names and types:

SELECT DISTINCT
  path_in_schema,
  type 
FROM parquet_metadata('FileWithMetadata.parquet');
like image 103
Derek Mahar Avatar answered Sep 13 '25 15:09

Derek Mahar