
Python - generate avro schema for csv/xls file

I have an XLS/CSV file that I'm reading into a pandas DataFrame, and I want to generate an Avro schema from that DataFrame.

I'm new to Python as well as pandas.

    data_frame = pd.read_excel(INPUT_PATH)

How can I generate an Avro schema from this DataFrame on the fly? Any help is appreciated.

mythic asked Jun 29 '26 03:06


1 Answer

I found a solution. I extracted the data type of each column in the pandas DataFrame and stored it against the column name.

I mapped those data types to Avro-compatible types (for example, 'object' in pandas -> 'string' in Avro).
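A minimal sketch of these two steps, assuming a small lookup table is enough. The names `PANDAS_TO_AVRO`, `avro_type_for`, and `field_types` are hypothetical, the table covers only a few common dtypes, and anything unmapped falls back to 'string' (so those values would need converting to text before writing):

```python
# Hypothetical lookup from pandas dtype names to Avro primitive types.
PANDAS_TO_AVRO = {
    "object": "string",
    "int64": "long",
    "int32": "int",
    "float64": "double",
    "float32": "float",
    "bool": "boolean",
}

def avro_type_for(dtype_name):
    """Return the Avro type for a pandas dtype name, defaulting to 'string'."""
    return PANDAS_TO_AVRO.get(dtype_name, "string")

def field_types(dtypes):
    """Build {column name: Avro type} from a mapping of column -> dtype."""
    return {str(col): avro_type_for(str(dt)) for col, dt in dtypes.items()}

# Plain dict standing in for data_frame.dtypes so the sketch runs without pandas.
my_dict = field_types({"id": "int64", "name": "object", "price": "float64"})
```

With a real DataFrame the same call would be `field_types(data_frame.dtypes)`, since `dtypes` is a Series whose `items()` yields (column, dtype) pairs.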

I created a template Avro schema, substituted the field names and data types into the 'fields: []' part, and posted it to the registry.

For instance:

    # schemaName is the record name; myDict maps each field name to its Avro type
    schema = {
        "type": "record",
        "name": schemaName,
        "fields": [
            {"name": key, "type": value} for (key, value) in myDict.items()
        ],
    }

The fastavro library can then be used to parse this schema (for example, with fastavro.parse_schema).
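As a rough end-to-end sketch, the schema can be assembled by a small function and then handed to fastavro. The helper names (`build_schema`, `write_avro`), the record name `SalesRecord`, the field types, and the output path are all made up for illustration. fastavro is a third-party package, so it is imported inside the function that needs it:

```python
import json

def build_schema(schema_name, field_types):
    """Assemble an Avro record schema from a {field name: Avro type} dict."""
    return {
        "type": "record",
        "name": schema_name,
        "fields": [{"name": key, "type": value} for key, value in field_types.items()],
    }

def write_avro(schema, records, path):
    """Parse the schema with fastavro and write the records to an Avro file."""
    import fastavro  # third-party: pip install fastavro

    parsed = fastavro.parse_schema(schema)  # raises if the schema is invalid
    with open(path, "wb") as out:
        fastavro.writer(out, parsed, records)

# Hypothetical record name and field types, for illustration only.
schema = build_schema("SalesRecord", {"id": "long", "name": "string", "price": "double"})
schema_json = json.dumps(schema, indent=2)  # JSON text, e.g. for posting to a registry

# Assumed usage with a DataFrame:
# write_avro(schema, data_frame.to_dict("records"), "output.avro")
```

Note that Avro field names may contain only letters, digits, and underscores and must not start with a digit, so spreadsheet headers with spaces would need renaming first.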

mythic answered Jul 01 '26 15:07