
How to read an ORC file stored locally in Python Pandas?

Can I think of an ORC file as similar to a CSV file with column headings and row labels containing data? If so, can I somehow read it into a simple pandas dataframe? I am not that familiar with tools like Hadoop or Spark, but is it necessary to understand them just to see the contents of a local ORC file in Python?

The filename is someFile.snappy.orc

I can see online that spark.read.orc('someFile.snappy.orc') works, but even after import pyspark, it throws an error.

Della asked Oct 19 '18 09:10


3 Answers

I haven't been able to find any great options; there are a few dead projects that try to wrap the Java reader. However, pyarrow has an ORC reader that doesn't require pyspark. It's a bit limited, but it works.

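import pandas as pd
import pyarrow.orc as orc

filename = 'someFile.snappy.orc'

# ORC is a binary format, so the file must be opened in binary mode
with open(filename, 'rb') as file:
    data = orc.ORCFile(file)
    df = data.read().to_pandas()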
Rafal Janik answered Oct 23 '22 10:10


In case import pyarrow.orc as orc does not work (it did not work for me on Windows 10), you can read the file into a Spark DataFrame and then convert it to a pandas DataFrame:

import findspark

# Call findspark.init() before importing pyspark so it can add Spark to sys.path
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_spark = spark.read.orc('example.orc')
df_pandas = df_spark.toPandas()
Duy Tran answered Oct 23 '22 08:10


Starting with pandas 1.0.0, there is a built-in function for this, pandas.read_orc.

https://pandas.pydata.org/docs/reference/api/pandas.read_orc.html

import pandas as pd
import pyarrow.orc  # not used directly; pandas needs pyarrow installed to read ORC

df = pd.read_orc('/tmp/your_df.orc')

Be sure to read the warning about dependencies, since this function might not work on Windows: https://pandas.pydata.org/docs/getting_started/install.html#install-warn-orc

If you want to use read_orc(), it is highly recommended to install pyarrow using conda.

Gabe answered Oct 23 '22 08:10