
Error while converting pandas dataframe to polars dataframe (pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object)

I am converting a pandas DataFrame to a Polars DataFrame, but pyarrow throws an error.

My code:

import polars as pl
import pandas as pd

if __name__ == "__main__":

    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
    df = pd.concat(
        [pd.read_excel(excelfile, sheet_name=x, header=0) for x in sheetnames],
        axis=0,
    )

    df_pl = pl.from_pandas(df)

Error:

File "pyarrow\array.pxi", line 312, in pyarrow.lib.array
File "pyarrow\array.pxi", line 83, in pyarrow.lib._ndarray_to_array
File "pyarrow\error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object

Casting the pandas DataFrame to str makes the error go away, but I don't want to change the dtypes. Is this a bug in pyarrow, or am I missing something?

Asked by Rahil, May 24 '26 14:05


2 Answers

Edit: Polars 0.13.42 and later

Polars now has a read_excel function that will correctly handle this situation. read_excel is now the preferred way to read Excel files into Polars.

Note: to use read_excel, you will need to install xlsx2csv (which can be installed with pip).

Polars: prior to 0.13.42

I can replicate this result. It is due to a column in the original Excel file that contains both text and numbers.

For example, create a new Excel file with one column in which you type both numbers and text, save it, and run your code on that file. I get the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/convert.py", line 299, in from_pandas
    return DataFrame._from_pandas(df, rechunk=rechunk, nan_to_none=nan_to_none)
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/frame.py", line 454, in _from_pandas
    pandas_to_pydf(
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 485, in pandas_to_pydf
    arrow_dict = {
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 486, in <dictcomp>
    str(col): _pandas_series_to_arrow(
  File "/home/xxx/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/construction.py", line 237, in _pandas_series_to_arrow
    return pa.array(values, pa.large_utf8(), from_pandas=nan_to_none)
  File "pyarrow/array.pxi", line 312, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
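
To check whether your own frame has such a column, here is a minimal sketch. The helper name find_mixed_columns and the sample data are made up for illustration; they are not part of pandas or Polars. It builds two small frames standing in for two sheets, concatenates them the way the question does, and lists the object-dtype columns that hold more than one Python type.

```python
import pandas as pd


def find_mixed_columns(df: pd.DataFrame) -> list[str]:
    """Return names of object-dtype columns that hold more than one Python type.

    Hypothetical helper for illustration only.
    """
    mixed = []
    for name in df.columns:
        col = df[name]
        if col.dtype != object:
            continue  # numeric, datetime, etc. columns have a single type
        # Ignore missing values, then collect the distinct value types.
        kinds = {type(value).__name__ for value in col.dropna()}
        if len(kinds) > 1:
            mixed.append(str(name))
    return mixed


# Two made-up "sheets": the same column is numeric in one and text in the other.
sheet_a = pd.DataFrame({"code": [1, 2], "qty": [10, 20]})
sheet_b = pd.DataFrame({"code": ["abc"], "qty": [30]})

df = pd.concat([sheet_a, sheet_b], axis=0)

print(df.dtypes)               # "code" ends up as object after the concat
print(find_mixed_columns(df))  # columns mixing several Python types
```

A column reported here is a likely candidate for the ArrowTypeError, since the conversion shown in the traceback tries to build a string array from it.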

There are several lengthy discussions on this issue, such as these:

  • to_parquet can't handle mixed type columns #21228

  • pyarrow.lib.ArrowTypeError: "Expected a string or bytes object, got a 'int' object" #349

This particular comment might be relevant, as you are concatenating the results of parsing multiple sheets in an Excel file. This may lead to conflicting dtypes for a column: https://github.com/pandas-dev/pandas/issues/21228#issuecomment-419175116

How to approach this depends on your data and its use, so I can't recommend a blanket solution (e.g., fixing your source Excel file or changing the dtype to str).
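
If changing the dtype is acceptable for the offending columns only, one option is to convert just those and leave every other column alone. This is a rough sketch under that assumption, not a recommendation for every dataset; stringify_mixed_columns is a hypothetical helper name and the sample frames are invented.

```python
import pandas as pd


def stringify_mixed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where object columns with mixed value types become strings.

    Missing values are left as they are. Hypothetical helper for illustration.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue
        if len({type(value) for value in col.dropna()}) > 1:
            # Convert each present value to str, keep NaN/None untouched.
            converted = col.map(lambda v: v if pd.isna(v) else str(v))
            # Assign by position so a duplicated index (common after
            # pd.concat) cannot cause alignment problems.
            out[name] = converted.to_numpy()
    return out


# Made-up stand-ins for two sheets with a conflicting "code" column.
sheet_a = pd.DataFrame({"code": [1, 2], "qty": [10, 20]})
sheet_b = pd.DataFrame({"code": ["abc"], "qty": [30]})
df = pd.concat([sheet_a, sheet_b], axis=0)

fixed = stringify_mixed_columns(df)
print(fixed["code"].tolist())
print(fixed.dtypes)
```

The numeric qty column keeps its dtype, and the original frame is not modified. The result should then be easier to hand to pl.from_pandas, because the mixed column now holds only strings and missing values.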

I solved my problem by saving the pandas DataFrame to a CSV file and then reading that file with Polars.

import os
import polars as pl
import pandas as pd

if __name__ == "__main__":

    with open(r"test.xlsx", "rb") as f:
        excelfile = f.read()
    excelfile = pd.ExcelFile(excelfile)
    sheetnames = excelfile.sheet_names
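    df = pd.concat(
        [pd.read_excel(excelfile, sheet_name=x, header=0) for x in sheetnames],
        axis=0,
    )
    df.to_csv("temp.csv", index=False)
    # read_csv loads the data eagerly; scan_csv is lazy and would still
    # need temp.csv to exist when the query is later collected.
    df_pl = pl.read_csv("temp.csv")
    os.remove("temp.csv")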

Answered by Rahil, May 26 '26 04:05


