Please consider the following program as a Minimal Reproducible Example (MRE):
import pandas as pd
import pyarrow
from pyarrow import parquet

def foo():
    print(pyarrow.__file__)
    print('version:', pyarrow.cpp_version)
    print('-----------------------------------------------------')
    df = pd.DataFrame({'A': [1, 2, 3], 'B': ['dummy'] * 3})
    print('Original DataFrame:\n', df)
    print('-----------------------------------------------------')
    _table = pyarrow.Table.from_pandas(df)
    parquet.write_table(_table, 'foo')
    _table = parquet.read_table('foo', columns=[])  # passing empty list to columns arg
    df = _table.to_pandas()
    print('After reading from file with columns=[]:\n', df)
    print('-----------------------------------------------------')
    print('Not passing [] to columns parameter')
    _table = parquet.read_table('foo')  # not passing any list
    df = _table.to_pandas()
    print(df)
    print('-----------------------------------------------------')
    x = input('press any key to exit: ')

if __name__ == '__main__':
    foo()
When I run it from the console/IDE, it reads the entire data for columns=[]:
(env) D:\foo>python foo.py
D:\foo\env\lib\site-packages\pyarrow\__init__.py
version: 3.0.0
-----------------------------------------------------
Original DataFrame:
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
After reading from file with columns=[]:
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
Not passing [] to columns parameter
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
press any key to exit:
But when I run it from the executable created using PyInstaller, it reads no data for columns=[]:
E:\foo\dist\foo\pyarrow\__init__.pyc
version: 3.0.0
-----------------------------------------------------
Original DataFrame:
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
After reading from file with columns=[]:
Empty DataFrame
Columns: []
Index: [0, 1, 2]
-----------------------------------------------------
Not passing [] to columns parameter
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
press any key to exit:
As you can see, passing columns=[] gives an empty DataFrame in the executable, but this does not happen when running the Python file directly, and I'm not sure why the same code behaves in two different ways in the same environment.
Looking at the docstring of parquet.read_table in the source code on GitHub:

columns : list
    If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
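As a quick illustration of that prefix rule, here is a small sketch of my own (the file name nested.parquet is just for the example):

import pyarrow as pa
import pyarrow.parquet as pq

# 'a' is a struct column with nested fields 'a.b' and 'a.c'; 'd' is flat.
table = pa.table({'a': [{'b': 1, 'c': 2}], 'd': [0]})
pq.write_table(table, 'nested.parquet')

# The prefix 'a' selects the whole struct, i.e. both 'a.b' and 'a.c'.
print(pq.read_table('nested.parquet', columns=['a']).column_names)  # ['a']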
read_table further calls dataset.read, which calls _dataset.to_table, which returns a call to self.scanner, which in turn returns a call to the static method from_dataset of the Scanner class. Everywhere, None is used as the default value for the columns parameter. If None and [] are converted directly to Boolean in Python, both of them are indeed False; but if [] is checked against None, the result is False. Nowhere is it specified whether columns=[] should fetch all the columns (because it evaluates to False as a Boolean) or read no columns at all (since the list is empty).
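The ambiguity comes down to which idiom a given code path uses to test the parameter; here is a minimal sketch of my own (the two read functions are hypothetical, not pyarrow's) showing how the idioms diverge for []:

def read(columns=None):
    # Truthiness check: [] and None are both falsy, so both mean "all columns".
    if not columns:
        return 'all columns'
    return f'only {columns}'

def read_strict(columns=None):
    # Identity check: only None means "all columns"; [] means "no columns".
    if columns is None:
        return 'all columns'
    if not columns:
        return 'no columns'
    return f'only {columns}'

print(read([]), '|', read_strict([]))      # all columns | no columns
print(read(None), '|', read_strict(None))  # all columns | all columns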
But why is the behavior different when running from the command line/IDE than when running the executable created using PyInstaller, for the same version of pyarrow?
The environment I'm on:
Here is the spec file for your reference if you want to give it a try (you need to change the pathex parameter):

foo.spec
# -*- mode: python ; coding: utf-8 -*-
import sys ; sys.setrecursionlimit(sys.getrecursionlimit() * 5)

block_cipher = None

a = Analysis(['foo.py'],
             pathex=['D:\\foo'],
             binaries=[],
             datas=[],
             hiddenimports=[],
             hookspath=[],
             runtime_hooks=[],
             excludes=[],
             win_no_prefer_redirects=False,
             win_private_assemblies=False,
             cipher=block_cipher,
             noarchive=False)
pyz = PYZ(a.pure, a.zipped_data,
          cipher=block_cipher)
exe = EXE(pyz,
          a.scripts,
          [],
          exclude_binaries=True,
          name='foo',
          debug=False,
          bootloader_ignore_signals=False,
          strip=False,
          upx=True,
          console=True)
coll = COLLECT(exe,
               a.binaries,
               a.zipfiles,
               a.datas,
               strip=False,
               upx=True,
               upx_exclude=[],
               name='foo')
Credit to @U12-Forward for assisting me in debugging the issue.
After a bit of research and debugging, and after exploring the library's program files, I found that pyarrow.parquet uses _ParquetDatasetV2 and ParquetDataset, which are essentially two different classes that read the data from a parquet file; _ParquetDatasetV2 is the one used when legacy mode is off (use_legacy_dataset=False). Even though these classes are defined in the pyarrow.parquet module, _ParquetDatasetV2 depends on the pyarrow.dataset module, which was missing in the executable created using PyInstaller.
When I added pyarrow.dataset as a hidden import and created the build, the exe raised ModuleNotFoundError on execution due to several missing dependencies used by the dataset module. To resolve it, I added all the pyarrow .py modules from the environment to the hidden imports and created the build again, and finally it worked. By "it worked" I mean that I was able to observe the same behavior in both environments.
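This also explains why the fallback is silent: as far as I can tell, pyarrow 3.0.0's read_table catches the ImportError and quietly switches to its legacy read path when pyarrow.dataset is unavailable. A quick diagnostic of my own (the printed messages are mine, not pyarrow's) to check which path a given build can take:

import pyarrow

print(pyarrow.__file__)
try:
    import pyarrow.dataset  # required by the new (_ParquetDatasetV2) code path
    print('pyarrow.dataset importable -> new dataset-backed reader available')
except ImportError as exc:
    # In the frozen exe this import failed, so pyarrow.parquet silently
    # fell back to its legacy read path, which treats columns=[] differently.
    print('pyarrow.dataset missing -> legacy fallback:', exc)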
The spec file looks like this after modification:
# -*- mode: python ; coding: utf-8 -*-
import sys ; sys.setrecursionlimit(sys.getrecursionlimit() * 5)

block_cipher = None

a = Analysis(['foo.py'],
             pathex=['D:\\foo'],
             binaries=[],
             datas=[],
             hiddenimports=['pyarrow.benchmark', 'pyarrow.cffi', 'pyarrow.compat',
                            'pyarrow.compute', 'pyarrow.csv', 'pyarrow.cuda',
                            'pyarrow.dataset', 'pyarrow.feather', 'pyarrow.filesystem',
                            'pyarrow.flight', 'pyarrow.fs', 'pyarrow.hdfs',
                            'pyarrow.ipc', 'pyarrow.json', 'pyarrow.jvm',
                            'pyarrow.orc', 'pyarrow.pandas_compat', 'pyarrow.parquet',
                            'pyarrow.plasma', 'pyarrow.serialization', 'pyarrow.types',
                            'pyarrow.util', 'pyarrow._generated_version',
                            'pyarrow.__init__'],
             hookspath=[],
             runtime_hooks=[],
             excludes=[],
             win_no_prefer_redirects=False,
             win_private_assemblies=False,
             cipher=block_cipher,
             noarchive=False)
pyz = PYZ(a.pure, a.zipped_data,
          cipher=block_cipher)
exe = EXE(pyz,
          a.scripts,
          [],
          exclude_binaries=True,
          name='foo',
          debug=False,
          bootloader_ignore_signals=False,
          strip=False,
          upx=True,
          console=True)
coll = COLLECT(exe,
               a.binaries,
               a.zipfiles,
               a.datas,
               strip=False,
               upx=True,
               upx_exclude=[],
               name='foo')
Also, to create the build, I included the path of the virtual environment using the --paths argument:

pyinstaller --paths D:\foo\env\Lib\site-packages foo.spec
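As an aside, instead of hand-listing every pyarrow submodule, a PyInstaller hook can collect them automatically; a minimal sketch of my own, using PyInstaller's collect_submodules helper (the file must be named hook-pyarrow.py and its directory added to hookspath in the spec):

# hook-pyarrow.py
from PyInstaller.utils.hooks import collect_submodules

# Pull in every pyarrow submodule, including pyarrow.dataset,
# so the frozen build keeps the new dataset-backed reader.
hiddenimports = collect_submodules('pyarrow')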
Here is the execution after following the above-mentioned steps:
E:\foo\dist\foo\pyarrow\__init__.pyc
version: 3.0.0
-----------------------------------------------------
Original DataFrame:
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
After reading from file with columns=[]:
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
Not passing [] to columns parameter
A B
0 1 dummy
1 2 dummy
2 3 dummy
-----------------------------------------------------
press any key to exit:
It is true that the desired behavior for columns=[] is nowhere documented, but looking at ARROW-13436, opened in pyarrow by @Pace, it seems that the intended behavior for columns=[] is to read no data columns at all. That is not an official confirmation, though, so it is possibly a bug in pyarrow 3.0.0 itself.
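Until that is pinned down, it is safer not to rely on truthiness and to pass the intent explicitly; a small sketch of my own (the commented counts assume the no-columns behavior, e.g. pyarrow >= 4.0, and the 'foo' file written by the MRE above):

import pyarrow.parquet as pq

all_cols = pq.read_table('foo')             # columns=None -> every column
no_cols = pq.read_table('foo', columns=[])  # columns=[]   -> no data columns
print(all_cols.num_columns)  # 2 ('A' and 'B')
print(no_cols.num_columns)   # 0, while no_cols.num_rows is still 3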
The reason is that when he ran it in the IDE, he ran it inside an environment, whereas when he ran it on the command line, it was outside the environment, and the DLL files were different. So the fix was to copy the pyarrow package to outside the environment, and it gave the same result.
The pyarrow documentation for pyarrow.parquet.read_table is probably unclear. I've raised ARROW-13436 to clarify this.
From some testing it seems that the behavior changed at some point from no columns to all columns and then changed back (in 4.0) to no columns. I believe no columns is the correct behavior.
So my guess is that your executable is using a different version of pyarrow than your IDE. You can usually confirm this by running...
import pyarrow
print(pyarrow.__file__)
print(pyarrow.cpp_version)
...in both environments and then comparing the results.