I want to reduce memory usage when loading a Parquet file by filtering on some gid values.
reg_df = pd.read_parquet('/data/2010r.pq',
                         columns=['timestamp', 'gid', 'uid', 'flag'])
But the docs don't show which kwargs are accepted. For example:
gid=[100,101,102,103,104,105]
gid_i_want_load = [100,103,105]
So, how can I load only the gid values that I want to calculate on?
The introduction of **kwargs to the pandas library is documented here. It looks like the original intent was to pass columns into the request to limit IO volume. The contributors then took the next step and added a general pass-through for **kwargs.
In pandas/io/parquet.py, the following is read_parquet:
def read_parquet(path, engine='auto', columns=None, **kwargs):
    """
    Load a parquet object from the file path, returning a DataFrame.
    .. versionadded 0.21.0
    Parameters
    ----------
    path : string
        File path
    columns: list, default=None
        If not None, only these columns will be read from the file.
        .. versionadded 0.21.1
    engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
        Parquet library to use. If 'auto', then the option
        ``io.parquet.engine`` is used. The default ``io.parquet.engine``
        behavior is to try 'pyarrow', falling back to 'fastparquet' if
        'pyarrow' is unavailable.
    kwargs are passed to the engine
    Returns
    -------
    DataFrame
    """
    impl = get_engine(engine)
    return impl.read(path, columns=columns, **kwargs)
In pandas/io/parquet.py, the following is read on the pyarrow engine:
def read(self, path, columns=None, **kwargs):
    path, _, _, should_close = get_filepath_or_buffer(path)
    if self._pyarrow_lt_070:
        result = self.api.parquet.read_pandas(path, columns=columns,
                                              **kwargs).to_pandas()
    else:
        kwargs['use_pandas_metadata'] = True  # <-- only param for kwargs...
        result = self.api.parquet.read_table(path, columns=columns,
                                             **kwargs).to_pandas()
    if should_close:
        try:
            path.close()
        except:  # noqa: flake8
            pass
    return result
In pyarrow/parquet.py, the following is read_pandas:
def read_pandas(self, **kwargs):
    """
    Read dataset including pandas metadata, if any. Other arguments passed
    through to ParquetDataset.read, see docstring for further details
    Returns
    -------
    pyarrow.Table
        Content of the file as a table (of columns)
    """
    return self.read(use_pandas_metadata=True, **kwargs)  # <-- params being passed
In pyarrow/parquet.py, the following is read:
def read(self, columns=None, nthreads=1, use_pandas_metadata=False):  # <-- kwargs param at pyarrow
    """
    Read a Table from Parquet format
    Parameters
    ----------
    columns: list
        If not None, only these columns will be read from the file. A
        column name may be a prefix of a nested field, e.g. 'a' will select
        'a.b', 'a.c', and 'a.d.e'
    nthreads : int, default 1
        Number of columns to read in parallel. If > 1, requires that the
        underlying file source is threadsafe
    use_pandas_metadata : boolean, default False
        If True and file has custom pandas schema metadata, ensure that
        index columns are also loaded
    Returns
    -------
    pyarrow.table.Table
        Content of the file as a table (of columns)
    """
    column_indices = self._get_column_indices(
        columns, use_pandas_metadata=use_pandas_metadata)
    return self.reader.read_all(column_indices=column_indices,
                                nthreads=nthreads)
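Taken together, the functions quoted above form a forwarding chain: each layer hands **kwargs to the next, and the innermost signature decides which names are accepted. The sketch below imitates that chain with hypothetical stand-in functions (read_parquet_like, impl_read and engine_read are made-up names, not pandas or pyarrow APIs); it only illustrates the mechanism and does not reproduce the real implementation.

```python
# Hypothetical stand-ins that mimic the signatures quoted above.
# No Parquet file is read; only keyword forwarding is demonstrated.

def engine_read(columns=None, nthreads=1, use_pandas_metadata=False):
    """Innermost layer: accepts a fixed set of keyword arguments."""
    return {
        "columns": columns,
        "nthreads": nthreads,
        "use_pandas_metadata": use_pandas_metadata,
    }


def impl_read(path, columns=None, **kwargs):
    """Middle layer: sets one option itself and forwards the rest."""
    kwargs["use_pandas_metadata"] = True
    return engine_read(columns=columns, **kwargs)


def read_parquet_like(path, engine="auto", columns=None, **kwargs):
    """Outer layer: passes unknown keywords straight through."""
    return impl_read(path, columns=columns, **kwargs)


# A keyword the innermost function knows about travels all the way down.
ok = read_parquet_like("ignored.pq", columns=["gid"], nthreads=4)

# A keyword it does not know about (here an invented `gid` filter) is
# rejected with a TypeError raised at the innermost call.
try:
    read_parquet_like("ignored.pq", gid=[100, 103, 105])
    rejected = False
except TypeError:
    rejected = True
```

In this sketch only nthreads is usable from the outside, since columns is already a named parameter and use_pandas_metadata is overwritten by the middle layer.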
So, if I understand correctly, you may be able to pass nthreads and use_pandas_metadata through **kwargs, but then again, neither is explicitly assigned (??). I haven't tested it, but it may be a start.