Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to filter some data by read_parquet() in pandas?

Tags:

pandas

parquet

i want to reduce loading memory usage by filter some gid

reg_df = pd.read_parquet('/data/2010r.pq',
                             columns=['timestamp', 'gid', 'uid', 'flag'])

But in docs kwargs havn't been shown . For example:

gid=[100,101,102,103,104,105]
gid_i_want_load = [100,103,105]

so,how can i only load gid that i want to calculate?

like image 658
Cherrymelon Avatar asked Oct 16 '22 16:10

Cherrymelon


1 Answers

The introduction of the **kwargs to the pandas library is documented here. It looks like the original intent was to actually pass columns into the request to limit IO volumn. The contributors took the next step and added a general pass for **kwargs.

For pandas/io/parquet.py the following is for read_parquet:

def read_parquet(path, engine='auto', columns=None, **kwargs):
    """
    Load a parquet object from the file path, returning a DataFrame.
    .. versionadded 0.21.0
    Parameters
    ----------
    path : string
        File path
    columns: list, default=None
        If not None, only these columns will be read from the file.
        .. versionadded 0.21.1
    engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
        Parquet library to use. If 'auto', then the option
        ``io.parquet.engine`` is used. The default ``io.parquet.engine``
        behavior is to try 'pyarrow', falling back to 'fastparquet' if
        'pyarrow' is unavailable.
    kwargs are passed to the engine
    Returns
    -------
    DataFrame
    """

    impl = get_engine(engine)
    return impl.read(path, columns=columns, **kwargs)

For pandas/io/parquet.py the following is for read on the pyarrow engine:

def read(self, path, columns=None, **kwargs):
    path, _, _, should_close = get_filepath_or_buffer(path)
    if self._pyarrow_lt_070:
        result = self.api.parquet.read_pandas(path, columns=columns,
                                              **kwargs).to_pandas()
    else:
        kwargs['use_pandas_metadata'] = True    #<-- only param for kwargs...
        result = self.api.parquet.read_table(path, columns=columns,
                                             **kwargs).to_pandas()
    if should_close:
        try:
            path.close()
        except:  # noqa: flake8
            pass

    return result

for pyarrow/parquet.py the following is for read_pandas:

def read_pandas(self, **kwargs):
    """
    Read dataset including pandas metadata, if any. Other arguments passed
    through to ParquetDataset.read, see docstring for further details

    Returns
    -------
    pyarrow.Table
        Content of the file as a table (of columns)
    """
    return self.read(use_pandas_metadata=True, **kwargs)  #<-- params being passed

For pyarrow/parquet.py the following is for read:

def read(self, columns=None, nthreads=1, use_pandas_metadata=False):  #<-- kwargs param at pyarrow
        """
        Read a Table from Parquet format

        Parameters
        ----------
        columns: list
            If not None, only these columns will be read from the file. A
            column name may be a prefix of a nested field, e.g. 'a' will select
            'a.b', 'a.c', and 'a.d.e'
        nthreads : int, default 1
            Number of columns to read in parallel. If > 1, requires that the
            underlying file source is threadsafe
        use_pandas_metadata : boolean, default False
            If True and file has custom pandas schema metadata, ensure that
            index columns are also loaded

        Returns
        -------
        pyarrow.table.Table
            Content of the file as a table (of columns)
        """
        column_indices = self._get_column_indices(
            columns, use_pandas_metadata=use_pandas_metadata)
        return self.reader.read_all(column_indices=column_indices,
                                    nthreads=nthreads)

So, if I understand correctly maybe you can access nthreads and use_pandas_metadata - but then again, neither is explicitly assigned (??). I haven't tested it - but it maybe a start.

like image 183
Bill Armstrong Avatar answered Oct 21 '22 09:10

Bill Armstrong