I'm wondering if there is a simpler, memory efficient way to select a subset of rows and columns from a pandas DataFrame.
For instance, given this dataframe:
df = DataFrame(np.random.rand(4,5), columns = list('abcde')) print df a b c d e 0 0.945686 0.000710 0.909158 0.892892 0.326670 1 0.919359 0.667057 0.462478 0.008204 0.473096 2 0.976163 0.621712 0.208423 0.980471 0.048334 3 0.459039 0.788318 0.309892 0.100539 0.753992
I want only those rows in which the value for column 'c' is greater than 0.5, but I only need columns 'b' and 'e' for those rows.
This is the method that I've come up with - perhaps there is a better "pandas" way?
locs = [df.columns.get_loc(_) for _ in ['a', 'd']] print df[df.c > 0.5][locs] a d 0 0.945686 0.892892
My final goal is to convert the result to a numpy array to pass into an sklearn regression algorithm, so I will use the code above like this:
training_set = array(df[df.c > 0.5][locs])
... and that peeves me since I end up with a huge array copy in memory. Perhaps there's a better way for that too?
You can convert select columns of a dataframe into an numpy array using the to_numpy() method by passing the column subset of the dataframe.
to_numpy() – Convert dataframe to Numpy array. Pandas DataFrame is two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). This data structure can be converted to NumPy ndarray with the help of Dataframe. to_numpy() method.
Pandas DataFrame: transpose() function The transpose() function is used to transpose index and columns. Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa. If True, the underlying data is copied. Otherwise (default), no copy is made if possible.
DataFrames and Series in PandasSeries are similar to one-dimensional NumPy arrays, with a single dtype, although with an additional index (list of row labels). DataFrames are an ordered sequence of Series, sharing the same index, with labeled columns.
Use its value directly:
In [79]: df[df.c > 0.5][['b', 'e']].values Out[79]: array([[ 0.98836259, 0.82403141], [ 0.337358 , 0.02054435], [ 0.29271728, 0.37813099], [ 0.70033513, 0.69919695]])
Perhaps something like this for the first problem, you can simply access the columns by their names:
>>> df = pd.DataFrame(np.random.rand(4,5), columns = list('abcde')) >>> df[df['c']>.5][['b','e']] b e 1 0.071146 0.132145 2 0.495152 0.420219
For the second problem:
>>> df[df['c']>.5][['b','e']].values array([[ 0.07114556, 0.13214495], [ 0.49515157, 0.42021946]])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With