>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.rand(10, 4))
>>> pd.cut(df, [0, 0.5, 1])
ValueError: Input array must be 1 dimensional
How can I get pd.cut() to work across all columns of a data frame?
To delete rows or columns from a DataFrame, pandas provides the drop method. To delete one or more columns, pass the column name(s) and set axis=1. Alternatively, pass the names through the columns parameter, which removes the need for axis.
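As a minimal sketch of the two spellings described above (the frame and its column names a, b, c are made up for illustration):

```python
import pandas as pd

# Hypothetical example frame; the column names are illustrative only.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Drop one column by name, stating the axis explicitly.
by_axis = df.drop("b", axis=1)

# Drop several columns through the columns parameter (no axis needed).
by_columns = df.drop(columns=["b", "c"])

# Rows are dropped by index label in the same way.
without_first_row = df.drop(index=0)
```

drop returns a new DataFrame in each case and leaves df unchanged.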
We can use the pandas Series.str.split() method to split strings around a given separator or delimiter, optionally into multiple columns. It is similar to Python's str.split() method but applies to every element of a Series (that is, a whole DataFrame column).
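A short sketch of splitting one string column into two, assuming a made-up Series of names; expand=True is what spreads the pieces across columns:

```python
import pandas as pd

# Hypothetical data: one string per row, separated by a single space.
names = pd.Series(["Ada Lovelace", "Alan Turing"])

# expand=True returns a DataFrame with one column per piece
# instead of a Series of lists.
parts = names.str.split(" ", expand=True)
parts.columns = ["first", "last"]
```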
Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.
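To illustrate the age example, here is a small sketch; the ages, bin edges, and label names are assumptions chosen for the demonstration, not values from the question:

```python
import pandas as pd

# Hypothetical ages to bin.
ages = pd.Series([5, 17, 30, 64, 80])

# Pre-specified bin edges. By default each interval is closed on the
# right, so the bins are (0, 18], (18, 65] and (65, 120].
groups = pd.cut(ages, bins=[0, 18, 65, 120],
                labels=["child", "adult", "senior"])

# Passing an integer instead asks for that many equal-width bins.
equal_width = pd.cut(ages, bins=3)
```

Note that pd.cut takes a single one-dimensional input here, which is why passing a whole DataFrame raises the error shown in the question.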
You can select a column from a pandas DataFrame using its loc property, which locates rows and columns by label. It accepts row labels and column names, so selecting a range of columns by name this way is often called slicing.
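A minimal sketch of label-based selection with loc, using a made-up frame with row labels x, y, z and columns a, b, c:

```python
import pandas as pd

# Hypothetical frame; labels are illustrative only.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

# All rows, one column: returns a Series.
col_b = df.loc[:, "b"]

# All rows, a list of columns: returns a DataFrame.
cols_ab = df.loc[:, ["a", "b"]]

# Label slices include both endpoints, for rows and columns alike.
block = df.loc["x":"y", "a":"b"]
```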
Use apply
df.apply(pd.cut, bins=[0,0.5,1])
You can specify the axis to apply the function to each column (axis=0, the default) or to each row (axis=1).
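A runnable sketch of the apply approach, assuming a seeded random frame of the same shape as in the question (the seed and the variable names are illustrative). Extra keyword arguments to apply are forwarded to pd.cut, so labels=False can be used to get integer bin codes instead of intervals:

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (an assumption; the
# question uses np.random.rand without a seed).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 4)))

# pd.cut is called once per column; bins is forwarded as a keyword.
binned = df.apply(pd.cut, bins=[0, 0.5, 1])

# labels=False is forwarded too and yields 0-based integer bin codes.
codes = df.apply(pd.cut, bins=[0, 0.5, 1], labels=False)
```

Each cell of binned holds the interval that contains the original value, while codes holds 0 for (0, 0.5] and 1 for (0.5, 1].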
If you don't mind a slightly different type of labeling, numpy.digitize provides a vectorized n-d solution.
np.digitize(df, bins=[0, 0.5, 1.0])
array([[2, 2, 2, 2],
       [1, 2, 2, 2],
       [1, 1, 2, 1],
       [2, 1, 2, 1],
       [2, 1, 2, 1],
       [2, 2, 2, 2],
       [1, 2, 1, 1],
       [2, 1, 2, 2],
       [2, 2, 1, 1],
       [2, 1, 2, 1]], dtype=int64)
The label 1 would correspond to 0-0.5, 2 to 0.5-1.0, etc.
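Because np.digitize returns a plain array, the row and column labels are lost. The sketch below wraps the result back into a DataFrame; the seeded data and variable names are assumptions for illustration. It also shows right=True, which makes digitize use right-closed intervals as pd.cut does by default, whereas the default right=False treats each bin as closed on the left:

```python
import numpy as np
import pandas as pd

# Seeded example data (an assumption; any numeric frame would do).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 4)))
bins = [0, 0.5, 1.0]

# Default right=False: label i means bins[i-1] <= x < bins[i].
labels = pd.DataFrame(
    np.digitize(df.to_numpy(), bins=bins),
    index=df.index,
    columns=df.columns,
)

# right=True: label i means bins[i-1] < x <= bins[i].
labels_right = pd.DataFrame(
    np.digitize(df.to_numpy(), bins=bins, right=True),
    index=df.index,
    columns=df.columns,
)
```

Unlike pd.cut, which returns NaN for values outside the bins, digitize labels them 0 (below the first edge) or len(bins) (beyond the last edge).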
Performance
df = pd.DataFrame(np.random.rand(1000, 1000))
%timeit pd.DataFrame(np.digitize(df, bins=[0, 0.5, 1.0]), columns=df.columns)
13.2 ms ± 36.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.apply(pd.cut, bins=[0, 0.5, 1])
3.11 s ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.cut(df.stack(),[0,0.5,1]).unstack()
1.48 s ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
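The last timing uses a third approach: flatten the frame to one dimension with stack, cut once, then restore the shape with unstack. A small sketch on seeded example data (the seed and names are illustrative assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 4)))

# stack() gives a 1-D Series indexed by (row, column), which pd.cut accepts.
stacked = df.stack()

# Cut once over all values, then pivot the column level back out.
binned = pd.cut(stacked, [0, 0.5, 1]).unstack()
```

This calls pd.cut only once, which is consistent with it timing faster than the per-column apply above, though still well behind np.digitize.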