
How to use pd.cut() across all columns of a DataFrame?

Tags:

python

pandas

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.rand(10, 4))
>>> pd.cut(df, [0, 0.5, 1])

ValueError: Input array must be 1 dimensional

How can I get pd.cut() to work across all columns of a data frame?

HappyPy asked Apr 29 '19 17:04

2 Answers

Use DataFrame.apply:

df.apply(pd.cut, bins=[0,0.5,1])

By default (axis=0), apply passes each column to pd.cut as a 1-dimensional Series. Pass axis=1 if you want each row to be binned separately instead.
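
As a minimal sketch of the apply approach: the seeded random data and the "low"/"high" labels below are assumptions made for illustration, not part of the original answer. Extra keyword arguments given to apply are forwarded to pd.cut.

```python
import numpy as np
import pandas as pd

# Illustrative data: 10 rows, 4 columns of floats in [0, 1)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 4)))

bins = [0, 0.5, 1]

# Each column is handed to pd.cut as a 1-D Series, which avoids the
# "Input array must be 1 dimensional" error
binned = df.apply(pd.cut, bins=bins)

# Keyword arguments such as labels are passed through to pd.cut
# ("low" and "high" are hypothetical label names)
labelled = df.apply(pd.cut, bins=bins, labels=["low", "high"])

print(binned.head())
print(labelled.head())
```

The result has the same shape as df, with one binned column per original column.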

rafaelc answered Oct 20 '22 19:10


If you don't mind a slightly different type of labeling, numpy.digitize provides a vectorized n-d solution.


np.digitize(df, bins=[0, 0.5, 1.0])

array([[2, 2, 2, 2],
       [1, 2, 2, 2],
       [1, 1, 2, 1],
       [2, 1, 2, 1],
       [2, 1, 2, 1],
       [2, 2, 2, 2],
       [1, 2, 1, 1],
       [2, 1, 2, 2],
       [2, 2, 1, 1],
       [2, 1, 2, 1]], dtype=int64)

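With the default right=False, label 1 corresponds to the bin [0, 0.5) and label 2 to [0.5, 1.0). Values below 0 would get label 0, and values of 1.0 or above would get label 3. Note that pd.cut uses right-closed intervals such as (0, 0.5] by default, so a value lying exactly on a bin edge is assigned differently by the two functions unless you pass right=True to np.digitize.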
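
To make the edge behaviour concrete, here is a small sketch on a hand-picked 2x2 frame (the values are chosen for illustration only). It compares the default np.digitize codes with right=True, which follows the same right-closed convention as pd.cut, and wraps the codes back into a DataFrame.

```python
import numpy as np
import pandas as pd

# Hand-picked values; 0.5 sits exactly on a bin edge
df = pd.DataFrame([[0.1, 0.5], [0.7, 0.99]])
bins = [0, 0.5, 1.0]

# Default right=False: bins[i-1] <= x < bins[i]
codes = np.digitize(df.to_numpy(), bins=bins)
# → [[1, 2], [2, 2]]   (0.5 falls in the second bin)

# right=True: bins[i-1] < x <= bins[i], the same convention as pd.cut
codes_right = np.digitize(df.to_numpy(), bins=bins, right=True)
# → [[1, 1], [2, 2]]   (0.5 falls in the first bin)

# Wrap the integer codes back into a DataFrame with the original labels
coded = pd.DataFrame(codes_right, index=df.index, columns=df.columns)
print(coded)
```

The codes from np.digitize start at 1 for the first bin, whereas pd.cut with labels=False starts at 0, so subtract 1 if the two need to line up.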


Performance

df = pd.DataFrame(np.random.rand(1000, 1000))

%timeit pd.DataFrame(np.digitize(df, bins=[0, 0.5, 1.0]), columns=df.columns)
13.2 ms ± 36.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.apply(pd.cut, bins=[0, 0.5, 1])
3.11 s ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pd.cut(df.stack(),[0,0.5,1]).unstack()
1.48 s ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
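
The third timing uses a stack/unstack round trip that is not spelled out above. A rough sketch of the idea, on small seeded data assumed for illustration: reshape the frame into one long 1-D Series, bin it once, then reshape back.

```python
import numpy as np
import pandas as pd

# Small illustrative frame of floats in [0, 1)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 3)))
bins = [0, 0.5, 1]

# stack() turns the frame into a single Series indexed by (row, column),
# which satisfies pd.cut's 1-dimensional requirement
long = df.stack()
cut_long = pd.cut(long, bins)

# unstack() restores the original rows-by-columns layout
binned = cut_long.unstack()
print(binned)
```

This calls pd.cut only once instead of once per column, which may explain why it was faster than apply in the timings above, though still well behind np.digitize.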

user3483203 answered Oct 20 '22 17:10