
Pandas Equivalent of R's which()

Variations of this question have been asked before, but I'm still having trouble understanding how to slice a Python series or pandas DataFrame based on conditions that I set.

In R, what I'm trying to do is:

df[which(df[,colnumber] > somenumberIchoose),]

The which() function finds the indices of the rows whose entry in that column is greater than somenumberIchoose and returns them as a vector. I then slice the dataframe with these row indices to pick out the rows I want to look at.

Is there an equivalent way to do this in Python? I've seen references to enumerate, which I don't fully understand after reading the documentation. My current attempt at getting the row indices looks like this:

indexfuture = [ x.index(), x in enumerate(df['colname']) if x > yesterday]  

However, I keep getting an invalid syntax error. I can hack a workaround by looping over the values with a for loop and doing the search manually, but that seems extremely non-pythonic and inefficient.

What exactly does enumerate() do? What is the pythonic way of finding indices of values in a vector that fulfill desired parameters?

Note: I'm using Pandas for the dataframes

ding asked Aug 01 '14 18:08





2 Answers

I may not have fully understood the question, but the answer looks simpler than you think.

Using a pandas DataFrame:

df['colname'] > somenumberIchoose

returns a pandas Series of True/False values that keeps the original index of the DataFrame.

Then you can use that boolean series on the original DataFrame and get the subset you are looking for:

df[df['colname'] > somenumberIchoose]

should be enough.
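As a minimal sketch of the two steps above: the DataFrame contents, the extra column name 'other', and the threshold value are made up for illustration, and only 'colname' and somenumberIchoose come from the question.

```python
import pandas as pd

# Hypothetical data; only 'colname' and somenumberIchoose mirror the question.
df = pd.DataFrame(
    {"colname": [5, 12, 7, 20], "other": ["a", "b", "c", "d"]},
    index=[10, 11, 12, 13],
)
somenumberIchoose = 6

# Step 1: the comparison gives a boolean Series aligned to df.index.
mask = df["colname"] > somenumberIchoose

# Step 2: indexing with that Series keeps only the rows where it is True.
subset = df[mask]

# The surviving index labels play the role of which()'s result,
# except that they are labels rather than 1-based positions.
labels = subset.index.tolist()
```

Here mask is False for the first row and True for the other three, so subset holds the rows labelled 11, 12 and 13.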

See http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
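On the enumerate() part of the question: enumerate() pairs each item of an iterable with its position, so the list comprehension needs to unpack both. A small sketch, using a plain list and a made-up threshold as stand-ins for df['colname'] and yesterday:

```python
# Made-up stand-ins for df['colname'] and yesterday from the question.
values = [3, 9, 4, 11]
yesterday = 5

# enumerate() yields (position, value) pairs, counting from 0.
pairs = list(enumerate(values))

# Unpack each pair and keep the position whenever the value passes the test.
indexfuture = [i for i, x in enumerate(values) if x > yesterday]
```

Note that iterating over a pandas Series this way gives 0-based positions, not index labels, and it is usually slower than the boolean indexing shown above.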

fdeheeger answered Sep 16 '22 18:09



From what I know of R, you might be more comfortable working with numpy, a scientific computing package similar to MATLAB.

If you want the indices of the array elements whose values are divisible by two, the following would work.

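import numpy

arr = numpy.arange(10)               # the integers 0 through 9
truth_table = arr % 2 == 0           # boolean mask: True where the value is even
indices = numpy.where(truth_table)   # tuple holding one array of matching positions
values = arr[indices]                # the even values themselves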

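It's also easy to work with multi-dimensional arrays:

arr2d = arr.reshape(2, 5)
row_index = 0  # assumed value; arr2d[row_index] selects one row of the array
col_indices = numpy.where(arr2d[row_index] % 2 == 0)[0]  # columns holding even values in that row
col_values = arr2d[row_index, col_indices]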
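To tie this back to which(), the same idea can be applied to a DataFrame column to get row positions and then slice by position. This is only a sketch: the data and threshold are invented, and numpy.flatnonzero is used here as one way to get a flat array of positions.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 'colname' and somenumberIchoose mirror the question.
df = pd.DataFrame({"colname": [5, 12, 7, 20]}, index=["a", "b", "c", "d"])
somenumberIchoose = 6

# Positions of the True entries, counted from 0 (R's which() counts from 1).
positions = np.flatnonzero(df["colname"].values > somenumberIchoose)

# iloc slices by position, much like df[which(...), ] in R.
subset = df.iloc[positions]
```

The boolean mask can usually be passed to the DataFrame directly, so the explicit positions are mainly useful when the integer locations themselves are needed.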
Dunes answered Sep 17 '22 18:09
