
What is the equivalent to R's match() for python Pandas/numpy?

I'm an R user and I cannot figure out the pandas equivalent of match(). I need to use this function to iterate over a bunch of files, grab a key piece of info from each, and merge it back into the current data structure on 'url'. In R I'd do something like this:

logActions <- read.csv("data/logactions.csv")
logActions$class <- NA

files = dir("data/textContentClassified/")
for( i in 1:length(files)){
    tmp <- read.csv(files[i])
    logActions$class[match(logActions$url, tmp$url)] <- 
            tmp$class[match(tmp$url, logActions$url)]
}

I don't think I can use merge() or join(), as each will overwrite logActions$class each time. I can't use update() or combine_first() either, as neither has the necessary indexing capabilities. I also tried making a match() function based on this SO post, but cannot figure out how to get it to work with DataFrame objects. Apologies if I'm missing something obvious.

Here's some python code that summarizes my ineffectual attempts to do something like match() in pandas:

from pandas import *
from numpy import nan as NaN  # recent pandas versions do not export NaN
left = DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'], 'action': [0, 1, 0]})
left["class"] = NaN
right1 = DataFrame({'url': ['foo.com'], 'class': [0]})
right2 = DataFrame({'url': ['bar.com'], 'class': [ 1]})

# Doesn't work:
left.join(right1, on='url')
merge(left, right1, on='url')

# Also doesn't work the way I need it to:
left = left.combine_first(right1)
left = left.combine_first(right2)
left 

# Also does something funky and doesn't really work the way match() does:
left = left.set_index('url', drop=False)
right1 = right1.set_index('url', drop=False)
right2 = right2.set_index('url', drop=False)

left = left.combine_first(right1)
left = left.combine_first(right2)
left

The desired output is:

       url  action  class
0  foo.com       0      0
1  foo.com       1      0
2  bar.com       0      1

BUT, I need to be able to call this over and over again so I can iterate over each file.

Solomon asked Apr 06 '13 21:04



2 Answers

Note the existence of pandas.match, which does precisely what R's match does.
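As far as I know, pandas.match was deprecated and dropped from later pandas releases, so it may not be importable on a current install. The following is a minimal sketch of the same idea, not the library's implementation: the helper name `match` and its `-1` sentinel are assumptions made for illustration. Unlike R, positions are 0-based and a missing value yields `-1` rather than `NA`.

```python
import numpy as np


def match(values, table):
    """Return, for each item in `values`, the position of its first
    occurrence in `table`, or -1 when the item is absent (hypothetical helper)."""
    first_pos = {}
    for pos, item in enumerate(table):
        # setdefault keeps the first position seen, like R's match()
        first_pos.setdefault(item, pos)
    return np.array([first_pos.get(v, -1) for v in values])


positions = match(["foo.com", "foo.com", "bar.com", "baz.com"],
                  ["bar.com", "foo.com"])
print(positions)
```

The returned positions can then be used to index the matching column of the lookup table, after masking out the `-1` entries.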

Wes McKinney answered Oct 03 '22 01:10


Edit:

If the url values in every right dataframe are unique, you can turn each right dataframe into a Series of klass indexed by url, and then get the klass of every url in left by indexing that Series with left.url.

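import numpy as np
import pandas as pd

left = pd.DataFrame({'url': ['foo.com', 'bar.com', 'foo.com', 'tmp', 'foo.com'], 'action': [0, 1, 0, 2, 4]})
left["klass"] = np.nan
right1 = pd.DataFrame({'url': ['foo.com', 'tmp'], 'klass': [10, 20]})
right2 = pd.DataFrame({'url': ['bar.com'], 'klass': [30]})

# Look up every left url in the right frame (NaN where the url is absent),
# realign to left's positional index, then fill only the still-missing values.
# reindex() is used because label indexing with missing keys raises in recent pandas.
left["klass"] = left.klass.combine_first(right1.set_index('url').klass.reindex(left.url).reset_index(drop=True))
left["klass"] = left.klass.combine_first(right2.set_index('url').klass.reindex(left.url).reset_index(drop=True))

print(left)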
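Since the question asks for something that can be called once per file, here is a hedged sketch of the same lookup wrapped in a reusable function. The name `fill_from` and its parameters are hypothetical, and the in-memory frames stand in for the files that would be read with `pd.read_csv` in the real loop. It assumes that when a url repeats in a right frame, its first row should win.

```python
import numpy as np
import pandas as pd


def fill_from(left, right, key="url", col="klass"):
    """Fill missing left[col] values by looking up left[key] in right
    (hypothetical helper; existing non-missing values are kept)."""
    # One value per key; the first occurrence wins if a key repeats.
    lookup = right.drop_duplicates(key).set_index(key)[col]
    # map() returns NaN for keys absent from the lookup, so unmatched rows stay NaN.
    left[col] = left[col].fillna(left[key].map(lookup))
    return left


left = pd.DataFrame({"url": ["foo.com", "bar.com", "foo.com", "tmp", "foo.com"],
                     "action": [0, 1, 0, 2, 4]})
left["klass"] = np.nan

# Stand-ins for the per-file dataframes.
rights = [
    pd.DataFrame({"url": ["foo.com", "tmp"], "klass": [10, 20]}),
    pd.DataFrame({"url": ["bar.com"], "klass": [30]}),
]

for right in rights:
    left = fill_from(left, right)

print(left)
```

Because each call only fills values that are still missing, the order of the files decides which one wins when several contain the same url.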

Is this what you want?

import numpy as np
import pandas as pd
left = pd.DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'], 'action': [0, 1, 0]})
left["class"] = np.nan
right1 = pd.DataFrame({'url': ['foo.com'], 'class': [0]})
right2 = pd.DataFrame({'url': ['bar.com'], 'class': [ 1]})

pd.merge(left.drop("class", axis=1), pd.concat([right1, right2]), on="url")

output:

   action      url  class
0       0  foo.com      0
1       1  foo.com      0
2       0  bar.com      1

If the class column in left is not all NaN, you can combine_first it with the result.
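A minimal sketch of that last step, under assumptions that are mine rather than the original answer's: the extra 'baz.com' row and the pre-filled value 5.0 are made up for illustration, urls in the right frames are unique, and `how="left"` is used so the merged rows line up one-to-one with left.

```python
import numpy as np
import pandas as pd

# left already has one known class value (5.0) that must not be overwritten.
left = pd.DataFrame({"url": ["foo.com", "foo.com", "bar.com", "baz.com"],
                     "action": [0, 1, 0, 1],
                     "class": [np.nan, 5.0, np.nan, np.nan]})
right1 = pd.DataFrame({"url": ["foo.com"], "class": [0]})
right2 = pd.DataFrame({"url": ["bar.com"], "class": [1]})

lookup = pd.concat([right1, right2], ignore_index=True)

# A left join keeps every row of left, in order; unmatched urls get NaN.
merged = pd.merge(left.drop("class", axis=1), lookup, on="url", how="left")

# Keep existing values in left and fill the gaps from the merged result.
left["class"] = left["class"].combine_first(merged["class"])

print(left)
```

If a url could appear more than once across the right frames, the left join would duplicate rows, so the lookup would need deduplicating first.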

HYRY answered Oct 03 '22 01:10