What is the equivalent to R's match() for python Pandas/numpy?

I'm an R user and I cannot figure out the pandas equivalent of match(). I need to use this function to iterate over a bunch of files, grab a key piece of info, and merge it back into the current data structure on 'url'. In R I'd do something like this:

logActions <- read.csv("data/logactions.csv")
logActions$class <- NA

files = dir("data/textContentClassified/")
for( i in 1:length(files)){
    tmp <- read.csv(files[i])
    logActions$class[match(logActions$url, tmp$url)] <- 
            tmp$class[match(tmp$url, logActions$url)]
}

I don't think I can use merge() or join(), as each will overwrite logActions$class each time. I can't use update() or combine_first() either, as neither has the necessary indexing capabilities. I also tried making a match() function based on this SO post, but cannot figure out how to get it to work with DataFrame objects. Apologies if I'm missing something obvious.

Here's some Python code that summarizes my ineffectual attempts to do something like match() in pandas:

from pandas import *
left = DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'], 'action': [0, 1, 0]})
left["class"] = NaN
right1 = DataFrame({'url': ['foo.com'], 'class': [0]})
right2 = DataFrame({'url': ['bar.com'], 'class': [ 1]})

# Doesn't work:
left.join(right1, on='url')
merge(left, right1, on='url')

# Also doesn't work the way I need it to:
left = left.combine_first(right1)
left = left.combine_first(right2)
left 

# Also does something funky and doesn't really work the way match() does:
left = left.set_index('url', drop=False)
right1 = right1.set_index('url', drop=False)
right2 = right2.set_index('url', drop=False)

left = left.combine_first(right1)
left = left.combine_first(right2)
left

The desired output is:

       url  action  class
0  foo.com       0      0
1  foo.com       1      0
2  bar.com       0      1

However, I need to be able to call this repeatedly, so that I can iterate over each file.

asked Apr 06 '13 21:04

Solomon

2 Answers

Note the existence of pandas.match, which does precisely what R's match does.
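As far as I know, recent pandas releases no longer expose a top-level match function, so here is a minimal sketch of the same lookup built on Index.get_indexer. The helper name r_match and the sample data are made up for illustration. It assumes the values in the lookup table are unique, returns 0-based positions, and uses -1 where R would return NA.

```python
import numpy as np
import pandas as pd


def r_match(x, table):
    """Rough stand-in for R's match(x, table) (hypothetical helper).

    Returns the 0-based position in `table` of each element of `x`,
    with -1 meaning "no match". Assumes the values in `table` are unique.
    """
    return pd.Index(table).get_indexer(x)


# Illustrative data, not taken from the question's files
left_url = ["foo.com", "foo.com", "bar.com", "baz.com"]
tmp = pd.DataFrame({"url": ["bar.com", "foo.com"], "class": [1, 0]})

pos = r_match(left_url, tmp["url"])

# Use the positions to pull the class values, leaving NaN where nothing matched
found = pos >= 0
klass = np.full(len(left_url), np.nan)
klass[found] = tmp["class"].to_numpy()[pos[found]]
print(pos)
print(klass)
```

Because unmatched entries come back as -1 rather than NA, they have to be masked out before indexing, as done with `found` above.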

answered Oct 03 '22 01:10

Wes McKinney

Edit:

If the url values in every right DataFrame are unique, you can turn each right DataFrame into a Series of klass values indexed by url, and then get the klass for every url in left by indexing that Series with left.url.

import numpy as np
import pandas as pd

left = pd.DataFrame({'url': ['foo.com', 'bar.com', 'foo.com', 'tmp', 'foo.com'], 'action': [0, 1, 0, 2, 4]})
left["klass"] = np.nan
right1 = pd.DataFrame({'url': ['foo.com', 'tmp'], 'klass': [10, 20]})
right2 = pd.DataFrame({'url': ['bar.com'], 'klass': [30]})

# reindex looks up each left url in the right Series and gives NaN where there is no match;
# combine_first then fills only the klass values that are still missing.
left["klass"] = left.klass.combine_first(right1.set_index('url').klass.reindex(left.url).reset_index(drop=True))
left["klass"] = left.klass.combine_first(right2.set_index('url').klass.reindex(left.url).reset_index(drop=True))

print(left)
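The same lookup can probably be written more compactly with Series.map, which also makes it easy to repeat for each file. This is only a sketch under the same assumption (url is unique within each right DataFrame); the tuple of frames stands in for whatever you read from disk.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"url": ["foo.com", "bar.com", "foo.com", "tmp", "foo.com"],
                     "action": [0, 1, 0, 2, 4]})
left["klass"] = np.nan
right1 = pd.DataFrame({"url": ["foo.com", "tmp"], "klass": [10, 20]})
right2 = pd.DataFrame({"url": ["bar.com"], "klass": [30]})

# Stand-in for the frames that would be read from each file
for right in (right1, right2):
    # Series mapping url -> klass; urls are assumed unique within each frame
    lookup = right.set_index("url")["klass"]
    # map() returns NaN for urls missing from lookup, so fillna only
    # touches rows that are still empty and found in this frame
    left["klass"] = left["klass"].fillna(left["url"].map(lookup))

print(left)
```

Values that were already filled in by an earlier frame are kept, so the order of the files decides which one wins when a url appears in more than one of them.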

Is this what you want?

import numpy as np
import pandas as pd

left = pd.DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'], 'action': [0, 1, 0]})
left["class"] = np.nan
right1 = pd.DataFrame({'url': ['foo.com'], 'class': [0]})
right2 = pd.DataFrame({'url': ['bar.com'], 'class': [1]})

# Drop the empty class column, then merge in the class values from all right frames at once
pd.merge(left.drop("class", axis=1), pd.concat([right1, right2]), on="url")

output:

   action      url  class
0       0  foo.com      0
1       1  foo.com      0
2       0  bar.com      1

If the class column in left is not all NaN, you can use combine_first to combine it with the result.
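A minimal sketch of that last step, assuming left already holds one known class value (the 5.0 below is made up for illustration) and that url is unique across the right frames. It uses how="left" so the merged rows stay aligned with left.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"url": ["foo.com", "foo.com", "bar.com"], "action": [0, 1, 0]})
# Hypothetical starting point: the second row already has a class
left["class"] = [np.nan, 5.0, np.nan]

right1 = pd.DataFrame({"url": ["foo.com"], "class": [0]})
right2 = pd.DataFrame({"url": ["bar.com"], "class": [1]})

# One lookup table built from all right frames (urls assumed unique across them)
lookup = pd.concat([right1, right2], ignore_index=True)

# A left merge keeps every row of left, in order, with a fresh 0..n-1 index
merged = pd.merge(left.drop("class", axis=1), lookup, on="url", how="left")

# Existing values in left win; gaps are filled from the merged result
left["class"] = left["class"].combine_first(merged["class"])
print(left)
```

Since combine_first aligns on the index, this relies on left having the default 0..n-1 index; with a different index you would need to reset it first.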

answered Oct 03 '22 01:10

HYRY

Tags: python, merge, pandas, r, match
