
Is there a way to access R data frame column names in python/rpy2?

Tags: python, r, rpy2

I have an R data frame, saved in Database02.Rda. Loading it

import rpy2.robjects as robjects
robjects.r.load("Database02.Rda")

works fine. However:

print(robjects.r.names("df"))

yields

NULL

Also, as an example, column 214 (213 if we count starting with 0) is named REGION.

print(robjects.r.table(robjects.r["df"][213]))

works fine:

Region 1   Region 2   ...
    9811       3451   ...

but we should also be able to do

print(robjects.r.table("df$REGION"))

This, however, results in

df$REGION 
        1

(which it does also for column names that do not exist at all); also:

print(robjects.r.table(robjects.r["df"]["REGION"]))

gives an error:

TypeError: SexpVector indices must be integers, not str

Now, the docs say that names cannot be used for subsetting in Python. Am I correct to assume that the column names are not imported with the rest of the data when loading the data frame with python/rpy2? If so, is the easiest way to access them to save and load them as a separate list and construct a dict (or similar) in Python that maps the names to the column index numbers? This does not seem very generic, however. Is there a way to extract the column names directly?

The versions of R, python, rpy2 I use are: R: 3.2.2 python: 3.5.0 rpy2: 2.7.8

0range asked Mar 03 '16 19:03



2 Answers

When doing the following, you are loading whatever objects are in Database02.Rda into R's "global environment".

import rpy2.robjects as robjects
robjects.r.load("Database02.Rda")

robjects.globalenv is an Environment. You can list its contents with:

tuple(robjects.globalenv.keys())

I understand that one of your objects is called df. You can access it with:

df = robjects.globalenv['df']

If df is a list or a data frame, you can access its named elements with rx2 (again, the documentation is your friend here). To get the one called REGION, do:

df.rx2("REGION")

Listing all named elements of a list or data frame is just as easy:

tuple(df.names) 
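
The name-to-index dictionary the question asks about can be built directly from df.names. This is only a minimal sketch: column_positions is a hypothetical helper name, and the SimpleNamespace object with made-up column names is a stand-in so the snippet runs without R. With rpy2 you would pass robjects.globalenv['df'] instead.

```python
from types import SimpleNamespace


def column_positions(frame):
    """Map each column name to its 0-based position.

    `frame` is anything exposing a `.names` iterable, for example the
    object returned by robjects.globalenv['df'].
    """
    return {name: pos for pos, name in enumerate(frame.names)}


# Stand-in for an rpy2 data frame (assumption: only .names is mimicked,
# and these column names are invented for illustration).
fake_df = SimpleNamespace(names=("ID", "AGE", "REGION"))

positions = column_positions(fake_df)
print(positions["REGION"])  # prints 2
```

The resulting 0-based position can then be used with integer indexing, as in robjects.r["df"][213] from the question.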
lgautier answered Nov 14 '22 22:11


If you run R code in Python, the global environment answer will not work. But kudos to @lgautier, the creator/maintainer of this package. In R, the dollar sign $ is used frequently. This is what I learned:

print(pamk_clusters$pamobject$clusinfo)

will not work, and its equivalent

print(pamk_clusters[["pamobject"]][["clusinfo"]])

also will not work ... however, after some digging in the "man"

http://rpy2.readthedocs.io/en/version_2.7.x/vector.html#extracting-r-style

Access to R-style extracting/subsetting is granted through the two delegators rx and rx2, representing the R functions [ and [[ respectively.

This works as expected

print(pamk_clusters.rx2("pamobject").rx2("clusinfo"))
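
For deeper nesting, the chained rx2 calls can be folded into one helper. The sketch below is an assumption-laden illustration, not part of rpy2: rx2_path is a hypothetical name, and FakeRList is a stand-in that mimics only the .rx2() lookup so the snippet runs without R.

```python
from functools import reduce


def rx2_path(obj, *names):
    """Follow a chain of named elements, like obj$a$b in R."""
    return reduce(lambda current, name: current.rx2(name), names, obj)


class FakeRList:
    """Stand-in that mimics only the .rx2() lookup of an rpy2 list."""

    def __init__(self, **elements):
        self._elements = elements

    def rx2(self, name):
        return self._elements[name]


# Placeholder data shaped like the pamk result discussed above.
nested = FakeRList(pamobject=FakeRList(clusinfo="cluster info placeholder"))
print(rx2_path(nested, "pamobject", "clusinfo"))
```

With a real rpy2 object, rx2_path(pamk_clusters, "pamobject", "clusinfo") would make the same calls as pamk_clusters.rx2("pamobject").rx2("clusinfo").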

I commented in the forums about "man" clarity:

https://bitbucket.org/rpy2/rpy2/issues/436/acessing-dataframe-elements-using-rpy2

I am using rpy2 on Win7 with ipython. To help others dig through the formatting, here is a setup that seems to work:

import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.packages import importr

base = importr('base')
utils = importr('utils')
utils.chooseCRANmirror(ind=1)

cluster = importr('cluster')
stats = importr('stats')
#utils.install_packages("fpc")
fpc = importr('fpc')

import pickle
with open ('points', 'rb') as fp:
    points = pickle.load(fp) 
# data above is stored as binary object
# online:  http://www.mshaffer.com/arizona/dissertation/points

import rpy2.robjects.numpy2ri as npr   
npr.activate()

k = robjects.IntVector(range(3, 8))   # r-syntax  3:7   # I expect 5
pamk_clusters = fpc.pamk(points,k)

print( base.summary(pamk_clusters) )
base.print( base.summary(pamk_clusters) )

utils.str(pamk_clusters)

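# R-style access does not work from Python, so these lines are commented out.
# The $ forms are not valid Python syntax:
# print(pamk_clusters$pamobject$clusinfo)
# base.print(pamk_clusters$pamobject$clusinfo)
# The [[...]] form indexes with a Python list, which did not work either:
# print(pamk_clusters[["pamobject"]][["clusinfo"]])

# Use the rx2 delegator instead:
print(pamk_clusters.rx2("pamobject").rx2("clusinfo"))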

pam_clusters = cluster.pam(points,5)        # much slower
kmeans_clusters = stats.kmeans(points,5)    # much faster

utils.str(kmeans_clusters)

print(kmeans_clusters.rx2("cluster"))

R has been a standard for statistical computing for nearly 25 years. It is based on S, which is about forty years old and dates from a time when computing efficiency mattered a lot. https://en.wikipedia.org/wiki/R_(programming_language)

Again, @lgautier, thank you for making R more readily accessible within Python.

mshaffer answered Nov 14 '22 20:11