Pitfalls in R for Python programmers

Tags:

python

r

I have mostly programmed in Python, but I am now learning the statistical programming language R. I have noticed some differences between the languages that tend to trip me up.

Suppose v is a vector/array with the integers from 1 to 5 inclusive.

v[3]  # in R: gives me the 3rd element of the vector: 3
      # in Python: is zero-based, gives me the integer 4
v[-1] # in R: removes the element with that index
      # in Python: gives me the last element in the array
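For comparison, here is a minimal sketch of how the same selections could be written in Python with a plain list. The variable names are only illustrative.

```python
# Python counterparts of the two R expressions above, using a plain list.
v = [1, 2, 3, 4, 5]

third = v[2]            # R's v[3]: Python indices start at 0
last = v[-1]            # Python's v[-1] is the last element
without_first = v[1:]   # R's v[-1]: everything except the first element

# R's v[-3] (drop the third element) has no single-index form in Python;
# build a new list from the two slices around it.
without_third = v[:2] + v[3:]

print(third, last, without_first, without_third)
```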

Are there any other pitfalls I have to watch out for?

BioGeek asked Jan 01 '11 12:01


4 Answers

Having written tens of thousands of lines of code in both languages, I find R a lot more idiosyncratic and less consistent than Python. It's really nice for quick plots and investigation on a small to medium-sized dataset, mainly because its built-in data frame object is nicer than the numpy/scipy equivalent, but you'll run into all kinds of weirdness once you do anything more complicated than one-liners. My advice is to use rpy2 (which unfortunately has a much worse UI than its predecessor, rpy), do as little as possible in R, and do the rest in Python.

For example, consider the following matrix code:

> u = matrix(1:9,nrow=3,ncol=3)
> v = u[,1:2]
> v[1,1]
[1] 1
> w = u[,1]
> w[1,1]
Error in w[1, 1] : incorrect number of dimensions

How did that fail? When a subset has extent one along any dimension (here, a single column), R "helpfully" drops that dimension and changes the type of the result. So w is a vector of integers rather than a matrix:

> class(v)
[1] "matrix"
> class(u)
[1] "matrix"
> class(w)
[1] "integer"

To avoid this, you need to actually pass an obscure keyword parameter:

> w2 = u[,1,drop=FALSE]
> w2[1,1]
[1] 1
> class(w2)
[1] "matrix"
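To make the dropping behavior concrete for Python readers, here is a rough emulation using nested lists. The helper select_columns is hypothetical (it is not an R or Python library function), and it only mimics the column case shown above. R applies the same rule to rows and to higher-dimensional arrays.

```python
def select_columns(matrix, cols, drop=True):
    """Hypothetical helper: pick columns from a list-of-rows matrix.

    With drop=True and a single column, return a flat list (like R's u[,1]);
    otherwise return a list of rows (like R's u[,1,drop=FALSE]).
    """
    picked = [[row[c] for c in cols] for row in matrix]
    if drop and len(cols) == 1:
        return [row[0] for row in picked]
    return picked


# Same values as R's matrix(1:9, nrow=3, ncol=3), which fills column by column.
u = [[1, 4, 7],
     [2, 5, 8],
     [3, 6, 9]]

v = select_columns(u, [0, 1])             # two columns: still two-dimensional
w = select_columns(u, [0])                # one column: collapses to a flat list
w2 = select_columns(u, [0], drop=False)   # one column, shape kept

print(v[0][0])    # two-level indexing works
print(w)          # flat list, so w[0][0] would raise a TypeError
print(w2[0][0])   # two-level indexing works again
```

The point of the sketch is that the result's shape depends on how many columns were selected, so code that indexes the result with two subscripts breaks only in the single-column case.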

There are a lot of nooks and crannies like that. Your best friends at the beginning will be introspection and online help tools like str, class, example, and of course help. Also, make sure to look at the example code on the R Graph Gallery and in Ripley's book Modern Applied Statistics with S-Plus.


EDIT: Here's another great example with factors.

> xx = factor(c(3,2,3,4))
> xx
[1] 3 2 3 4
Levels: 2 3 4
> yy = as.numeric(xx)
> yy
[1] 2 1 2 3

Holy cow! Converting a factor back to numeric didn't do the conversion you expected. Instead, as.numeric returns the factor's internal integer codes, that is, each value's position in the list of levels. This is a source of hard-to-find bugs for people who aren't aware of it, because it still returns integers and will in fact work some of the time (when the values happen to be exactly the consecutive integers 1, 2, ..., k).

This is what you actually need to do:

> as.numeric(levels(xx))[xx]
[1] 3 2 3 4
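For Python readers, the factor behavior can be sketched with two plain lists. This is only an emulation of the idea (levels plus 1-based codes), not how R implements factors internally, and the names below are illustrative.

```python
values = [3, 2, 3, 4]

# A factor keeps the sorted unique values as "levels" (stored as strings in R)
# plus one 1-based integer code per element pointing into those levels.
levels = [str(x) for x in sorted(set(values))]
codes = [levels.index(str(x)) + 1 for x in values]

# as.numeric(xx) hands back the codes, not the original numbers.
wrong = codes

# as.numeric(levels(xx))[xx] converts the levels first, then looks up each code.
numeric_levels = [float(s) for s in levels]
right = [numeric_levels[c - 1] for c in codes]

print(levels)  # ['2', '3', '4']
print(wrong)   # [2, 1, 2, 3]
print(right)   # [3.0, 2.0, 3.0, 4.0]
```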

Yeah, sure, that fact is on the factor help page, but you only end up there after you've lost a few hours to this bug. This is another example of how R does not do what you intend. Be very, very careful with anything involving type conversions or accessing elements of arrays and lists.

ramanujan answered Nov 20 '22 21:11


This doesn't specifically address the Python vs. R background, but The R Inferno is a great resource for programmers coming to R.

Christian answered Nov 20 '22 21:11


The accepted answer to this post is possibly a bit outdated. The Pandas Python library now provides amazing R-like DataFrame support.

Mike Vella answered Nov 20 '22 19:11


There may be... but before you embark on that, have you tried some of the available Python extensions? SciPy has a list.

Keith answered Nov 20 '22 21:11