NumPy for R user?

Long-time R and Python user here. I use R for my daily data analysis and Python for tasks heavier on text processing and shell scripting. I am working with increasingly large data sets, which usually arrive as binary or text files. Most of the time I apply statistical/machine learning algorithms and create statistical graphics. I sometimes use R with SQLite and write C for iteration-intensive tasks. Before looking into Hadoop, I am considering investing some time in NumPy/SciPy because I've heard it has better memory management (and the transition to NumPy/SciPy does not seem that big for someone with my background). I wonder if anyone has experience using the two and could comment on the improvements in this area, and whether there are idioms in NumPy that deal with this issue. (I'm also aware of rpy2, but I am wondering whether NumPy/SciPy can handle most of my needs.) Thanks.

422

asked Aug 23 '10 06:08

hatmatrix

2 Answers

R's strength when looking for an environment to do machine learning and statistics is most certainly the diversity of its libraries. To my knowledge, SciPy + SciKits cannot be a replacement for CRAN.

Regarding memory usage, R uses a pass-by-value paradigm (in practice, objects are copied when they are modified) while Python passes references to objects. Pass-by-value can lead to more "intuitive" code; pass-by-reference can help optimize memory usage. NumPy also lets you take "views" of arrays (a kind of subarray that shares memory with the original, without a copy being made).
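To make the "views" point concrete, here is a minimal sketch (the variable names are only illustrative). It assumes plain slicing, which returns a view, and contrasts it with indexing by a list of integers, which returns a copy:

```python
import numpy as np

a = np.arange(10)        # array holding 0..9
v = a[2:6]               # basic slicing returns a view, not a copy
v[0] = 100               # writing through the view changes the original

c = a[[2, 3, 4, 5]]      # indexing with a list of integers returns a copy
c[0] = -1                # the original array is left untouched

shares_view = np.shares_memory(a, v)   # True: same underlying buffer
shares_copy = np.shares_memory(a, c)   # False: separate buffer
print(a[2], shares_view, shares_copy)
```

If you need an independent subarray, call .copy() on the slice explicitly.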

Regarding speed, pure Python is faster than pure R for accessing individual elements in an array, but this advantage disappears when dealing with numpy arrays (benchmark). Fortunately, Cython lets one get serious speed improvements easily.
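As a rough illustration of why per-element access is the slow path with NumPy arrays, the sketch below computes the same sum twice: once by indexing one element at a time from Python, and once with a single vectorized call. This is not the benchmark referred to above and no timings are claimed; it only shows the two styles side by side.

```python
import numpy as np

values = np.arange(1_000, dtype=np.float64)

# Element-by-element access from Python: every values[i] builds a NumPy
# scalar object, so the loop overhead is paid once per element.
total_loop = 0.0
for i in range(values.size):
    total_loop += values[i]

# Vectorized form: the loop over elements runs in compiled code.
total_vec = values.sum()

print(total_loop == total_vec)
```

When a computation cannot be expressed in vectorized form, that is where a tool such as Cython becomes attractive.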

If working with Big Data, I find the support for storage-based arrays better with Python (HDF5).
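HDF5 access from Python needs a separate library, so as a NumPy-only stand-in for the same idea (an array that lives on disk and is read in pieces), here is a sketch using np.memmap. Note that this is a raw binary file, not HDF5, and the file name and array shape are made up for the example:

```python
import os
import tempfile
import numpy as np

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "big_array.dat")

# Create a disk-backed array and fill it.
disk = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000, 50))
disk[:] = 1.0
disk[10, :] = 5.0
disk.flush()
del disk                      # drop the writable mapping

# Reopen read-only and reduce it one block of rows at a time,
# so only the current block has to be resident in memory.
ro = np.memmap(path, dtype=np.float64, mode="r", shape=(1000, 50))
col_sums = np.zeros(50)
for start in range(0, ro.shape[0], 100):
    col_sums += ro[start:start + 100].sum(axis=0)
```

Unlike HDF5, a memmap file stores no metadata, so the dtype and shape have to be supplied again when reopening it.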

I am not sure you should ditch one for the other but rpy2 can help you explore your options about a possible transition (arrays can be shuttled between R and Numpy without a copy being made).

185

answered Oct 07 '22 19:10

lgautier

I use NumPy daily and R nearly so.

For heavy number crunching, I prefer NumPy to R by a large margin (including R packages, like 'Matrix'). I find the syntax cleaner, the function set larger, and computation quicker (although I don't find R slow by any means). I do not think NumPy's broadcasting functionality, for instance, has an analog in R.

For instance, to read in a data set from a csv file and 'normalize' it for input to an ML algorithm (e.g., mean center then re-scale each dimension) requires just this:

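import numpy as NP

# data1 is the csv file (a path or an open file object)
data = NP.loadtxt(data1, delimiter=",")    # 'data' is a NumPy array
data -= NP.mean(data, axis=0)              # mean-center each column
data /= NP.max(data, axis=0)               # re-scale each column by its maximum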
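The snippet above relies on broadcasting: a per-column vector of shape (n_columns,) is combined with the full (n_rows, n_columns) array without any explicit loop. Here is a self-contained sketch of the same steps, using a small in-memory stand-in for the csv file (the three rows of numbers are invented for the example):

```python
import io
import numpy as np

# Hypothetical in-memory replacement for the csv file: 3 rows, 2 columns.
csv_text = io.StringIO("1.0,10.0\n2.0,20.0\n3.0,30.0\n")

data = np.loadtxt(csv_text, delimiter=",")   # shape (3, 2)
col_means = data.mean(axis=0)                # shape (2,)
data -= col_means                            # (3, 2) - (2,): applied to every row
col_max = data.max(axis=0)                   # maximum of each centered column
data /= col_max                              # each column now peaks at 1.0

print(data)
```

A constant column would have a centered maximum of zero, so real code would want to guard against dividing by it.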

Also, I find that when coding ML algorithms, I need data structures that I can operate on element-wise and that also understand linear algebra (e.g., matrix multiplication, transpose, etc.). NumPy gets this and allows you to create these hybrid structures easily (no operator overloading or subclassing, etc.).

You won't be disappointed by NumPy/SciPy; more likely you'll be amazed.
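A small sketch of what that looks like in practice: the same ndarray supports element-wise arithmetic and linear algebra. The @ operator used here was added to Python after this answer was written; np.dot(A, B) is the older spelling of the same matrix product.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

elementwise = A * B      # element-by-element product
matmul = A @ B           # matrix product, same as np.dot(A, B)
transposed = A.T         # transpose (a view, no data copied)
gram = A.T @ A           # transpose and matrix product combined

print(elementwise)
print(matmul)
```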

So, a few recommendations--in general and in particular, given the facts in your question:

Install both NumPy and SciPy. As a rough guide, NumPy provides the core data structures (in particular the ndarray) and SciPy (which is actually several times larger than NumPy) provides the domain-specific functions (e.g., statistics, signal processing, integration).
Install the repository versions, particularly w/r/t NumPy, because the dev version is 2.0. Matplotlib and NumPy are tightly integrated; you can use one without the other, of course, but both are the best in their respective class among Python libraries. You can get all three via easy_install, which I assume you already have.
NumPy/SciPy have several modules specifically directed to Machine Learning/Statistics, including the Clustering package and the Statistics package.
There are also packages directed to general computation that make coding ML algorithms a lot faster, in particular Optimization and Linear Algebra.
There are also the SciKits, not included in the base NumPy or SciPy libraries; you need to install them separately. Generally speaking, each SciKit is a set of convenience wrappers to streamline coding in a given domain. The SciKits you are likely to find most relevant are: ann (approximate Nearest Neighbor), and learn (a set of ML/Statistics regression and classification algorithms, e.g., Logistic Regression, Multi-Layer Perceptron, Support Vector Machine).
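To give a feel for the SciPy packages named in the list above, here is a brief sketch touching each of them. It assumes they correspond to scipy.cluster, scipy.stats, scipy.optimize and scipy.linalg, and it uses invented toy data:

```python
import numpy as np
from scipy import linalg, optimize, stats
from scipy.cluster.vq import kmeans2

# Toy data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                 rng.normal(5.0, 0.1, size=(20, 2))])

# Clustering: k-means started from one point of each blob.
init = pts[[0, -1]]
centroids, labels = kmeans2(pts, init, minit="matrix")

# Statistics: Pearson correlation of an exactly linear relationship.
x = np.linspace(0.0, 1.0, 11)
y = 2.0 * x + 1.0
r, p_value = stats.pearsonr(x, y)

# Optimization: minimize a simple quadratic with its minimum at 3.
result = optimize.minimize(lambda w: np.sum((w - 3.0) ** 2), x0=np.zeros(1))

# Linear algebra: least-squares fit of slope and intercept.
A = np.column_stack([x, np.ones_like(x)])
coef, residues, rank, sv = linalg.lstsq(A, y)
```

Each call takes and returns plain NumPy arrays, which is what makes it easy to chain these pieces together in an ML pipeline.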

answered Oct 07 '22 19:10

doug

Tags: python, r, numpy, scipy
