 

R as a general purpose programming language [closed]

I like Python because it has rich built-in types such as sets, dicts, lists, and tuples. These structures make it easy to write short scripts to process data.

On the other hand, R is like Matlab: its data types are scalar, vector, data frame, array, and list, but it lacks sets, dicts, tuples, and the like. I know the list type is powerful, and many operations can be thought of as list processing, but the idea of using R as a general-purpose language still seems vague to me.

(The following is just an example; it does not mean my focus is on text processing/mining.)

For example, I need to compute TF-IDF counts for a set of news articles (say, 200,000 articles in a folder and its subfolders).

After I read the files, I need to do word-to-ID mapping and other counting tasks. These tasks involve string manipulation and need containers like set or map.

I know I can use another language to do these processing and load the data into R. But maybe (for small things) putting all preprocessing into a single R script is better.

So my question is: does R have rich enough data structures at the language level? If not, are there packages that provide good extensions to the language?

Yin Zhu asked Dec 02 '10

1 Answer

I think that R's data pre-processing capability (i.e., everything from extracting data from its source up to just before the analytics steps) has improved substantially in the past three years, the length of time I have been using R. I have used Python daily for the past seven years or so, and its text-processing capabilities are superb; still, I wouldn't hesitate for a moment to use R for the type of task you mention.

A couple of provisos, though. First, I would suggest looking very closely at a couple of external packages for the set of tasks in your question; in particular, hash (a Python-like key-value data structure) and stringr (which consists mostly of wrappers over the less user-friendly string-manipulation functions in the base library).

Both stringr and hash are available on CRAN.

> library(hash)
> dx = hash(k1=453, k2=67, k3=913)
> dx$k1
  [1] 453
> dx = hash(keys=letters[1:5], values=1:5)
> dx
  <hash> containing 5 key-value pair(s).
   a : 1
   b : 2
   c : 3
   d : 4
   e : 5

> dx['a']
  <hash> containing 1 key-value pair(s).
  a : 1

> library(stringr)
> astring = 'onetwothree456seveneight'
> ptn = '[0-9]{3,}'
> a = str_extract_all(astring, ptn)
> a
  [[1]]
  [2] "456"

It also seems that there is a large subset of R users for whom text processing and text analytics comprise a significant portion of their day-to-day work, as evidenced by CRAN's Natural Language Processing Task View (one of about 20 such informal, domain-oriented package collections). Within that Task View is tm, a package dedicated to text-mining functions. tm includes optimized functions for processing tasks such as the one mentioned in your question.
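As a rough sketch of how tm might be applied to the task in your question, here is one way to build a TF-IDF-weighted document-term matrix (the 'articles' directory is a placeholder, and this assumes a reasonably recent version of tm):

> library(tm)
> # read every text file under 'articles/', recursing into subfolders
> corp = VCorpus(DirSource('articles', recursive=TRUE))
> corp = tm_map(corp, content_transformer(tolower))
> corp = tm_map(corp, removePunctuation)
> # weight the document-term matrix by TF-IDF rather than raw counts
> dtm = DocumentTermMatrix(corp, control=list(weighting=weightTfIdf))
> inspect(dtm[1:3, 1:5])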

In addition, R has an excellent selection of packages for working interactively on reasonably large datasets (e.g., > 1 GB), often without the need to set up a parallel-processing infrastructure (though they can certainly exploit a cluster if one is available). The most impressive of these, in my opinion, is the set of packages under the rubric "The Bigmemory Project" (on CRAN) by Michael Kane and John Emerson at Yale; this project subsumes bigmemory, biganalytics, synchronicity, bigtabulate, and bigalgebra. In sum, the techniques behind these packages include: (i) allocating the data to shared memory, which enables coordination of shared access by separate concurrent processes to a single copy of the data; and (ii) file-backed data structures (which I believe, though I am not certain, is synonymous with a memory-mapped file structure), which enable very fast access from disk using pointers, thus avoiding RAM limits on dataset size.
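To give a flavor of point (ii), here is a minimal sketch using bigmemory's file-backed matrices (the file names are placeholders): the data live on disk, and a separate R process can attach to the same single copy:

> library(bigmemory)
> # a matrix backed by a file on disk rather than held in RAM
> x = filebacked.big.matrix(nrow=1e6, ncol=10, type='double',
+       backingfile='x.bin', descriptorfile='x.desc')
> x[1, 1] = 3.14
> # a second R process can attach to the same on-disk data
> y = attach.big.matrix('x.desc')
> y[1, 1]
  [1] 3.14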

Still, quite a few functions and data structures in R's standard library make it easier to work interactively with data approaching ordinary RAM limits. For instance, .RData, a native binary format, is about as simple as possible to use (the commands are save and load), and it has excellent compression:

> library(ElemStatLearn)
> data(spam)
> format(object.size(spam), big.mark=',')
  [1] "2,344,384" # a 2.34 MB data file
> save(spam, file='test.RData')

This file, 'test.RData', is only 176 KB: greater than 10-fold compression.
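Reading the data back in is just as simple:

> rm(spam)            # drop the object from the workspace
> load('test.RData')  # restores 'spam' from disk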

doug answered Sep 21 '22