
What software package can you suggest for a programmer who rarely works with statistics?

Tags:

r

statistics

Being a programmer I occasionally find the need to analyze large amounts of data such as performance logs or memory usage data, and I am always frustrated by how much time it takes me to do something that I expect to be easier.

To put the question in context, here is an excerpt from a CSV file I received today (heavily filtered for brevity):

date,time,PS Eden Space used,PS Old Gen Used, PS Perm Gen Used
2011-06-28,00:00:03,45004472,184177208,94048296
2011-06-28,00:00:18,45292232,184177208,94048296

I have about 100,000 data points like this with different variables that I want to plot in a scatter plot in order to look for correlations. Usually the data needs to be processed in some way for presentation purposes (such as converting nanoseconds to milliseconds and rounding fractional values), some columns may need to be added or inverted, or combined (like the date/time columns).

The usual recommendation for this kind of work is R and I have recently made a serious effort to use it, but after a few days of work my experience has been that most tasks that I expect to be simple seem to require many steps and have special cases; solutions are often non-generic (for example, adding a data set to an existing plot). It just seems to be one of those languages that people love because of all the powerful libraries that have accumulated over the years rather than the quality and usefulness of the core language.

Don't get me wrong, I understand the value of R to people who are using it, it's just that given how rarely I spend time on this kind of thing I think that I will never become an expert on it, and to a non-expert every single task just becomes too cumbersome.

Microsoft Excel is great in terms of usability but it just isn't powerful enough to handle large data sets. Also, both R and Excel tend to freeze completely (!) with no way out other than waiting or killing the process if you accidentally make the wrong kind of plot over too much data.

So, Stack Overflow, can you recommend something that is better suited for me? I'd hate to have to give up and develop my own tool; I have enough projects already. I'd love something interactive that could use hardware acceleration for the plot and/or culling to avoid spending too much time on rendering.

flodin asked Dec 04 '22 21:12


2 Answers

@flodin It would have been useful for you to provide an example of the code you use to read such a file into R. I regularly work with data sets of the size you mention and do not have the problems you describe. One thing that might be biting you if you don't use R often is that if you don't tell R what the column types are, it has to do some snooping on the file first, and that all takes time. Look at the argument colClasses in ?read.table.

For your example file, I would do:

dat <- read.csv("foo.csv", colClasses = c(rep("character",2), rep("integer", 3)))

then post process the date and time variables into an R date-time object class such as POSIXct, with something like:

dat <- transform(dat, dateTime = as.POSIXct(paste(date, time)))

As an example, let's read in your example data set, replicate it 50,000 times and write it out, then time different ways of reading it in, with foo containing your data:

> foo <- read.csv("log.csv")
> foo
        date     time PS.Eden.Space.used PS.Old.Gen.Used
1 2011-06-28 00:00:03           45004472       184177208
2 2011-06-28 00:00:18           45292232       184177208
  PS.Perm.Gen.Used
1         94048296
2         94048296

Replicate that, 50000 times:

out <- data.frame(matrix(nrow = nrow(foo) * 50000, ncol = ncol(foo))) 
out[, 1] <- rep(foo[,1], times = 50000) 
out[, 2] <- rep(foo[,2], times = 50000) 
out[, 3] <- rep(foo[,3], times = 50000) 
out[, 4] <- rep(foo[,4], times = 50000) 
out[, 5] <- rep(foo[,5], times = 50000)
names(out) <- names(foo)

Write it out:

write.csv(out, file = "bigLog.csv", row.names = FALSE)

Time loading the naive way and the proper way:

system.time(in1 <- read.csv("bigLog.csv"))
system.time(in2 <- read.csv("bigLog.csv",
                            colClasses = c(rep("character",2), 
                                           rep("integer", 3))))

Which is very quick on my modest laptop:

> system.time(in1 <- read.csv("bigLog.csv"))
   user  system elapsed 
  0.355   0.008   0.366 
> system.time(in2 <- read.csv("bigLog.csv",
                              colClasses = c(rep("character",2), 
                                             rep("integer", 3))))
   user  system elapsed 
  0.282   0.003   0.287

For both ways of reading in.

As for plotting, the graphics can be a bit slow, but depending on your OS this can be sped up a bit by changing the device you plot on. On Linux, for example, don't use the default X11() device, which uses Cairo; instead try the old X window device without anti-aliasing (X11(type = "Xlib")). Also, what are you hoping to see with a data set as large as 100,000 observations on a graphics device with not many pixels? Perhaps try to rethink your strategy for data analysis; no stats software will be able to save you from doing something ill-advised.

It sounds as if you are developing code/analysis as you go along, on the full data set. It would be far more sensible to just work with a small subset of the data when developing new code or new ways of looking at your data, say with a random sample of 1000 rows, and work with that object instead of the whole data object. That way you guard against accidentally doing something that is slow:

working <- out[sample(nrow(out), 1000), ]

for example. Then use working instead of out. Alternatively, whilst testing and writing a script, set argument nrows to say 1000 in the call to load the data into R (see ?read.csv). That way whilst testing you only read in a subset of the data, but one simple change will allow you to run your script against the full data set.

For data sets of the size you are talking about, I see no problem whatsoever in using R. Your point about not becoming expert enough to use R will more than likely apply to other scripting languages that might be suggested, such as Python. There is a barrier to entry, but that is to be expected if you want the power of a language such as Python or R. If you write scripts that are well commented (instead of just plugging away at the command line), and focus on a few key data import/manipulation tasks, a bit of plotting and some simple analysis, it shouldn't take long to master that small subset of the language.

Gavin Simpson answered Dec 07 '22 23:12


R is a great tool, but I have never had to resort to using it. Instead, I find Python more than adequate when I need to pull data out of huge logs. Python really comes with "batteries included", with built-in support for working with CSV files.

The simplest example of reading a CSV file:

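import csv
# Python 3: open in text mode with newline='' as the csv module expects
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)  # each row is a list of strings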

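To use another separator, e.g. tab, and extract the n-th column, use:

import csv
n = 2  # zero-based index of the column to extract (example value)
with open('spam.csv', newline='') as f:
    spamReader = csv.reader(f, delimiter='\t')
    for row in spamReader:
        print(row[n])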

To operate on columns, use the built-in list data type; it's extremely versatile!
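As a rough sketch of what that can look like for the log excerpt in the question: the rows are transposed into one list per column with zip, the date and time columns are combined into datetime objects, and a byte count column is converted to MiB. The inline sample text and the names columns, timestamps and eden_mib are illustrative assumptions, not something from the question or a fixed recipe.

```python
import csv
import io
from datetime import datetime

# Two rows copied from the question; a real script would open the file instead.
SAMPLE = """date,time,PS Eden Space used,PS Old Gen Used, PS Perm Gen Used
2011-06-28,00:00:03,45004472,184177208,94048296
2011-06-28,00:00:18,45292232,184177208,94048296
"""

# skipinitialspace drops the stray blank after a comma in the header line.
reader = csv.reader(io.StringIO(SAMPLE), skipinitialspace=True)
header = next(reader)
rows = list(reader)

# Transpose the row lists into one list per column, keyed by header name.
columns = {name: list(values) for name, values in zip(header, zip(*rows))}

# Combine the date and time columns into datetime objects.
timestamps = [
    datetime.strptime(d + " " + t, "%Y-%m-%d %H:%M:%S")
    for d, t in zip(columns["date"], columns["time"])
]

# Convert a byte count column to MiB, rounded to one decimal place.
eden_mib = [round(int(v) / 2**20, 1) for v in columns["PS Eden Space used"]]

print(timestamps[0], eden_mib)
```

Keeping one list per column means that adding a derived column is a single list comprehension, and the lists can be handed directly to a plotting call.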

To create beautiful plots I use matplotlib (scatter plot, code).
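A minimal sketch of such a scatter plot follows, assuming matplotlib is installed and that two columns are already available as equal-length lists of numbers. The helper names thin and save_scatter, the 5000-point cap and the output file name are hypothetical choices made for illustration. Thinning the data before plotting is an added assumption, intended to keep rendering fast with around 100,000 points.

```python
import random


def thin(xs, ys, max_points=5000, seed=0):
    """Return at most max_points (x, y) pairs, sampled without replacement."""
    pairs = list(zip(xs, ys))
    if len(pairs) > max_points:
        pairs = random.Random(seed).sample(pairs, max_points)
    return [p[0] for p in pairs], [p[1] for p in pairs]


def save_scatter(xs, ys, path, xlabel="", ylabel=""):
    """Write a scatter plot of xs against ys to an image file."""
    # Imported here so the thinning helper works even without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(xs, ys, s=2)  # small markers suit dense data
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    fig.savefig(path)
    plt.close(fig)


# Example use, given two equal-length lists of numbers:
#   x, y = thin(eden, old_gen)
#   save_scatter(x, y, "scatter.png", "PS Eden Space used", "PS Old Gen Used")
```

Sampling pairs rather than each list separately keeps every x value matched with its own y value, so any correlation in the data survives the thinning.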

The Python tutorial is a great way to get started! If you get stuck, there is always Stack Overflow ;-)

Fredrik Pihl answered Dec 07 '22 23:12