
Simple data operations: R vs python

I have been happily using R for data analysis. By data analysis I mean: given a relatively small table (<1 million rows, <100 columns), answer 'complicated' questions about the data, such as 'for each instance, what was the last event that happened before a specific point in time that varies with the instance', and so forth.

In recent times I was put into an environment where people are using python. As far as I know, the only package for doing these things is pandas. Tried though I have, I am still struggling (after a few weeks) with the most simple operations. Let us consider this scenario: I am looking at processes (identified by a column PROC_ID) that consist of different events sorted by a column 'SORT_NR'. For some weird reason I want to do the following: Given a fixed process id proc_id I want to add a certain number 'add' to all SORT_NR such that SORT_NR >= start for a fixed parameter start. Example:

PROC_ID | SORT_NR
      A |       1
      A |       2
      A |       3
      A |       4
      A |       5
      B |       1
      B |       2

I now call this function with proc_id=A, start=3, add=2, so the expected result would be

PROC_ID | SORT_NR
      A |       1
      A |       2
      A |       5 <<< 2 was added
      A |       6 <<< 2 was added
      A |       7 <<< 2 was added
      B |       1
      B |       2

googling gave me the answer that this can be done via

df.loc[(df['PROC_ID'] == proc_id) & (df['SORT_NR'] >= start), 'SORT_NR'] = df.loc[(df['PROC_ID'] == proc_id) & (df['SORT_NR'] >= start), 'SORT_NR'] + add

I am writing that explicitly without formatting it in order to make it clear: this command is a mess. Looking at it, you do not have a chance to grasp easily what it is about. Let us now look at the corresponding command in R's data.table package:

df[PROC_ID == proc_id & SORT_NR >= start, SORT_NR := SORT_NR + add]

so we see

  1. in pandas we have a lot of repetition (you always have to repeat df if you want to access its columns, which is not only unnecessary but even harmful if you rename the table)
  2. we have additional completely unnecessary special characters: ' and brackets. That just distracts the eye.
  3. all in all we use 154 characters for the pandas command and 68 (less than half!) characters in data.table

I do not want to start an 'R vs python' flame war; I just want to know:

Am I using pandas in the wrong way? Is there some hidden knowledge that is somehow not available to me?

or

Is pandas just not very 'efficient'? (in the sense that there is a lot of repetition and clutter that makes things hard to read and to understand)

In the second case: why do so many people prefer python over R?

EDIT: There are so many more confusing examples. I hardly execute a single command that reacts as expected:

'EXPERIMENT_NUMBER' in process_events.columns
Out[10]: True
'EXPERIMENT_ID' in process_events.columns
Out[11]: True
process_events.drop(['EXPERIMENT_NUMBER', 'EXPERIMENT_ID'])

Traceback (most recent call last):
  ...
    raise KeyError("{} not found in axis".format(labels[mask]))
KeyError: "['EXPERIMENT_NUMBER' 'EXPERIMENT_ID'] not found in axis"
Fabian Werner asked Aug 14 '19 08:08


1 Answer

I know you've written it purposefully verbosely, but it can be written much more simply with a variable and the += operator:

df.loc[(df['PROC_ID'] == proc_id) & (df['SORT_NR'] >= start), 'SORT_NR'] = \
    df.loc[(df['PROC_ID'] == proc_id) & (df['SORT_NR'] >= start), 'SORT_NR'] + add

becomes:

sorted_procs = (df['PROC_ID'] == proc_id) & (df['SORT_NR'] >= start)
df.loc[sorted_procs, 'SORT_NR'] += add

I don't know about R, but in Python it's common to break complex operations into named steps like this; it fits the Zen of Python ('Readability counts'). The way I've written it is more conducive to readability: it's clear what each line does just by glancing at it, and the mask can be reused later.
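For reference, here is a self-contained sketch of the two-line version applied to the example table from the question. The names df, proc_id, start and add follow the question; the name `mask` is only an illustrative choice.

```python
import pandas as pd

# Example table from the question.
df = pd.DataFrame({
    "PROC_ID": ["A", "A", "A", "A", "A", "B", "B"],
    "SORT_NR": [1, 2, 3, 4, 5, 1, 2],
})

proc_id, start, add = "A", 3, 2

# Step 1: boolean mask selecting the rows to change.
mask = (df["PROC_ID"] == proc_id) & (df["SORT_NR"] >= start)

# Step 2: update only the selected rows; all other rows are untouched.
df.loc[mask, "SORT_NR"] += add

print(df)
```

After this runs, the SORT_NR values for process A are 1, 2, 5, 6, 7 and those for process B are unchanged, matching the expected result in the question.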

Your R example does look more succinct, but Python is much more general-purpose, so one-liners like that don't necessarily fit within its design goals. You're right that it takes more characters to express certain operations, but that is because pandas was designed for Python, which is not a "data-first" language.
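If the repeated df is the main annoyance, one option that may be worth trying is DataFrame.eval, which lets you write column names bare inside a string expression and refer to local Python variables with an @ prefix. This is only a sketch, assuming a reasonably recent pandas; it is not necessarily faster, just less repetitive.

```python
import pandas as pd

# Example table from the question.
df = pd.DataFrame({
    "PROC_ID": ["A", "A", "A", "A", "A", "B", "B"],
    "SORT_NR": [1, 2, 3, 4, 5, 1, 2],
})

proc_id, start, add = "A", 3, 2

# Column names appear bare in the expression; @name refers to a local variable.
mask = df.eval("PROC_ID == @proc_id and SORT_NR >= @start")

# Same in-place update as before, using the mask built by eval.
df.loc[mask, "SORT_NR"] += add

print(df)
```

The condition now reads much like the data.table version, at the cost of putting the expression inside a string.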

So to answer your question: yes, in cases like this there is more repetition with pandas, but writing it in the Zen-of-Python style shown here (one named step per line) will make it easier to read.
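Regarding the drop call in the question's EDIT: DataFrame.drop looks up labels in the row index by default (axis=0), which is why column names are reported as "not found in axis". Passing them via columns= (or axis=1) should do what was intended. A minimal sketch, using a made-up stand-in for process_events since its real contents are not shown in the question:

```python
import pandas as pd

# Hypothetical stand-in for process_events; only the column names matter here.
process_events = pd.DataFrame({
    "EXPERIMENT_NUMBER": [1, 2],
    "EXPERIMENT_ID": [10, 20],
    "SORT_NR": [1, 2],
})

# Default axis=0 searches the row labels (0, 1), so this raises KeyError.
try:
    process_events.drop(["EXPERIMENT_NUMBER", "EXPERIMENT_ID"])
except KeyError as err:
    print("KeyError:", err)

# Naming the labels as columns drops them; the result is a new DataFrame.
trimmed = process_events.drop(columns=["EXPERIMENT_NUMBER", "EXPERIMENT_ID"])
print(list(trimmed.columns))
```

Note that drop returns a new DataFrame rather than modifying process_events, so the result has to be assigned to a name.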

Adam Smith answered Oct 16 '22 11:10