
Python with Numpy/Scipy vs. Pure C++ for Big Data Analysis [closed]

Working in Python on relatively small projects makes me appreciate the dynamically typed nature of the language (no need for declaration code to keep track of types), which often makes for a quicker and less painful development process. However, I feel that in much larger projects this may actually be a hindrance, as the code would run slower than, say, its equivalent in C++. But then again, using Numpy and/or Scipy with Python may get your code to run just as fast as a native C++ program (while the C++ code would sometimes take longer to develop).

I post this question after reading Justin Peel's comment on the thread "Is Python faster and lighter than C++?", where he states: "Also, people who speak of Python being slow for serious number crunching haven't used the Numpy and Scipy modules. Python is really taking off in scientific computing these days. Of course, the speed comes from using modules written in C or libraries written in Fortran, but that's the beauty of a scripting language in my opinion." Or, as S. Lott writes on the same thread regarding Python: "...Since it manages memory for me, I don't have to do any memory management, saving hours of chasing down core leaks." I also looked at a related Python/Numpy/C++ performance question, "Benchmarking (python vs. c++ using BLAS) and (numpy)", where J.F. Sebastian writes: "...There is no difference between C++ and numpy on my machine."

Both of these threads got me wondering: is there any real advantage conferred by knowing C++ for a Python programmer who uses Numpy/Scipy to produce software for analyzing 'big data', where performance is obviously of great importance (but code readability and development speed are also a must)?

Note: I'm especially interested in handling huge text files. Text files on the order of 100K-800K lines with multiple columns, where Python could take a good five minutes to analyze a file "only" 200K lines long.

asked Jul 31 '14 by warship


2 Answers

First off, if the bulk of your "work" comes from processing huge text files, that often means your only meaningful bottleneck is disk I/O speed, regardless of programming language.


As to the core question, it's probably too opinion-rich to "answer", but I can at least give you my own experience. I've been writing Python to do big data processing (weather and environmental data) for years. I have never once encountered significant performance problems due to the language.

Something that developers (myself included) tend to forget is that once the process runs fast enough, it's a waste of company resources to spend time making it run any faster. Python (using mature tools like pandas/scipy) runs fast enough to meet the requirements, and it's fast to develop, so for my money, it's a perfectly acceptable language for "big data" processing.
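As an illustration, here is a minimal sketch of the chunked, I/O-bound style of processing that pandas supports (pandas.read_csv and its chunksize parameter are real; the file name, separator, and column name below are hypothetical):

import pandas as pd

# Stream a large delimited text file in fixed-size chunks so memory use
# stays bounded; most of the wall-clock time goes to reading from disk.
total = 0.0
count = 0
for chunk in pd.read_csv("observations.txt", sep="\t", chunksize=100_000):
    total += chunk["value"].sum()
    count += len(chunk)

print("mean value:", total / count)

Processed this way, the file never has to fit in memory, and the Python-level work per chunk is negligible next to the read itself.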

answered Sep 30 '22 by Henry Keiter

The short answer is that for simple problems there should not be much difference. If you want to do anything complicated, you will quickly run into stark performance differences.

As a simple example, try adding three vectors together:

a = b + c + d

In Python, as I understand it, this generally adds b to c, adds the result to d, and then makes the variable a point to that final result. Each of those operations is fast on its own, since the per-element work is farmed out to compiled loops. However, if the vectors are large, the intermediate result cannot fit in cache, and moving it to and from main memory is slow.
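For illustration, here is a minimal NumPy sketch of the temporaries involved and one way to avoid the extra allocation (the out= parameter of NumPy ufuncs is real; the array size is arbitrary):

import numpy as np

N = 10_000_000  # arbitrary size, large enough that temporaries spill out of cache
b = np.random.rand(N)
c = np.random.rand(N)
d = np.random.rand(N)

# The one-liner allocates an intermediate array for (b + c), then a second
# array for the final result; each pass streams through main memory.
a = b + c + d

# Reusing a preallocated output buffer avoids the second temporary:
a = np.empty_like(b)
np.add(b, c, out=a)  # writes b + c directly into a
np.add(a, d, out=a)  # adds d in place, no extra temporary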

You can do the same thing in C++ using valarray, and it will be equivalently slow. However, you can also write the loop explicitly:

for (int i = 0; i < N; ++i)
    a[i] = b[i] + c[i] + d[i];

This gets rid of the intermediate results and makes the code less sensitive to the speed of main memory.

Doing the equivalent thing in Python is possible, but Python's looping constructs are not as efficient. They do nice things like bounds checking, but sometimes it is faster to run with the safeties disengaged. Java, for example, does a fair amount of work to remove bounds checks, so if you had a sufficiently smart compiler/JIT, Python's loops could be fast. In practice, that has not worked out.
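For comparison, a minimal sketch of that explicit loop in pure Python (reusing b, c, and d from the NumPy sketch above): no large temporaries are created, but every iteration pays interpreter dispatch and bounds-checking overhead, so it typically runs far slower than the vectorized expression.

# Pure-Python equivalent of the C++ loop: no intermediate arrays, but the
# per-element cost of the interpreted loop usually dwarfs what it saves.
a = [0.0] * N
for i in range(N):
    a[i] = b[i] + c[i] + d[i]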

answered Sep 30 '22 by Damascus Steel