
Python with Numpy/Scipy vs. Pure C++ for Big Data Analysis [closed]

Working in Python on relatively small projects makes me appreciate the dynamically typed nature of the language (no need for declaration code to keep track of types), which often makes for a quicker and less painful development process. However, I feel that in much larger projects this may actually be a hindrance, as the code would run slower than, say, its equivalent in C++. But then again, using Numpy and/or Scipy with Python may get your code to run just as fast as a native C++ program (while the C++ code would sometimes take longer to develop).

I post this question after reading Justin Peel's comment on the thread "Is Python faster and lighter than C++?", where he states: "Also, people who speak of Python being slow for serious number crunching haven't used the Numpy and Scipy modules. Python is really taking off in scientific computing these days. Of course, the speed comes from using modules written in C or libraries written in Fortran, but that's the beauty of a scripting language in my opinion." Or, as S. Lott writes on the same thread regarding Python: "...Since it manages memory for me, I don't have to do any memory management, saving hours of chasing down core leaks." I also looked at a related Python/Numpy/C++ performance question, "Benchmarking (python vs. c++ using BLAS) and (numpy)", where J.F. Sebastian writes: "...There is no difference between C++ and numpy on my machine."

Both of these threads got me wondering: is there any real advantage conferred by knowing C++ for a Python programmer who uses Numpy/Scipy to produce software for analyzing 'big data', where performance is obviously of great importance (but code readability and development speed are also a must)?

Note: I'm especially interested in handling huge text files. Text files on the order of 100K-800K lines with multiple columns, where Python could take a good five minutes to analyze a file "only" 200K lines long.

asked Jul 31 '14 by warship


2 Answers

First off, if the bulk of your "work" comes from processing huge text files, that often means your only meaningful bottleneck is disk I/O speed, regardless of programming language.


As to the core question, it's probably too opinion-rich to "answer", but I can at least give you my own experience. I've been writing Python to do big data processing (weather and environmental data) for years. I have never once encountered significant performance problems due to the language.

Something that developers (myself included) tend to forget is that once the process runs fast enough, it's a waste of company resources to spend time making it run any faster. Python (using mature tools like pandas/scipy) runs fast enough to meet the requirements, and it's fast to develop, so for my money, it's a perfectly acceptable language for "big data" processing.
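As an illustration, here is a minimal sketch of the chunked, I/O-bound style of processing that pandas supports (pandas.read_csv and its chunksize parameter are real; the file name, separator, and column name below are hypothetical):

import pandas as pd

# Stream a large delimited text file in fixed-size chunks so memory use
# stays bounded; most of the wall-clock time goes to reading from disk.
total = 0.0
count = 0
for chunk in pd.read_csv("observations.txt", sep="\t", chunksize=100_000):
    total += chunk["value"].sum()
    count += len(chunk)

print("mean value:", total / count)

Processed this way, the file never has to fit in memory, and the Python-level work per chunk is negligible next to the read itself.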

answered Sep 30 '22 by Henry Keiter

The short answer is that for simple problems there should not be much difference. If you want to do anything complicated, you will quickly run into stark performance differences.

As a simple example, try adding three vectors together:

a = b + c + d

In Python, as I understand it, this generally adds b to c, adds the result to d, and then makes the variable a point to that final result. Each of those operations is fast on its own, since the per-element work is farmed out to compiled loops. However, if the vectors are large, the intermediate result cannot fit in cache, and moving it to and from main memory is slow.
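For illustration, here is a minimal NumPy sketch of the temporaries involved and one way to avoid the extra allocation (the out= parameter of NumPy ufuncs is real; the array size is arbitrary):

import numpy as np

N = 10_000_000  # arbitrary size, large enough that temporaries spill out of cache
b = np.random.rand(N)
c = np.random.rand(N)
d = np.random.rand(N)

# The one-liner allocates an intermediate array for (b + c), then a second
# array for the final result; each pass streams through main memory.
a = b + c + d

# Reusing a preallocated output buffer avoids the second temporary:
a = np.empty_like(b)
np.add(b, c, out=a)  # writes b + c directly into a
np.add(a, d, out=a)  # adds d in place, no extra temporary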

You can do the same thing in C++ using valarray, and it will be equivalently slow. However, you can also write the loop explicitly:

for (int i = 0; i < N; ++i)
    a[i] = b[i] + c[i] + d[i];

This gets rid of the intermediate results and makes the code less sensitive to the speed of main memory.

Doing the equivalent thing in Python is possible, but Python's looping constructs are not as efficient. They do nice things like bounds checking, but sometimes it is faster to run with the safeties disengaged. Java, for example, does a fair amount of work to remove bounds checks, so if you had a sufficiently smart compiler/JIT, Python's loops could be fast. In practice, that has not worked out.
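For comparison, a minimal sketch of that explicit loop in pure Python (reusing b, c, and d from the NumPy sketch above): no large temporaries are created, but every iteration pays interpreter dispatch and bounds-checking overhead, so it typically runs far slower than the vectorized expression.

# Pure-Python equivalent of the C++ loop: no intermediate arrays, but the
# per-element cost of the interpreted loop usually dwarfs what it saves.
a = [0.0] * N
for i in range(N):
    a[i] = b[i] + c[i] + d[i]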

answered Sep 30 '22 by Damascus Steel