I am currently using Pandas and Spark for data analysis. I found that Dask provides parallelized NumPy arrays and Pandas DataFrames.

Pandas is easy and intuitive for doing data analysis in Python, but I have difficulty handling multiple larger DataFrames in Pandas due to limited system memory.
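To make that concrete, here is a minimal sketch of the parallelized DataFrame idea; the file glob and the "key"/"value" columns are assumptions for illustration, not from any actual dataset:

```python
# Minimal sketch: dask.dataframe mirrors the pandas API lazily,
# splitting work across partitions instead of loading everything into RAM.
import dask.dataframe as dd

# Hypothetical glob of CSV files that together exceed system memory.
df = dd.read_csv("data-*.csv")

# Familiar pandas-style expression; nothing runs until .compute().
result = df.groupby("key")["value"].mean().compute()  # "key"/"value" are assumed columns
print(result)
```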
I got to know some details from http://dask.pydata.org/en/latest/spark.html, and I understood more about Dask from the links below:

https://www.continuum.io/blog/developer-blog/high-performance-hadoop-anaconda-and-dask-your-cluster
http://dask.pydata.org/en/latest/dataframe-overview.html
Limitations

Dask.DataFrame does not implement the entire Pandas interface, and users expecting this will be disappointed. Notably, dask.dataframe is limited in that the very large Pandas API is only partially implemented, and operations that require a shuffle, such as setting a new index from an unsorted column or joining on unsorted columns, are expensive.
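For example, a hedged sketch of that shuffle cost (the file names and the "user_id" column are hypothetical):

```python
# Sketch: setting an index from an unsorted column triggers a full shuffle.
import dask.dataframe as dd

df = dd.read_csv("events-*.csv")   # hypothetical input files
df = df.set_index("user_id")       # expensive when "user_id" is unsorted across partitions

# Once the index is sorted, index-aligned operations become cheap:
subset = df.loc[12345].compute()   # Dask knows which partition holds this key
```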
Thanks to the Dask developers; it seems like a very promising technology.
Overall, I can understand that Dask is simpler to use than Spark, and that it is as flexible as Pandas, with more power to compute in parallel across more CPUs.

I understand all the above facts about Dask.

So, roughly how much data (in terabytes) can be processed with Dask?
Spark is mature and all-inclusive. If you want a single project that does everything and you're already on Big Data hardware, then Spark is a safe bet, especially if your use cases are typical ETL + SQL and you're already using Scala. Dask is lighter weight and is easier to integrate into existing code and hardware.
Dask enables efficient parallel computation on a single machine by leveraging multi-core CPUs and streaming data efficiently from disk, and it can also run on a distributed cluster. Dask additionally allows the user to replace a cluster with a single-machine scheduler, which brings down the overhead.
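A small sketch of how the same lazy computation can target either a single-machine scheduler or a distributed cluster; the file names, the "value" column, and the scheduler address are all assumptions:

```python
# Sketch: one lazy graph, several possible schedulers.
import dask.dataframe as dd

df = dd.read_csv("events-*.csv")            # hypothetical inputs
task = df["value"].sum()                    # lazy graph; "value" is an assumed column

# Single-machine schedulers: very low overhead, no cluster required.
total = task.compute(scheduler="threads")   # or scheduler="processes"

# The same graph can run unchanged on a distributed cluster:
# from dask.distributed import Client
# client = Client("tcp://scheduler-address:8786")   # assumed address
# total = task.compute()
```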
Dask is used in Climate Science, Energy, Hydrology, Meteorology, and Satellite Imaging. Oceanographers produce massive simulated datasets of the Earth's oceans; before Dask, they couldn't even look at the outputs.
If you love Pandas and NumPy but sometimes struggle with data that will not fit into RAM, then Dask is definitely what you need. Dask supports the Pandas DataFrame and NumPy array data structures and can either run on your local computer or be scaled up to run on a cluster.
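For instance, a minimal dask.array sketch of NumPy-style work on an array larger than RAM; the shape and chunk sizes here are arbitrary assumptions:

```python
# Sketch: ~80 GB of float64 handled lazily in 100 chunks of ~800 MB each.
import dask.array as da

x = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))
y = (x - x.mean(axis=0)) / x.std(axis=0)   # familiar NumPy expression, still lazy
print(y[:5, :5].compute())                 # only the chunks needed are materialized
```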
You may want to read the Dask comparison to Apache Spark:
Apache Spark is an all-inclusive framework combining distributed computing, SQL queries, machine learning, and more that runs on the JVM and is commonly co-deployed with other Big Data frameworks like Hadoop. It was originally optimized for bulk data ingest and querying, common in data engineering and business analytics, but has since broadened out. Spark is typically used on small to medium sized clusters but also runs well on a single machine.
Dask is a parallel programming library that combines with the numeric Python ecosystem to provide parallel arrays, dataframes, machine learning, and custom algorithms. It is based on Python and the foundational C/Fortran stack. Dask was originally designed to complement other libraries with parallelism, particularly for numeric computing and advanced analytics, but has since broadened out. Dask is typically used on a single machine, but also runs well on a distributed cluster.
Generally Dask is smaller and lighter weight than Spark. This means that it has fewer features and instead is intended to be used in conjunction with other libraries, particularly those in the numeric Python ecosystem.
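Since the comparison mentions custom algorithms: beyond arrays and dataframes, Dask exposes its task graphs directly through dask.delayed. A toy sketch, where the load/process functions are made-up stand-ins for real work:

```python
# Sketch: building a custom parallel algorithm from plain Python functions.
from dask import delayed

@delayed
def load(i):
    return list(range(i * 10, (i + 1) * 10))   # stand-in for reading one partition

@delayed
def process(part):
    return sum(part)                           # stand-in for real per-partition work

# Compose an arbitrary task graph, then run it in parallel with one call.
parts = [process(load(i)) for i in range(8)]
total = delayed(sum)(parts)
print(total.compute())                         # executes the whole graph
```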