Java 8 MapReduce for distributed computing

It made me happy when I heard about parallelStream() in Java 8, which processes a stream on multiple cores and then gives back the combined result within a single JVM. No more lines of multithreading code. As far as I understand, this is valid for a single JVM only.

But what if I want to distribute the processing across different JVMs on a single host, or even across multiple hosts? Does Java 8 include any abstraction to simplify that?

In a tutorial at dreamsyssoft.com a list of users

private static List<User> users = Arrays.asList(
    new User(1, "Steve", "Vai", 40),
    new User(4, "Joe", "Smith", 32),
    new User(3, "Steve", "Johnson", 57),
    new User(9, "Mike", "Stevens", 18),
    new User(10, "George", "Armstrong", 24),
    new User(2, "Jim", "Smith", 40),
    new User(8, "Chuck", "Schneider", 34),
    new User(5, "Jorje", "Gonzales", 22),
    new User(6, "Jane", "Michaels", 47),
    new User(7, "Kim", "Berlie", 60)
);

is processed to get their average age like this:

double average = users.parallelStream().mapToInt(u -> u.age).average().getAsDouble();

In this case it is processed on a single host.

My question is: Can it be processed utilizing multiple hosts?

E.g. Host1 processes the five users below and returns average1:

new User(1, "Steve", "Vai", 40),
new User(4, "Joe", "Smith", 32),
new User(3, "Steve", "Johnson", 57),
new User(9, "Mike", "Stevens", 18),
new User(10, "George", "Armstrong", 24),

Similarly, Host2 processes the remaining five users below and returns average2:

new User(2, "Jim", "Smith", 40),
new User(8, "Chuck", "Schneider", 34),
new User(5, "Jorje", "Gonzales", 22),
new User(6, "Jane", "Michaels", 47),
new User(7, "Kim", "Berlie", 60)

Finally, Host3 computes the final result:

average = (average1 + average2) / 2
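
Just to make the arithmetic concrete, this is the split-and-combine done inside one JVM; the two hosts are the part I do not know how to express. Note that averaging the two averages only works because both halves hold five users:

List<User> firstHalf  = users.subList(0, 5);   // the five users for Host1
List<User> secondHalf = users.subList(5, 10);  // the five users for Host2

double average1 = firstHalf.parallelStream().mapToInt(u -> u.age).average().getAsDouble();  // 34.2
double average2 = secondHalf.parallelStream().mapToInt(u -> u.age).average().getAsDouble(); // 40.6

double average = (average1 + average2) / 2;    // 37.4, the same as averaging all ten ages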

With a distributed architecture this can be solved with remoting. Does Java 8 have some simpler way to solve the problem, with some abstraction for it?

I know frameworks like Hadoop, Akka and Promises solve this. I am talking about pure Java 8. Is there any documentation, or are there examples, for parallelStream() across multiple hosts?

abishkar bhattarai asked Dec 05 '13

1 Answer

Here is the list of features scheduled for Java 8 as of September 2013.

As you can see, there is no feature dedicated to standardizing distributed computing over a cluster. The closest you have is JEP 107, which builds on the Fork/Join framework from JDK 7 to leverage multi-core CPUs. In Java 8, you will be able to use lambda expressions to perform bulk operations on collections in parallel by dividing the task among multiple processors.

Java 8 is also scheduled to feature JEP 103, which will also build on Java 7 Fork/Join to sort arrays in parallel. Meanwhile, since Fork/Join is clearly a big deal, it evolves further with JEP 155.
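
For example, JEP 103 is what gives you Arrays.parallelSort, which splits the work over the common Fork/Join pool:

import java.util.Arrays;

int[] ages = {40, 32, 57, 18, 24, 40, 34, 22, 47, 60};
Arrays.parallelSort(ages);                 // parallel sort built on Fork/Join (falls back to a sequential sort for tiny arrays like this one)
System.out.println(Arrays.toString(ages)); // [18, 22, 24, 32, 34, 40, 40, 47, 57, 60]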

So there are no core Java 8 abstractions for distributed computing over a cluster--only over multiple cores. You will need to devise your own solution for real distributed computing using existing facilities.
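
If you really wanted to roll your own on top of plain Java, a crude sketch using RMI could look something like this. The AverageService interface, the "avg" binding name and the host names are all made up for illustration, and User plus the lists you pass would need to be Serializable:

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.util.ArrayList;
import java.util.List;

// Hypothetical service each worker host exports in its RMI registry under the name "avg".
interface AverageService extends Remote {
    double averageAge(List<User> part) throws RemoteException;
}

// Coordinator: split the list, send one half to each host, combine the partial averages.
static double distributedAverage(List<User> users) throws Exception {
    List<User> first  = new ArrayList<>(users.subList(0, users.size() / 2));
    List<User> second = new ArrayList<>(users.subList(users.size() / 2, users.size()));

    AverageService host1 = (AverageService) LocateRegistry.getRegistry("host1").lookup("avg");
    AverageService host2 = (AverageService) LocateRegistry.getRegistry("host2").lookup("avg");

    double average1 = host1.averageAge(first);
    double average2 = host2.averageAge(second);

    // Averaging the two averages is only correct because both halves are the same size.
    return (average1 + average2) / 2;
}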

As disappointing as that may be, I would point out that there are still wonderful open-source third party abstractions over Hadoop out there like Cascalog and Apache Spark. Spark in particular lets you perform operations on your data in a distributed way through the RDD abstraction, which makes it feel like your data is just in a fancy array.
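
To give a flavor of Spark, computing the same average age as a distributed job looks roughly like this. It is only a sketch: it assumes a cluster master at spark://master:7077, the Spark Java API on the classpath, and a Serializable User class.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf().setAppName("average-age").setMaster("spark://master:7077");
JavaSparkContext sc = new JavaSparkContext(conf);

// parallelize() turns the local list into an RDD whose partitions are spread over the cluster;
// the mapping and the mean are computed on whichever hosts hold those partitions.
double average = sc.parallelize(users)
                   .mapToDouble(u -> u.age)
                   .mean();

sc.stop();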

But you will have to wait for such things in core Java.

Vidya answered Oct 04 '22