
Can we define a set of DSL operations in Scala that run in parallel with each other, like pipeline processing in Linux?

Forgive my poor English; I will do my best to express my question.

Suppose I want to process a large text: filter its content by a keyword, convert the matches to lowercase, and print them to standard output. As we all know, we can do this with a pipeline in a Linux Bash script:

cat article.txt | grep "I" | tr "I" "i" > /dev/stdout

where cat article.txt, grep "I", and tr "I" "i" > /dev/stdout run in parallel as separate processes.

In Scala, we would probably write it like this:

//or read from a text file , e.g. article.txt 
val strList = List("I", "am", "a" , "student", ".", "I", "come", "from", "China", ".","I","love","peace")  
strList.filter( _ == "I").map(_.toLowerCase).foreach(println)

My question is: how can we make filter, map, and foreach run in parallel?

thx

爱国者 Avatar asked Jan 17 '12 08:01



2 Answers

Parallel collections were added in Scala 2.9. To parallelize the loop, all you have to do is convert the collection by calling its par method.

Your code would look like this:

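// Note: since Scala 2.13, parallel collections live in the separate scala-parallel-collections
// module; with that dependency, .par also needs: import scala.collection.parallel.CollectionConverters._
val strList = List("I", "am", "a", "student", ".", "I", "come", "from", "China", ".", "I", "love", "peace")  // or read from a text file, e.g. article.txt
strList.par.filter(_ == "I").map(_.toLowerCase).foreach(println)  // elements are processed in parallel; print order is not guaranteed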
tstenner Avatar answered Sep 18 '22 08:09



tstenner's solution is probably the most efficient one in your situation, since it can achieve a high degree of parallelism (in theory, every single item could be processed in parallel). However, your bash example uses only pipeline parallelism, and this kind of parallelism is unfortunately not directly supported by Scala's parallel collections.

To achieve pipeline parallelism your operators (filter, map, foreach) have to be executed by different threads, e.g., by using Actors.
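
As a rough illustration of the idea, here is a minimal sketch that uses plain threads and blocking queues instead of Actors (a swapped-in technique, chosen only because it needs nothing beyond the JDK). The names PipelineSketch, stage, and run are hypothetical, and None is an assumed end-of-stream marker playing the role of EOF on a pipe:

```scala
import java.util.concurrent.LinkedBlockingQueue

object PipelineSketch {
  // Hypothetical helper: runs `body` on its own named thread and returns the started thread.
  private def stage(name: String)(body: => Unit): Thread = {
    val t = new Thread(new Runnable { def run(): Unit = body }, name)
    t.start()
    t
  }

  // Runs filter -> map -> sink as three concurrent stages connected by FIFO queues.
  // Blocks until every stage has finished.
  def run(words: List[String])(sink: String => Unit): Unit = {
    // Assumption: None signals end of stream to the next stage.
    val filtered = new LinkedBlockingQueue[Option[String]]()
    val lowered  = new LinkedBlockingQueue[Option[String]]()

    val filterStage = stage("filter") {
      words.foreach(w => if (w == "I") filtered.put(Some(w)))
      filtered.put(None)
    }
    val mapStage = stage("map") {
      Iterator.continually(filtered.take())
        .takeWhile(_.isDefined)
        .foreach(w => lowered.put(w.map(_.toLowerCase)))
      lowered.put(None)
    }
    val sinkStage = stage("sink") {
      Iterator.continually(lowered.take())
        .takeWhile(_.isDefined)
        .foreach(w => sink(w.get))
    }

    List(filterStage, mapStage, sinkStage).foreach(_.join())
  }
}
```

Calling PipelineSketch.run(strList)(println) would then behave like the bash pipeline: each stage can start on an element while the previous stage is already handling the next one, and because every stage is a single thread reading from a FIFO queue, the original order is preserved.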

In general, I think it would be a nice feature for Scala to have a simple API for that. But for your example, I doubt that pipeline parallelism would speed up your execution much. If you only use very simple filter and map operations, I assume that the communication overhead (for FIFOs / Actor mailboxes) would consume the whole speedup of the parallel execution.

Stefan Endrullis Avatar answered Sep 18 '22 08:09
