I have a Scala program that works fine on a single computer. However, I'd like to get it working on multiple nodes.
The start of the program looks like this:
import scala.io.Source

val filename = Source.fromFile("file://...")
val lines = filename.getLines
val linesArray = lines.map(x => x.split(" ").slice(0, 3))
val mapAsStrings = linesArray.toList.groupBy(_(0)).mapValues(x => x.map(_.tail))
val mappedUsers = mapAsStrings map {case (k,v) => k -> v.map(x => x(0) -> x(1).toInt).toMap}
When trying to use Spark to run the program, I know I need a SparkContext and a SparkConf object, and that they are used to create the RDD.
So now I have:
import org.apache.spark.{SparkConf, SparkContext}

class myApp(filePath: String) {
private val conf = new SparkConf().setAppName("myApp")
private val sc = new SparkContext(conf)
private val inputData = sc.textFile(filePath)
inputData is now an RDD; its equivalent in the previous program was filename (I assume). For an RDD the methods are different. So, what is the equivalent of getLines? Or is there no equivalent? I'm having a hard time visualising what the RDD gives me to work with, e.g. is inputData an Array[String] or something else?
Thanks
The documentation seems to answer this directly:
def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
So textFile is the equivalent of both fromFile and getLines, and returns an RDD where each entry is a line from the file. That makes inputData the equivalent of lines (not linesArray): each element is one raw, unsplit line.
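If it helps to see what that RDD actually gives you, you can pull a few elements back to the driver with take. A minimal sketch, assuming sc is an existing SparkContext and using a hypothetical path:

// Assumption: sc is an already-created SparkContext; the path below is only a placeholder
val inputData = sc.textFile("file:///path/to/input.txt")
// take(3) is an action: it returns the first three lines to the driver as an Array[String]
inputData.take(3).foreach(println)
// inputData itself is not an Array[String]; it is a distributed collection of String lines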
An RDD is a distributed collection, so conceptually it's not very different from a List, an Array or a Seq: it provides functional operations that let you transform the collection of elements. The main difference from the Scala collections is that an RDD is inherently distributed. Given a Spark cluster, when an RDD is created, the collection it represents is partitioned over some nodes of that cluster.
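As a quick illustration of that 'distributed collection' idea, here is a sketch assuming a live SparkContext named sc; the data is just an example:

// A plain local Scala collection...
val local = Seq("a 1", "b 2", "c 3")
// ...becomes a distributed one: Spark splits it into partitions spread over the cluster
val distributed = sc.parallelize(local)
// getNumPartitions shows into how many pieces the collection was split
println(distributed.getNumPartitions)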
sc.textFile(...) returns an RDD[String]. Given a distributed file system, each worker will load a piece of that file into a 'partition', where further transformations and actions (in Spark lingo) can take place.
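To make the transformation/action distinction concrete, here is a small sketch (again assuming sc is your SparkContext and the path is hypothetical):

// Hypothetical path; sc is assumed to be an existing SparkContext
val fileRdd = sc.textFile("file:///path/to/input.txt")
// map is a transformation: it is lazy and only records what should happen to each line
val firstTokens = fileRdd.map(line => line.split(" ")(0))
// count is an action: it triggers the distributed job across the partitions and returns a result
println(firstTokens.count())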
Given that the Spark API quite closely resembles the Scala collections API, once you have an RDD, applying functional transformations to it is quite similar to what you would do with a Scala collection.
Your Scala program can therefore be easily ported to Spark:
//val filename = Source.fromFile("file://...")
//val lines = filename.getLines
val rdd = sc.textFile("file://...")
//val linesArray = lines.map(x => x.split(" ").slice(0, 3))
val lines = rdd.map(x => x.split(" ").slice(0, 3))
//val mapAsStrings = linesArray.toList.groupBy(_(0)).mapValues(x => x.map(_.tail))
val mappedLines = lines.groupBy(_(0)).mapValues(x => x.map(_.tail))
//val mappedUsers = mapAsStrings map {case (k,v) => k -> v.map(x => x(0) -> x(1).toInt).toMap}
val mappedUsers = mappedLines.mapValues{v => v.map(x => x(0) -> x(1).toInt).toMap}
One important difference is that there is no associative 'Map' collection as an RDD. Therefore, mappedUsers is a collection of tuples of type (String, Map[String, Int]).
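If you eventually need an actual Scala Map on the driver (and the result is small enough to fit in driver memory), you can collect the pair RDD; a sketch of one way to do that:

// collectAsMap is an action on pair RDDs: it ships all (key, value) pairs back to the driver
// Assumption: the collected result comfortably fits in driver memory
val localUsers: scala.collection.Map[String, Map[String, Int]] = mappedUsers.collectAsMap()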