
Splitting strings in Apache Spark using Scala

I have a dataset, which contains lines in the format (tab separated):

Title<\t>Text

Now for every word in Text, I want to create a (Word,Title) pair. For instance:

ABC      Hello World

gives me

(Hello, ABC)
(World, ABC)

Using Scala, I wrote the following:

val file = sc.textFile("s3n://file.txt")
val title = file.map(line => line.split("\t")(0))
val wordtitle = file.map(line => (line.split("\t")(1).split(" ").map(word => (word, line.split("\t")(0)))))

But this gives me the following output:

[Lscala.Tuple2;@2204b589
[Lscala.Tuple2;@632a46d1
[Lscala.Tuple2;@6c8f7633
[Lscala.Tuple2;@3e9945f3
[Lscala.Tuple2;@40bf74a0
[Lscala.Tuple2;@5981d595
[Lscala.Tuple2;@5aed571b
[Lscala.Tuple2;@13f1dc40
[Lscala.Tuple2;@6bb2f7fa
[Lscala.Tuple2;@32b67553
[Lscala.Tuple2;@68d0b627
[Lscala.Tuple2;@8493285

How do I solve this?
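
Side note: the output above is just the default toString of a Java array. wordtitle here is an RDD[Array[(String, String)]], so each element is an array of pairs rather than the pairs themselves. A minimal sketch for inspecting the actual contents, assuming the same wordtitle RDD as above:

// collect to the driver and render each array of pairs as readable text
wordtitle.collect().foreach( arr => println( arr.mkString( ", " ) ) )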

Further details

What I want to achieve is to count the number of Words that occur in a Text for a particular Title.

The subsequent code that I have written is:

val wordcountperfile = file.map(line => (line.split("\t")(1).split(" ").flatMap(word => word), line.split("\t")(0))).map(word => (word, 1)).reduceByKey(_ + _)

But it does not work. Please feel free to give your inputs on this. Thanks!

AngryPanda asked Apr 23 '15

1 Answer

So... In Spark you work with a distributed data structure called an RDD (Resilient Distributed Dataset). RDDs provide functionality similar to Scala's collection types.

val fileRdd = sc.textFile("s3n://file.txt")
// RDD[ String ]

val splitRdd = fileRdd.map( line => line.split("\t") )
// RDD[ Array[ String ] ]

val yourRdd = splitRdd.flatMap( arr => {
  val title = arr( 0 )
  val text = arr( 1 )
  val words = text.split( " " )
  words.map( word => ( word, title ) )
} )
// RDD[ ( String, String ) ]

// Now, if you want to print this...
yourRdd.foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )

// if you want to count (this count is for non-unique words):
val countRdd = yourRdd
  .groupBy( { case ( word, title ) => title } )  // group by title
  .map( { case ( title, iter ) => ( title, iter.size ) } ) // count for every title
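
If the goal from the question is a per-word count within each title, rather than the total word count per title shown above, one possible sketch building on yourRdd uses reduceByKey (the name wordCountPerTitle is just illustrative):

// key every pair by ( title, word ) and sum the occurrences
val wordCountPerTitle = yourRdd
  .map( { case ( word, title ) => ( ( title, word ), 1 ) } )
  .reduceByKey( _ + _ )
// RDD[ ( ( String, String ), Int ) ]

// print the counts on the driver
wordCountPerTitle.collect().foreach( { case ( ( title, word ), count ) => println( s"$title\t$word\t$count" ) } )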
sarveshseri answered Oct 08 '22