I have a dataset that contains lines in the following format (tab separated):
Title<\t>Text
Now, for every word in Text, I want to create a (Word, Title) pair.
For instance:
ABC<\t>Hello World
gives me
(Hello, ABC)
(World, ABC)
Using Scala, I wrote the following:
val file = sc.textFile("s3n://file.txt")
val title = file.map(line => line.split("\t")(0))
val wordtitle = file.map(line => (line.split("\t")(1).split(" ").map(word => (word, line.split("\t")(0)))))
But this gives me the following output:
[Lscala.Tuple2;@2204b589
[Lscala.Tuple2;@632a46d1
[Lscala.Tuple2;@6c8f7633
[Lscala.Tuple2;@3e9945f3
[Lscala.Tuple2;@40bf74a0
[Lscala.Tuple2;@5981d595
[Lscala.Tuple2;@5aed571b
[Lscala.Tuple2;@13f1dc40
[Lscala.Tuple2;@6bb2f7fa
[Lscala.Tuple2;@32b67553
[Lscala.Tuple2;@68d0b627
[Lscala.Tuple2;@8493285
How do I solve this?
Going further
What I want to achieve is to count the number of Words that occur in a Text for a particular Title.
The subsequent code that I have written is:
val wordcountperfile = file.map(line => (line.split("\t")(1).split(" ").flatMap(word => word), line.split("\t")(0))).map(word => (word, 1)).reduceByKey(_ + _)
But it does not work. Please feel free to share your thoughts on this. Thanks!
So... in Spark you work with a distributed data structure called an RDD (Resilient Distributed Dataset), which provides functionality similar to Scala's collection types. The output you are seeing, [Lscala.Tuple2;@2204b589, is an Array's default toString: your map turns each line into a whole Array[(String, String)], so you end up with an RDD[Array[(String, String)]] instead of an RDD[(String, String)]. Use flatMap to flatten those arrays:
val fileRdd = sc.textFile("s3n://file.txt")
// RDD[ String ]
val splitRdd = fileRdd.map( line => line.split("\t") )
// RDD[ Array[ String ] ]
val yourRdd = splitRdd.flatMap( arr => {
val title = arr( 0 )
val text = arr( 1 )
val words = text.split( " " )
words.map( word => ( word, title ) )
} )
// RDD[ ( String, String ) ]
// Now, if you want to print this...
yourRdd.foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )
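// Note: on a cluster, println inside foreach runs on the executors, so the
// output ends up in the executor logs, not on your driver console. For a
// small RDD you can bring the data to the driver first:
yourRdd.collect().foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )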
// if you want a word count per title ( this counts non-unique words ):
val countRdd = yourRdd
.groupBy( { case ( word, title ) => title } ) // group by title
.map( { case ( title, iter ) => ( title, iter.size ) } ) // count for every title
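If, as in your follow-up, you want the count per (word, title) pair rather than the total number of words per title, here is a minimal sketch building on yourRdd above (the name pairCountRdd is just illustrative). reduceByKey is also preferable to groupBy for counting, since it combines partial counts on each partition before shuffling:
val pairCountRdd = yourRdd
  .map( { case ( word, title ) => ( ( word, title ), 1 ) } ) // key by the (word, title) pair
  .reduceByKey( _ + _ ) // sum the 1s for every pair
// RDD[ ( ( String, String ), Int ) ]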