I have a dataset that contains lines in the following format (tab separated):
Title<\t>Text
Now, for every word in Text, I want to create a (Word, Title) pair.
For instance:
ABC<\t>Hello World
gives me
(Hello, ABC)
(World, ABC)
Using Scala, I wrote the following:
val file = sc.textFile("s3n://file.txt")
val title = file.map(line => line.split("\t")(0))
val wordtitle = file.map(line => (line.split("\t")(1).split(" ").map(word => (word, line.split("\t")(0)))))
But this gives me the following output:
[Lscala.Tuple2;@2204b589
[Lscala.Tuple2;@632a46d1
[Lscala.Tuple2;@6c8f7633
[Lscala.Tuple2;@3e9945f3
[Lscala.Tuple2;@40bf74a0
[Lscala.Tuple2;@5981d595
[Lscala.Tuple2;@5aed571b
[Lscala.Tuple2;@13f1dc40
[Lscala.Tuple2;@6bb2f7fa
[Lscala.Tuple2;@32b67553
[Lscala.Tuple2;@68d0b627
[Lscala.Tuple2;@8493285
How do I solve this?
Going further
What I want to achieve is to count the number of Words that occur in a Text for a particular Title.
The subsequent code that I have written is:
val wordcountperfile = file.map(line => (line.split("\t")(1).split(" ").flatMap(word => word), line.split("\t")(0))).map(word => (word, 1)).reduceByKey(_ + _)
But it does not work. Please feel free to share your thoughts on this. Thanks!
So... in Spark you work with a distributed data structure called an RDD (Resilient Distributed Dataset), which provides functionality similar to Scala's collection types. The output you are seeing, [Lscala.Tuple2;@2204b589, is an Array's default toString: your map turns each line into a whole Array[(String, String)], so you end up with an RDD[Array[(String, String)]] instead of an RDD[(String, String)]. Use flatMap to flatten those arrays:
val fileRdd = sc.textFile("s3n://file.txt")
// RDD[ String ]
val splitRdd = fileRdd.map( line => line.split("\t") )
// RDD[ Array[ String ] ]
val yourRdd = splitRdd.flatMap( arr => {
val title = arr( 0 )
val text = arr( 1 )
val words = text.split( " " )
words.map( word => ( word, title ) )
} )
// RDD[ ( String, String ) ]
// Now, if you want to print this...
yourRdd.foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )
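// Note: on a cluster, println inside foreach runs on the executors, so the
// output ends up in the executor logs, not on your driver console. For a
// small RDD you can bring the data to the driver first:
yourRdd.collect().foreach( { case ( word, title ) => println( s"{ $word, $title }" ) } )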
// if you want a word count per title ( this counts non-unique words ):
val countRdd = yourRdd
.groupBy( { case ( word, title ) => title } ) // group by title
.map( { case ( title, iter ) => ( title, iter.size ) } ) // count for every title
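If, as in your follow-up, you want the count per (word, title) pair rather than the total number of words per title, here is a minimal sketch building on yourRdd above (the name pairCountRdd is just illustrative). reduceByKey is also preferable to groupBy for counting, since it combines partial counts on each partition before shuffling:
val pairCountRdd = yourRdd
  .map( { case ( word, title ) => ( ( word, title ), 1 ) } ) // key by the (word, title) pair
  .reduceByKey( _ + _ ) // sum the 1s for every pair
// RDD[ ( ( String, String ), Int ) ]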