
Import TSV file in Spark

I am new to Spark, so forgive me for asking a basic question. I'm trying to import my TSV file into Spark, but I'm not sure if it's working.

val lines = sc.textFile("/home/cloudera/Desktop/Test/test.tsv")
val split_lines = lines.map(_.split("\t"))
split_lines.first()

Everything seems to be working fine. Is there a way I can check whether the TSV file has loaded properly?

SAMPLE DATA: (fields are separated by tabs, shown here as spaces)

hastag 200904 24 blackcat
hastag 200908 1 oaddisco
hastag 200904 1 blah
hastag 200910 3 mydda
asked Jun 01 '14 by user3657361


1 Answer

So far, your code looks good to me. If you print that first line to the console, do you see the expected data?
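
For instance, a couple of quick checks will tell you whether the file actually loaded (a minimal sketch, reusing the lines and split_lines variables from your question and assuming the 4-column layout of your sample data):

// how many lines were read from the file
lines.count
// print the first few parsed records
split_lines.take(5).foreach(a => println(a.mkString(" | ")))
// rows that did not split into exactly 4 fields
split_lines.filter(_.length != 4).count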

To explore the Spark API, the best thing to do is to use the Spark shell, a Scala REPL enriched with Spark specifics that creates a default SparkContext for you.

It will let you explore your data much more easily.

Here's an example loading a CSV file of ~65k lines. A similar use case to what you're doing, I guess:

$><spark_dir>/bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
      /_/

scala> val lines=sc.textFile("/home/user/playground/ts-data.csv")
lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> val csv=lines.map(_.split(";"))
csv: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at <console>:14

scala> csv.count
(... spark processing ...)
res0: Long = 67538


// let's have a look at the first record
scala> csv.first
14/06/01 12:22:17 INFO SparkContext: Starting job: first at <console>:17
14/06/01 12:22:17 INFO DAGScheduler: Got job 1 (first at <console>:17) with 1 output partitions (allowLocal=true)
14/06/01 12:22:17 INFO DAGScheduler: Final stage: Stage 1(first at <console>:17)
14/06/01 12:22:17 INFO DAGScheduler: Parents of final stage: List()
14/06/01 12:22:17 INFO DAGScheduler: Missing parents: List()
14/06/01 12:22:17 INFO DAGScheduler: Computing the requested partition locally
14/06/01 12:22:17 INFO HadoopRDD: Input split: file:/home/user/playground/ts-data.csv:0+1932934
14/06/01 12:22:17 INFO SparkContext: Job finished: first at <console>:17, took 0.003210457 s
res1: Array[String] = Array(20140127, 0000df, d063b4, ***, ***-Service,app180000m,49)

// groupby id - count unique
scala> csv.groupBy(_(4)).count
(... Spark processing ...)
res2: Long = 37668

// records per day
csv.map(record => record(0)->1).reduceByKey(_+_).collect
(... more Spark processing ...)
res8: Array[(String, Int)] = Array((20140117,1854), (20140120,2028), (20140124,3398), (20140131,6084), (20140122,5076), (20140128,8310), (20140123,8476), (20140127,1932), (20140130,8482), (20140129,8488), (20140118,5100), (20140109,3488), (20140110,4822))

* Edit, using the data added to the question *

val rawData="""hastag 200904 24 blackcat
hastag 200908 1 oaddisco
hastag 200904 1 blah
hastag 200910 3 mydda"""
// Split the raw string into lines and turn them into an RDD
val data = rawData.split("\n")
val rdd = sc.parallelize(data)
// Split each line on spaces and group by the second field
val byId = rdd.map(_.split(" ")).groupBy(_(1))
byId.count
res11: Long = 3
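
Applied to your actual file, the same pattern just needs a tab as the separator (a sketch, assuming the path from your question and the 4-column layout of the sample data):

val tsv = sc.textFile("/home/cloudera/Desktop/Test/test.tsv")
val fields = tsv.map(_.split("\t"))
// total number of records read
fields.count
// the first parsed record, fields joined with |
fields.first.mkString(" | ")
// number of distinct values in the second column
fields.groupBy(_(1)).count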
answered Sep 19 '22 by maasg