
Import TSV file in Spark

I am new to Spark, so forgive me for asking a basic question. I'm trying to import my TSV file into Spark, but I'm not sure if it's working.

val lines = sc.textFile("/home/cloudera/Desktop/Test/test.tsv")
val split_lines = lines.map(_.split("\t"))
split_lines.first()

Everything seems to be working fine. Is there a way I can check whether the TSV file has loaded properly?

SAMPLE DATA: (fields are separated by tabs, shown here as spaces)

hastag 200904 24 blackcat
hastag 200908 1 oaddisco
hastag 200904 1 blah
hastag 200910 3 mydda
asked Jun 01 '14 by user3657361


1 Answer

So far, your code looks good to me. If you print that first line to the console, do you see the expected data?
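
For instance, a couple of quick checks will tell you whether the file actually loaded (a minimal sketch, reusing the lines and split_lines variables from your question and assuming the 4-column layout of your sample data):

// how many lines were read from the file
lines.count
// print the first few parsed records
split_lines.take(5).foreach(a => println(a.mkString(" | ")))
// rows that did not split into exactly 4 fields
split_lines.filter(_.length != 4).count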

To explore the Spark API, the best thing to do is to use the Spark shell, a Scala REPL enriched with Spark specifics that creates a default SparkContext for you.

It will let you explore your data much more easily.

Here's an example loading a CSV file of ~65k lines. A similar use case to what you're doing, I guess:

$><spark_dir>/bin/spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.0.0-SNAPSHOT
      /_/

scala> val lines=sc.textFile("/home/user/playground/ts-data.csv")
lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> val csv=lines.map(_.split(";"))
csv: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at <console>:14

scala> csv.count
(... spark processing ...)
res0: Long = 67538


// let's have a look at the first record
scala> csv.first
14/06/01 12:22:17 INFO SparkContext: Starting job: first at <console>:17
14/06/01 12:22:17 INFO DAGScheduler: Got job 1 (first at <console>:17) with 1 output partitions (allowLocal=true)
14/06/01 12:22:17 INFO DAGScheduler: Final stage: Stage 1(first at <console>:17)
14/06/01 12:22:17 INFO DAGScheduler: Parents of final stage: List()
14/06/01 12:22:17 INFO DAGScheduler: Missing parents: List()
14/06/01 12:22:17 INFO DAGScheduler: Computing the requested partition locally
14/06/01 12:22:17 INFO HadoopRDD: Input split: file:/home/user/playground/ts-data.csv:0+1932934
14/06/01 12:22:17 INFO SparkContext: Job finished: first at <console>:17, took 0.003210457 s
res1: Array[String] = Array(20140127, 0000df, d063b4, ***, ***-Service,app180000m,49)

// groupby id - count unique
scala> csv.groupBy(_(4)).count
(... Spark processing ...)
res2: Long = 37668

// records per day
csv.map(record => record(0)->1).reduceByKey(_+_).collect
(... more Spark processing ...)
res8: Array[(String, Int)] = Array((20140117,1854), (20140120,2028), (20140124,3398), (20140131,6084), (20140122,5076), (20140128,8310), (20140123,8476), (20140127,1932), (20140130,8482), (20140129,8488), (20140118,5100), (20140109,3488), (20140110,4822))

* Edit, using the data added to the question *

val rawData="""hastag 200904 24 blackcat
hastag 200908 1 oaddisco
hastag 200904 1 blah
hastag 200910 3 mydda"""
// Split the raw string into lines and turn them into an RDD
val data = rawData.split("\n")
val rdd = sc.parallelize(data)
// Split each line on spaces and group by the second field
val byId = rdd.map(_.split(" ")).groupBy(_(1))
byId.count
res11: Long = 3
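
Applied to your actual file, the same pattern just needs a tab as the separator (a sketch, assuming the path from your question and the 4-column layout of the sample data):

val tsv = sc.textFile("/home/cloudera/Desktop/Test/test.tsv")
val fields = tsv.map(_.split("\t"))
// total number of records read
fields.count
// the first parsed record, fields joined with |
fields.first.mkString(" | ")
// number of distinct values in the second column
fields.groupBy(_(1)).count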
answered Sep 19 '22 by maasg