 

How to read a whole file into one string

I want to read a JSON or XML file in PySpark. My file is split across multiple lines when I read it with

rdd = sc.textFile(path_to_json_or_xml_file)

Input

{
  "employees": [
    {
      "firstName": "John",
      "lastName": "Doe"
    },
    {
      "firstName": "Anna"
    }
  ]
}

Input is spread across multiple lines.

Expected output: {"employees":[{"firstName":"John", ...}]}

How can I get the complete file as a single string using PySpark?

asked May 25 '15 by Kumar



4 Answers

If your data is not laid out one record per line as textFile expects, then use wholeTextFiles.

This will give you the whole file so that you can parse it down into whatever format you would like.
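For example, a minimal PySpark sketch (the file path, the json import, and the parsing step are my assumptions, not part of the original answer; sc is your existing SparkContext):

import json

# wholeTextFiles yields (path, content) pairs; content is the full file as one string
rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.json")
whole_file = rdd.values().first()   # the entire file as a single string
data = json.loads(whole_file)       # parse it into whatever structure you need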

answered Sep 28 '22 by Justin Pihony


This is how you would do it in Scala:

val rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
rdd.collect().foreach(t => println(t._2))
answered Sep 28 '22 by Animesh Raj Jha


Python way

rdd = spark.sparkContext.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
json = rdd.collect()[0][1]  # (path, content) pairs; [0][1] is the contents of the first file
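If the goal is to parse the string afterwards, a possible follow-up (the json import and the rename are my additions; the variable is renamed so it does not shadow Python's json module):

import json

file_content = rdd.collect()[0][1]   # whole contents of the first file as one string
data = json.loads(file_content)      # now a regular Python dict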
answered Sep 28 '22 by ankursingh1000


There are 3 ways (I invented the 3rd one; the first two are standard built-in Spark functions). The solutions here are in PySpark:

textFile, wholeTextFiles, and a "labeled" textFile (key = file path, value = one line from that file; this is a kind of mix between the two standard ways to read files).

1.) textFile

input: rdd = sc.textFile('/home/folder_with_text_files/input_file')

output: one entry per line of the file, i.e. [line1, line2, ...]

2.) wholeTextFiles

input: rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')

output: array of tuples; the first item is the "key" with the file path, the second item is that file's entire contents, i.e.

[(u'file:/home/folder_with_text_files/file1.txt', u'file1_contents'), (u'file:/home/folder_with_text_files/file2.txt', u'file2_contents'), ...]

3.) "Labeled" textFile

input:

import glob
from pyspark import SparkContext

sc = SparkContext("local", "example")  # if running locally

Data_File = "/home/folder_with_text_files"  # folder containing the input files
Spark_Full = sc.emptyRDD()                  # start empty and union each file into it

for filename in glob.glob(Data_File + "/*"):
    # key every line of this file by its filename
    Spark_Full += sc.textFile(filename).keyBy(lambda x: filename)

output: array with each entry containing a tuple of the filename as key and each line of the file as value. (Technically, using this method you can also use a different key besides the actual file path, perhaps a hashed representation to save memory.) i.e.

[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
 ('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
 ('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
 ('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
  ...]

You can also recombine it into a list of lines per file:

Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()

[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
 ('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]

Or recombine entire files back into single strings (in this example the result is the same as what you get from wholeTextFiles, but with the "file:" prefix stripped from the file path):

Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
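Since the original question was about JSON, a possible last step (the json import and the mapValues call are my additions, assuming each file holds one JSON document):

import json

# parse each recombined file string into a Python object, keeping the filename as key
parsed = Spark_Full.groupByKey() \
    .mapValues(lambda lines: json.loads(' '.join(lines))) \
    .collect()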

answered Sep 28 '22 by abby sobh