I want to read a JSON or XML file in PySpark. My file is split across multiple lines in
rdd = sc.textFile(json or xml)
Input
{
" employees":
[
{
"firstName":"John",
"lastName":"Doe"
},
{
"firstName":"Anna"
]
}
Input is spread across multiple lines.
Expected output: {"employees":[{"firstName":"John",......]}
How can I get the complete file on a single line using PySpark?
Java's FileUtils.readFileToString() (from Apache Commons IO) is an excellent way to read a whole file into a String in a single statement.
Method 1: Read a file line by line using readlines(). readlines() reads all the lines in a single go and returns them as a list of strings, one per line. This is fine for small files, since it reads the whole file content into memory and then splits it into separate lines.
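For comparison, plain Python can do both of these without Spark. A minimal sketch (the path here is hypothetical):

with open("/tmp/test.json") as f:
    content = f.read()        # the whole file as one string
with open("/tmp/test.json") as f:
    lines = f.readlines()     # list of strings, one per line (newlines kept)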
If your data is not formed on one line as textFile expects, then use wholeTextFiles. This will give you the whole file so that you can parse it down into whatever format you would like.
This is how you would do it in Scala:
val rdd = sc.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
rdd.collect.foreach(t => println(t._2))
And the Python way:
rdd = spark.sparkContext.wholeTextFiles("hdfs://nameservice1/user/me/test.txt")
json_str = rdd.collect()[0][1]  # named json_str so it does not shadow the json module
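From there you can parse the whole-file string with Python's json module and re-serialize it onto a single line, which is essentially the expected output from the question. A minimal sketch, assuming the file holds the valid JSON shown above:

import json

data = json.loads(json_str)                          # parse the whole-file string
one_line = json.dumps(data, separators=(',', ':'))   # compact, single-line form
print(one_line)   # e.g. {"employees":[{"firstName":"John","lastName":"Doe"},...]}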
There are three ways (I invented the third one; the first two are standard built-in Spark functions). The solutions here are in PySpark:
textFile, wholeTextFiles, and a "labeled" textFile (key = file path, value = one line from that file; this is kind of a mix between the two given ways to parse files).
1.) textFile
input:
rdd = sc.textFile('/home/folder_with_text_files/input_file')
output: array containing one line of the file per entry, i.e. [line1, line2, ...]
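For the multi-line JSON from the question this is exactly the problem: each physical line becomes its own element, so no single element is parseable on its own. A quick sketch, assuming an existing SparkContext sc as in the snippets above:

rdd = sc.textFile('/home/folder_with_text_files/input_file')
print(rdd.collect())
# e.g. ['{', '"employees":', '[', '{', '"firstName":"John",', ...]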
2.) wholeTextFiles
input:
rdd = sc.wholeTextFiles('/home/folder_with_text_files/*')
output: array of tuples; the first item is the "key" with the file path, the second item contains one file's entire contents, i.e.
[(u'file:/home/folder_with_text_files/file1.txt', u'file1_contents'), (u'file:/home/folder_with_text_files/file2.txt', u'file2_contents'), ...]
3.) "Labeled" textFile
input:
import glob
from pyspark import SparkContext

SparkContext.stop(sc)                   # stop any running context first
sc = SparkContext("local", "example")   # if running locally

Spark_Full = sc.emptyRDD()              # start empty so we can union into it
for filename in glob.glob(Data_File + "/*"):   # Data_File is the directory of input files
    # the default argument pins filename at definition time; a bare
    # lambda x: filename would late-bind to the loop's final value
    Spark_Full += sc.textFile(filename).keyBy(lambda x, f=filename: f)
output: array with each entry containing a tuple using the filename as key, with value = each line of the file. (Technically, using this method you can also use a different key besides the actual file path, perhaps a hashed representation to save memory.) i.e.
[('/home/folder_with_text_files/file1.txt', 'file1_contents_line1'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line2'),
('/home/folder_with_text_files/file1.txt', 'file1_contents_line3'),
('/home/folder_with_text_files/file2.txt', 'file2_contents_line1'),
...]
You can also recombine each file's lines into a list:
Spark_Full.groupByKey().map(lambda x: (x[0], list(x[1]))).collect()
[('/home/folder_with_text_files/file1.txt', ['file1_contents_line1', 'file1_contents_line2','file1_contents_line3']),
('/home/folder_with_text_files/file2.txt', ['file2_contents_line1'])]
Or recombine entire files back into single strings (in this example the result is the same as what you get from wholeTextFiles, but with the string "file:" stripped from the file paths):
Spark_Full.groupByKey().map(lambda x: (x[0], ' '.join(list(x[1])))).collect()
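Note that joining with ' ' collapses the original line breaks; joining with '\n' gets you closer to the original file contents. One caveat: after groupByKey the order of lines is only reliable when each file fits in a single partition, so treat this as a sketch:

Spark_Full.groupByKey().map(lambda x: (x[0], '\n'.join(x[1]))).collect()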