Converting a dataframe into JSON (in pyspark) and then selecting desired fields

I'm new to Spark. I have a dataframe that contains the results of some analysis. I converted that dataframe into JSON so I could display it in a Flask App:

results = result.toJSON().collect()

An example entry in the resulting JSON is shown below:

{"userId":"1","systemId":"30","title":"interest"}

I then ran a for loop to pull out specific fields:

for i in results:
    print i["userId"]

This doesn't work at all, and I get errors such as "TypeError: expected string or buffer".

I also tried json.dumps and json.loads, but still nothing: I keep getting the error above, as well as "string indices must be integers".

I then tried this:

  print i[0]

This gave me the first character of the JSON, "{", instead of the first record. I don't really know what to do. Can anyone tell me where I'm going wrong?

Many Thanks.

xn139 asked Apr 05 '17 13:04


1 Answer

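result.toJSON() produces one JSON-encoded string per row, so each string has to be passed through json.loads() to turn it into a dict before you can index it by field name. Indexing the raw string with a key such as "userId" is what raises "string indices must be integers", and passing something other than a string to json.loads() is the likely source of "expected string or buffer". Also keep in mind that iterating over a dict with a for loop yields its keys, which are plain strings, not nested dicts. Try this: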

import json

# toJSON() turns each row of the DataFrame into a JSON string.
# Calling first() on the result fetches the first row.
results = json.loads(result.toJSON().first())

for key in results:
    print(results[key])

# To decode the entire DataFrame, iterate over the result
# of toJSON().

def print_rows(row):
    data = json.loads(row)
    for key in data:
        print("{key}:{value}".format(key=key, value=data[key]))


results = result.toJSON()
# foreach() runs on the executors, so on a cluster the output
# goes to the executor logs rather than the driver console.
results.foreach(print_rows)

    

EDIT: The issue is that collect returns a list, not a dict. I've updated the code. Always read the docs.
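Since collect() hands back a list of JSON strings, the loop from the question can be made to work by decoding each element first. The following is a minimal sketch, not tested against a real cluster: the list `collected` is a stand-in for `result.toJSON().collect()`, and its second row is made up for illustration.

```python
import json

# Stand-in for result.toJSON().collect(): a list of JSON strings,
# one per DataFrame row. The second row is invented for this example.
collected = [
    '{"userId":"1","systemId":"30","title":"interest"}',
    '{"userId":"2","systemId":"31","title":"example"}',
]

# Decode every string into a dict before indexing by field name.
rows = [json.loads(s) for s in collected]

# Each row is now a dict, so row["userId"] works as expected.
user_ids = [row["userId"] for row in rows]
print(user_ids)
```

The decoded `rows` list is plain Python data, so it could be passed to a Flask template or returned from a view without touching Spark again.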

collect(): Return a list that contains all of the elements in this RDD.

Note: This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver’s memory.
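If the full result might be too large for the driver, one option is to decode only a bounded number of rows. The sketch below assumes a hypothetical helper, `decode_some`, that accepts any iterable of JSON strings. With Spark that iterable could be `result.toJSON().toLocalIterator()`, or the list returned by `result.toJSON().take(n)`; here a generator of made-up rows stands in for it so the example runs without Spark.

```python
import json
from itertools import islice

def decode_some(json_rows, limit):
    """Decode at most `limit` JSON strings from any iterable of rows."""
    return [json.loads(s) for s in islice(json_rows, limit)]

# Stand-in for an iterator over the RDD's JSON strings (made-up data).
fake_rows = ('{{"userId":"{0}"}}'.format(i) for i in range(1000))

# Only the first three rows are read and decoded.
first_three = decode_some(fake_rows, 3)
print(first_three)
```

Because `islice` stops after `limit` items, the rest of the iterable is never consumed by this helper.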

EDIT2: I can't emphasize enough, always read the docs.

EDIT3: Look here.

Allie Fitter answered Sep 22 '22 07:09