How to denormalize YAML for Pandas Dataframe?

Tags: python, pandas, dataframe, yaml, denormalization

I am trying to get data from a YAML file into a Pandas DataFrame. Take the following example data.yml:

---
 - doc: "Book1"
   reviews:
     - reviewer: "Paul"
       stars: "5"
     - reviewer: "Sam"
       stars: "2"
 - doc: "Book2"
   reviews:
     - reviewer: "John"
       stars: "4"
     - reviewer: "Sam"
       stars: "3"
     - reviewer: "Pete"
       stars: "2"
...

The desired DataFrame would look like this:

     doc reviews.reviewer reviews.stars
0  Book1             Paul             5
1  Book1              Sam             2
2  Book2             John             4
3  Book2              Sam             3
4  Book2             Pete             2

I've tried feeding the YAML data to Pandas in different ways (for example, with open('data.yml') as f: data = pd.DataFrame(yaml.load(f))), but the cells always contain the nested dicts. A solution written for general JSON data works, but it's quite a bit of code, and it seems like a simpler solution for YAML might exist.

Is there a built-in or Pythonic way to denormalize YAML for conversion to a Pandas DataFrame in this way?


asked Jan 18 '19 17:01

user1717828

2 Answers

Use json_normalize to flatten the nested records after loading the YAML. Pass 'reviews' as the record path and 'doc' as the metadata to repeat on each row:

pd.io.json.json_normalize(yaml.safe_load(f), 'reviews', 'doc')

  reviewer stars    doc
0     Paul     5  Book1
1      Sam     2  Book1
2     John     4  Book2
3      Sam     3  Book2
4     Pete     2  Book2
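
As a self-contained sketch of the same idea, the snippet below assumes PyYAML is installed and a pandas release that exposes json_normalize at the top level. The inline YAML_TEXT string is a stand-in for data.yml, and record_prefix is used only to reproduce the reviews.* column names from the question:

```python
import pandas as pd
import yaml

# Stand-in for the contents of data.yml from the question
YAML_TEXT = """
- doc: "Book1"
  reviews:
    - reviewer: "Paul"
      stars: "5"
    - reviewer: "Sam"
      stars: "2"
- doc: "Book2"
  reviews:
    - reviewer: "John"
      stars: "4"
    - reviewer: "Sam"
      stars: "3"
    - reviewer: "Pete"
      stars: "2"
"""

# safe_load parses the YAML into a list of plain dicts
records = yaml.safe_load(YAML_TEXT)

# One row per entry in "reviews"; "doc" is repeated on each row
df = pd.json_normalize(
    records,
    record_path="reviews",
    meta="doc",
    record_prefix="reviews.",
)

# Reorder the columns to match the layout asked for in the question
df = df[["doc", "reviews.reviewer", "reviews.stars"]]
print(df)
```

Note that the stars values remain strings, because they are quoted in the YAML.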


answered Oct 06 '22 10:10

cs95

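Using the code above now raises a FutureWarning: pandas.io.json.json_normalize is deprecated, use pandas.json_normalize instead. An updated version:

# Assume the YAML file is named test_sample.yaml
from os import getcwd, path

from pandas import json_normalize
from yaml import SafeLoader, load

# "..." stands for the intermediate directories elided in this answer
path_to_yaml = path.join(getcwd(), ..., "test_sample.yaml")
with open(path_to_yaml) as yaml_file:
    # Pass the open file object (not the path) to the loader
    yaml_contents = load(yaml_file, Loader=SafeLoader)
# Expand each entry of "reviews" into its own row and carry "doc" along
yaml_df = json_normalize(yaml_contents, record_path="reviews", meta="doc")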
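
If json_normalize is not wanted at all, a plain list comprehension can do the same flattening. This is only a sketch: flatten_reviews is a hypothetical helper name, and the docs literal stands in for what a YAML loader would return for the question's data.yml:

```python
import pandas as pd

# Stand-in for the parsed YAML: a list of dicts, each holding a list of reviews
docs = [
    {
        "doc": "Book1",
        "reviews": [
            {"reviewer": "Paul", "stars": "5"},
            {"reviewer": "Sam", "stars": "2"},
        ],
    },
    {
        "doc": "Book2",
        "reviews": [
            {"reviewer": "John", "stars": "4"},
            {"reviewer": "Sam", "stars": "3"},
            {"reviewer": "Pete", "stars": "2"},
        ],
    },
]


def flatten_reviews(entries):
    """Return one flat dict per (doc, review) pair."""
    return [
        {
            "doc": entry["doc"],
            "reviews.reviewer": review["reviewer"],
            "reviews.stars": review["stars"],
        }
        for entry in entries
        for review in entry["reviews"]
    ]


rows = flatten_reviews(docs)
df = pd.DataFrame(rows)

# The stars are quoted strings in the YAML; convert them if numbers are needed
df["reviews.stars"] = df["reviews.stars"].astype(int)
print(df)
```

The comprehension is tied to this exact two-level shape, while json_normalize handles deeper or less regular nesting.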

answered Oct 06 '22 09:10

Salil Shenoy
