
How to denormalize YAML for Pandas Dataframe?

I am trying to get data from a YAML file into a Pandas DataFrame. Take the following example data.yml:

---
 - doc: "Book1"
   reviews:
     - reviewer: "Paul"
       stars: "5"
     - reviewer: "Sam"
       stars: "2"
 - doc: "Book2"
   reviews:
     - reviewer: "John"
       stars: "4"
     - reviewer: "Sam"
       stars: "3"
     - reviewer: "Pete"
       stars: "2"
...

The desired DataFrame would look like this:

     doc reviews.reviewer reviews.stars
0  Book1             Paul             5
1  Book1              Sam             2
2  Book2             John             4
3  Book2              Sam             3
4  Book2             Pete             2

I've tried feeding the YAML data to pandas in different ways (for example, with open('data.yml') as f: data = pd.DataFrame(yaml.load(f))), but the cells always contain the nested dicts. There is a solution that works for general JSON data, but it's quite a bit of code, and it seems like a simpler solution for YAML might exist.

Is there a built-in or Pythonic way to denormalize YAML for conversion to a pandas DataFrame in this way?

user1717828 asked Jan 18 '19 17:01



2 Answers

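Use json_normalize to flatten the nested dictionaries once the YAML has been loaded. The second argument names the nested list whose entries become rows, and the third names the parent field to repeat on each row:

pd.io.json.json_normalize(yaml.safe_load(f), 'reviews', 'doc')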

  reviewer stars    doc
0     Paul     5  Book1
1      Sam     2  Book1
2     John     4  Book2
3      Sam     3  Book2
4     Pete     2  Book2
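The output above has plain reviewer and stars columns, while the question asks for reviews.reviewer and reviews.stars. A minimal sketch of one way to get those names, assuming pandas 1.0 or later (where json_normalize is available at the top level) and PyYAML; the YAML is inlined as a string here only so the example is self-contained, and the variable names are illustrative:

```python
import pandas as pd
import yaml

# Same data as data.yml in the question, inlined for a self-contained example.
YAML_TEXT = """
- doc: "Book1"
  reviews:
    - reviewer: "Paul"
      stars: "5"
    - reviewer: "Sam"
      stars: "2"
- doc: "Book2"
  reviews:
    - reviewer: "John"
      stars: "4"
    - reviewer: "Sam"
      stars: "3"
    - reviewer: "Pete"
      stars: "2"
"""

records = yaml.safe_load(YAML_TEXT)

# record_path: the nested list to expand into rows.
# meta: the parent field to repeat on every expanded row.
# record_prefix: prepended to the column names coming from the nested records.
df = pd.json_normalize(
    records,
    record_path="reviews",
    meta="doc",
    record_prefix="reviews.",
)

# json_normalize puts the meta column last; reorder to match the question.
df = df[["doc", "reviews.reviewer", "reviews.stars"]]
print(df)
```

Note that stars stays a string column because the YAML quotes the values; pd.to_numeric can convert it if numbers are needed.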
cs95 answered Oct 06 '22 10:10


Using the above now raises a FutureWarning: pandas.io.json.json_normalize is deprecated, use pandas.json_normalize instead.

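# Let's say the YAML file is test_sample.yaml
from os import getcwd, path

from pandas import json_normalize
from yaml import SafeLoader, load

# "..." stands for the intermediate directories, left elided as in the original
path_to_yaml = path.join(getcwd(), ..., "test_sample.yaml")
with open(path_to_yaml) as yaml_file:
    # Pass the open file object (not the path) to the loader
    yaml_contents = load(yaml_file, Loader=SafeLoader)
# Expand each entry under "reviews" into its own row and carry "doc" along
yaml_df = json_normalize(yaml_contents, "reviews", "doc")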
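To see what the record_path/meta style of flattening amounts to, here is a rough plain-Python equivalent, offered as a sketch rather than as how pandas implements it. The records list is the structure a YAML loader would return for the question's data, written out literally so no file is needed; the variable names are illustrative:

```python
import pandas as pd

# What a YAML loader returns for the question's data.yml.
records = [
    {"doc": "Book1", "reviews": [
        {"reviewer": "Paul", "stars": "5"},
        {"reviewer": "Sam", "stars": "2"},
    ]},
    {"doc": "Book2", "reviews": [
        {"reviewer": "John", "stars": "4"},
        {"reviewer": "Sam", "stars": "3"},
        {"reviewer": "Pete", "stars": "2"},
    ]},
]

# One output row per (doc, review) pair: repeat the parent field and
# prefix each nested key with "reviews.".
rows = [
    {"doc": entry["doc"], **{f"reviews.{key}": value for key, value in review.items()}}
    for entry in records
    for review in entry["reviews"]
]

df = pd.DataFrame(rows)

# The YAML quotes the star ratings, so convert them to integers explicitly.
df["reviews.stars"] = pd.to_numeric(df["reviews.stars"])
print(df)
```

This version hard-codes the doc and reviews keys, so json_normalize remains the more general choice when the nesting varies.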
Salil Shenoy answered Oct 06 '22 09:10