I am trying to get data from a YAML file into a Pandas DataFrame. Take the following example data.yml:
---
- doc: "Book1"
  reviews:
    - reviewer: "Paul"
      stars: "5"
    - reviewer: "Sam"
      stars: "2"
- doc: "Book2"
  reviews:
    - reviewer: "John"
      stars: "4"
    - reviewer: "Sam"
      stars: "3"
    - reviewer: "Pete"
      stars: "2"
...
The desired DataFrame would look like this:
     doc  reviews.reviewer  reviews.stars
0  Book1              Paul              5
1  Book1               Sam              2
2  Book2              John              4
3  Book2               Sam              3
4  Book2              Pete              2
I've tried feeding the YAML data to Pandas in different ways (for example, with open('data.yml') as f: data = pd.DataFrame(yaml.load(f))), but the cells always contain the nested dicts. This solution works for general JSON data, but it's quite a bit of code, and it seems like a simpler solution for YAML might exist.
Is there a built-in or Pythonic way to denormalize YAML for conversion to a Pandas DataFrame in this way?
You should use json_normalize to flatten the dictionary after loading the YAML:
pd.io.json.json_normalize(yaml.load(f), 'reviews', 'doc')
  reviewer stars    doc
0     Paul     5  Book1
1      Sam     2  Book1
2     John     4  Book2
3      Sam     3  Book2
4     Pete     2  Book2
Using the above now raises a FutureWarning: pandas.io.json.json_normalize is deprecated, use pandas.json_normalize instead
# let's say the YAML file is test_sample.yaml
from os import getcwd, path

from pandas import json_normalize
from yaml import SafeLoader, load

path_to_yaml = path.join(getcwd(), ..., "test_sample.yaml")
with open(path_to_yaml) as yaml_file:
    # pass the open file object, not the path, to the loader
    yaml_contents = load(yaml_file, Loader=SafeLoader)
yaml_df = json_normalize(yaml_contents)
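For the data.yml from the question, something along these lines should give the layout the asker wanted. This is only a sketch: it assumes pandas 1.0 or later (where pd.json_normalize is available) and PyYAML, and it inlines the YAML as a string (the YAML_TEXT name is just for illustration) so the snippet runs without a file on disk. The record_prefix argument and the final column reordering are there only to match the column names shown in the question.

```python
import pandas as pd
import yaml  # PyYAML

# The question's data.yml, inlined here so the example is self-contained.
YAML_TEXT = """
- doc: "Book1"
  reviews:
    - reviewer: "Paul"
      stars: "5"
    - reviewer: "Sam"
      stars: "2"
- doc: "Book2"
  reviews:
    - reviewer: "John"
      stars: "4"
    - reviewer: "Sam"
      stars: "3"
    - reviewer: "Pete"
      stars: "2"
"""

# safe_load returns plain Python lists and dicts, which json_normalize accepts.
records = yaml.safe_load(YAML_TEXT)

# One output row per entry in each "reviews" list; "doc" is repeated as metadata.
df = pd.json_normalize(
    records,
    record_path="reviews",
    meta="doc",
    record_prefix="reviews.",
)

# Put "doc" first to match the layout in the question.
df = df[["doc", "reviews.reviewer", "reviews.stars"]]
print(df)
```

Note that the stars values stay strings because they are quoted in the YAML; if numbers are needed, df["reviews.stars"].astype(int) would convert them.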