I am trying to re-arrange a DataFrame that I automatically read in from a json using Pandas. I've searched but have had no success.
I have the following json (saved as a string for copy/paste convenience) with a bunch of json objects/dictionarys under the tag 'value'
json_str = '''{"preferred_timestamp": "internal_timestamp",
"internal_timestamp": 3606765503.684,
"stream_name": "ctdpf_j_cspp_instrument",
"values": [{
"value_id": "temperature",
"value": 9.8319
}, {
"value_id": "conductivity",
"value": 3.58847
}, {
"value_id": "pressure",
"value": 22.963
}]
}'''
I use the function 'json_normalize' in order to load the json into a flattened Pandas dataframe.
>>> from pandas.io.json import json_normalize
>>> import simplejson as json
>>> df = json_normalize(json.loads(json_str), 'values', ['preferred_timestamp', 'stream_name', 'internal_timestamp'])
>>> df
value value_id preferred_timestamp internal_timestamp \
0 9.83190 temperature internal_timestamp 3.606766e+09
1 3.58847 conductivity internal_timestamp 3.606766e+09
2 22.96300 pressure internal_timestamp 3.606766e+09
3 32.89470 salinity internal_timestamp 3.606766e+09
stream_name
0 ctdpf_j_cspp_instrument
1 ctdpf_j_cspp_instrument
2 ctdpf_j_cspp_instrument
3 ctdpf_j_cspp_instrument
Here is where I am stuck. I want to take the value and value_id columns and pivot these into new columns based off of value_id.
I want the dataframe to look like the following:
stream_name preferred_timestamp internal_timestamp conductivity pressure salinity temperature
ctdpf_j_cspp_instrument internal_timestamp 3.606766e+09 3.58847 22.96300 32.89470 9.83190
I've tried both the pivot and pivot_table Pandas functions and even tried to manually pivot the tables by using 'set_index' and 'stack' but it's not quite how I want it.
>>> df.pivot_table(values='value', index=['stream_name', 'preferred_timestamp', 'internal_timestamp', 'value_id'])
stream_name preferred_timestamp internal_timestamp value_id
ctdpf_j_cspp_instrument internal_timestamp 3.606766e+09 conductivity 3.58847
pressure 22.96300
salinity 32.89470
temperature 9.83190
Name: value, dtype: float64
This is close, but it didn't seem to pivot the values in 'value_id' into separate columns.
and
>>> df.pivot('stream_name', 'value_id', 'value')
value_id conductivity pressure salinity temperature
stream_name
ctdpf_j_cspp_instrument 3.58847 22.963 32.8947 9.8319
Close again, but it lacks the other columns that I want to be associated with this line.
I'm stuck here. Is there an elegant way of doing this or should I split the DataFrames and re-merge them to how I want?
To select a single column, use square brackets [] with the column name of the column of interest.
You can create a new DataFrame of a specific column by using DataFrame. assign() method. The assign() method assign new columns to a DataFrame, returning a new object (a copy) with the new columns added to the original ones.
We can exclude one column from the pandas dataframe by using the loc function. This function removes the column based on the location. Here we will be using the loc() function with the given data frame to exclude columns with name,city, and cost in python.
Select All Except One Column Using drop() Method in pandas In order to remove columns use axis=1 or columns param. For example df. drop("Discount",axis=1) removes Discount column by kepping all other columns untouched. This gives you a DataFrame with all columns with out one unwanted column.
Your first attempt was nearly correct, just use columns='value_id'
instead of including it in the index.
# Perform the pivot.
df = df.pivot_table(
values='value',
index=['stream_name', 'preferred_timestamp', 'internal_timestamp'],
columns='value_id'
)
# Formatting.
df.reset_index(inplace=True)
df.columns.name = None
This isn't an issue in your example data, but keep in mind that pivot_table
will aggregate values if multiple values are pivoted to the same position (taking the mean by default).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With