
Python Pandas: pivot only certain columns in the DataFrame while keeping others

I am trying to re-arrange a DataFrame that I automatically read in from a json using Pandas. I've searched but have had no success.

I have the following JSON (saved as a string for copy/paste convenience) with a list of JSON objects/dictionaries under the key 'values':

json_str = '''{"preferred_timestamp": "internal_timestamp",
    "internal_timestamp": 3606765503.684,
    "stream_name": "ctdpf_j_cspp_instrument",
    "values": [{
        "value_id": "temperature",
        "value": 9.8319
    }, {
        "value_id": "conductivity",
        "value": 3.58847
    }, {
        "value_id": "pressure",
        "value": 22.963
    }, {
        "value_id": "salinity",
        "value": 32.8947
    }]
}'''

I use the function 'json_normalize' in order to load the json into a flattened Pandas dataframe.

>>> from pandas.io.json import json_normalize
>>> import simplejson as json
>>> df = json_normalize(json.loads(json_str), 'values', ['preferred_timestamp', 'stream_name', 'internal_timestamp'])
>>> df
      value      value_id preferred_timestamp  internal_timestamp  \
0   9.83190   temperature  internal_timestamp        3.606766e+09   
1   3.58847  conductivity  internal_timestamp        3.606766e+09   
2  22.96300      pressure  internal_timestamp        3.606766e+09   
3  32.89470      salinity  internal_timestamp        3.606766e+09   

               stream_name
0  ctdpf_j_cspp_instrument
1  ctdpf_j_cspp_instrument
2  ctdpf_j_cspp_instrument
3  ctdpf_j_cspp_instrument

Here is where I am stuck. I want to take the value and value_id columns and pivot these into new columns based off of value_id.

I want the dataframe to look like the following:

stream_name              preferred_timestamp  internal_timestamp  conductivity  pressure  salinity  temperature    
ctdpf_j_cspp_instrument  internal_timestamp   3.606766e+09        3.58847       22.96300  32.89470  9.83190

I've tried both the pivot and pivot_table Pandas functions and even tried to manually pivot the tables by using 'set_index' and 'stack' but it's not quite how I want it.

>>> df.pivot_table(values='value', index=['stream_name', 'preferred_timestamp', 'internal_timestamp', 'value_id'])
stream_name              preferred_timestamp  internal_timestamp  value_id    
ctdpf_j_cspp_instrument  internal_timestamp   3.606766e+09        conductivity     3.58847
                                                                  pressure        22.96300
                                                                  salinity        32.89470
                                                                  temperature      9.83190
Name: value, dtype: float64

This is close, but it didn't seem to pivot the values in 'value_id' into separate columns.


>>> df.pivot(index='stream_name', columns='value_id', values='value')
value_id                 conductivity  pressure  salinity  temperature
stream_name
ctdpf_j_cspp_instrument       3.58847    22.963   32.8947       9.8319

Close again, but it lacks the other columns that I want to be associated with this line.

I'm stuck here. Is there an elegant way of doing this or should I split the DataFrames and re-merge them to how I want?

naja asked Mar 15 '16 18:03


1 Answer

Your first attempt was nearly correct; just pass columns='value_id' instead of including 'value_id' in the index.

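# Perform the pivot: value_id becomes the column labels, value fills the cells.
df = df.pivot_table(
    index=['stream_name', 'preferred_timestamp', 'internal_timestamp'],
    columns='value_id',
    values='value'
).reset_index()  # turn the index levels back into ordinary columns

# Formatting: drop the leftover 'value_id' label on the column axis.
df.columns.name = None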
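
For reference, here is a self-contained sketch of the whole round trip, from the JSON string to the wide layout you asked for. It assumes pandas 1.0 or later, where json_normalize is available as pd.json_normalize, and it assumes the record ends with the "salinity" entry that appears in your DataFrame output. The names id_cols, long_df and wide_df are only illustrative.

```python
import json

import pandas as pd

# Same record as in the question. The closing "salinity" entry is an
# assumption taken from the DataFrame output shown there.
json_str = '''{"preferred_timestamp": "internal_timestamp",
    "internal_timestamp": 3606765503.684,
    "stream_name": "ctdpf_j_cspp_instrument",
    "values": [
        {"value_id": "temperature", "value": 9.8319},
        {"value_id": "conductivity", "value": 3.58847},
        {"value_id": "pressure", "value": 22.963},
        {"value_id": "salinity", "value": 32.8947}
    ]}'''

# Record-level fields that should stay as ordinary columns.
id_cols = ['stream_name', 'preferred_timestamp', 'internal_timestamp']

# Flatten: one row per entry in "values", with the record-level fields repeated.
long_df = pd.json_normalize(json.loads(json_str), 'values', id_cols)

# Pivot value_id into columns, then turn the index levels back into columns.
wide_df = (
    long_df
    .pivot_table(index=id_cols, columns='value_id', values='value')
    .reset_index()
)
wide_df.columns.name = None

print(wide_df)
```

The result has a single row, with the three identifying columns followed by one column per value_id in alphabetical order.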

This isn't an issue in your example data, but keep in mind that pivot_table will aggregate values if multiple values are pivoted to the same position (taking the mean by default).
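
A small sketch of that aggregation behaviour, using made-up data (the stream name 's1' and the readings are hypothetical, not from your file). It also shows that pivot, unlike pivot_table, refuses duplicates rather than aggregating them.

```python
import pandas as pd

# Hypothetical data: 'temperature' appears twice for the same stream.
df = pd.DataFrame({
    'stream_name': ['s1', 's1', 's1'],
    'value_id': ['temperature', 'temperature', 'pressure'],
    'value': [10.0, 12.0, 22.0],
})

# pivot_table silently averages the two temperature readings (default aggfunc).
mean_df = df.pivot_table(index='stream_name', columns='value_id', values='value')

# A different aggregation can be requested explicitly.
max_df = df.pivot_table(index='stream_name', columns='value_id',
                        values='value', aggfunc='max')

# pivot() does not aggregate; duplicate index/column pairs raise a ValueError.
try:
    df.pivot(index='stream_name', columns='value_id', values='value')
    pivot_failed = False
except ValueError:
    pivot_failed = True

print(mean_df.loc['s1', 'temperature'])  # 11.0
print(max_df.loc['s1', 'temperature'])   # 12.0
print(pivot_failed)                      # True
```

If silent averaging would hide a data problem, checking for duplicates first (or using pivot so that they raise) is probably the safer choice.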

root answered Oct 05 '22 06:10
