finding nested columns in pandas dataframe

I have a large dataset with many columns in (compressed) JSON format. I'm trying to convert it to parquet for subsequent processing. Some columns have a nested structure. For now I want to ignore this structure and just write those columns out as a (JSON) string.

So for the columns I've identified I am doing:

df[column] = df[column].astype(str)

However, I'm not sure which columns are nested and which are not. When I write with parquet, I see this message:

<stack trace redacted> 

  File "pyarrow/_parquet.pyx", line 1375, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children: struct<coordinates: list<item: double>, type: string>

This signals that I failed to convert one of my columns from a nested object to a string. But which column is to blame? How do I find out?

When I print the .dtypes of my pandas dataframe, I can't differentiate between string and nested values because both show up as object.

EDIT: the error gives a hint as to the nested column by showing the struct details, but this is quite time-consuming to debug. Also, it only prints the first error, so if you have multiple nested columns this can get quite annoying.

asked Apr 13 '20 20:04

Daniel Kats

3 Answers

Casting nested structure to string

If I understand your question correctly, you want to serialize those nested Python objects (list, dict) within df to JSON strings and leave other elements unchanged. It is better to write your own casting method:

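import json

def json_serializer(obj):
    # isinstance() needs a type or a tuple of types, not a list.
    # Add any other types you consider nested to the tuple.
    if isinstance(obj, (list, dict)):
        return json.dumps(obj)
    return obj

# On pandas >= 2.1 applymap is deprecated; use df.map(json_serializer) there.
df = df.applymap(json_serializer)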

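If the dataframe is huge, calling astype(str) on just the nested columns would be faster. Note that astype(str) writes the Python repr of each value (single-quoted keys, for example), which is not valid JSON.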

nested_cols = []
for c in df:
    # isinstance() takes a tuple of types here, not a list
    if any(isinstance(obj, (list, dict)) for obj in df[c]):
        nested_cols.append(c)

for c in nested_cols:
    df[c] = df[c].astype(str) # this converts every element in the column, regardless of its type

This approach has a performance benefit thanks to the short-circuit evaluation in the call to any(...). It will return immediately once hitting the first nested object in the column and will not waste time checking the rest. If any of the "Dtype Introspection" methods fits your data, using that would be even faster.
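The two ideas above can be combined: find the nested columns with the short-circuiting any(...), then convert only those columns with json.dumps so the result is valid JSON rather than a Python repr. This is a minimal sketch, not a definitive implementation; the helper name stringify_nested_columns and the sample frame are made up for illustration, and it assumes the nested values are JSON-serializable.

```python
import json

import pandas as pd

NESTED_TYPES = (list, dict)  # extend with tuple, set, ... if your data needs it


def stringify_nested_columns(df):
    """Return (new_df, nested_cols); nested values become JSON strings."""
    out = df.copy()
    nested_cols = []
    for col in out.columns:
        # any() stops at the first nested value it meets in the column.
        if any(isinstance(v, NESTED_TYPES) for v in out[col]):
            nested_cols.append(col)
            # Only nested values are serialized; None and scalars pass through.
            out[col] = out[col].map(
                lambda v: json.dumps(v) if isinstance(v, NESTED_TYPES) else v
            )
    return out, nested_cols


# Hypothetical sample data: one scalar column, one nested column with a gap.
sample = pd.DataFrame({
    "id": [1, 2],
    "geo": [{"type": "Point", "coordinates": [1.0, 2.0]}, None],
})
converted, nested = stringify_nested_columns(sample)
print(nested)               # ['geo']
print(converted["geo"][0])  # {"type": "Point", "coordinates": [1.0, 2.0]}
```

json.dumps raises TypeError for values it cannot serialize (a set, for instance), so columns holding such values would need their own handling.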

Check the latest version of pyarrow

I assume those nested structures need to be converted to strings only because they would cause an error in pyarrow.parquet.write_table. Maybe you don't need to convert them at all, because the handling of nested columns in pyarrow has reportedly been addressed recently (29th Mar 2020, ver 0.17.0). That support may still be problematic, though, and is under active discussion.

answered Nov 09 '22 03:11

gdlmx

Using a general utility function like infer_dtype() in pandas you can determine if the column is nested or not.

from pandas.api.types import infer_dtype

for col in df.columns:
    # 'mixed' is the catch-all for anything that is not otherwise specialized,
    # so it can also match non-nested columns that mix, say, strings and floats
    if infer_dtype(df[col]) == 'mixed':
        df[col] = df[col].astype('str')

If you are targeting specific data types, then see Dtype Introspection
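To see what infer_dtype() reports before relying on it, here is a small sketch on a made-up frame (all column names are hypothetical). It also shows that 'mixed' is not specific to nested data: a column mixing strings and floats gets the same label.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical sample covering nested, plain and loosely typed columns.
demo = pd.DataFrame({
    "nested": [{"a": 1}, {"a": 2}],
    "text": ["x", "y"],
    "number": [1, 2],
    "loose": ["x", 1.5],  # not nested, just mixed scalar types
})

inferred = {col: infer_dtype(demo[col]) for col in demo.columns}
print(inferred)
# {'nested': 'mixed', 'text': 'string', 'number': 'integer', 'loose': 'mixed'}
```

If such loosely typed columns exist in your data, casting every 'mixed' column to string will convert them too, which may or may not be what you want.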

answered Nov 09 '22 03:11

Saurabh P Bhandari

I had a similar problem when working with PySpark and a streaming Dataset: some columns were nested and some were not.

Given that your dataframe may look like this:

df = pd.DataFrame({'A' : [{1 : [1,5], 2 : [15,25], 3 : ['A','B']}],
                   'B' : [[[15,25,61],[44,22,87],['A','B',44]]],
                   'C' : [((15,25,87),(22,91))],
                   'D' : 15,
                   'E' : 'A'
                  })


print(df)

                                         A  \
0  {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}   

                                          B                         C   D  E  
0  [[15, 25, 61], [44, 22, 87], [A, B, 44]]  ((15, 25, 87), (22, 91))  15  A

We can stack the first row of your dataframe and use apply with type to get the Python type of the value in each column, then turn the result into a dictionary.

df.head(1).stack().apply(type).reset_index(0,drop=True).to_dict()
out:
{'A': dict, 'B': list, 'C': tuple, 'D': int, 'E': str}

With this, we can write a function that returns a tuple of the nested and unnested columns.

Function

def find_types(dataframe):

    col_dict = dataframe.head(1).stack().apply(type).reset_index(0,drop=True).to_dict()
    unnested_columns = [k for (k,v) in col_dict.items() if v not in (dict,set,list,tuple)]
    nested_columns = list(set(col_dict.keys()) - set(unnested_columns))
    return nested_columns,unnested_columns

In Action.

nested, unnested = find_types(df)

df[unnested]

    D  E
0  15  A

print(df[nested])

                          C                                        A  \
0  ((15, 25, 87), (22, 91))  {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}   

                                          B  
0  [[15, 25, 61], [44, 22, 87], [A, B, 44]]
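Because find_types only inspects the first row, a column whose first value is missing would not be recognized as nested. A possible variant, sketched here under the assumption that scanning every value is affordable, checks each column until it meets a nested value; the name find_types_any_row and the sample frame are hypothetical and not part of the answer above.

```python
import pandas as pd

NESTED_TYPES = (dict, set, list, tuple)


def find_types_any_row(dataframe):
    """Split columns into (nested, unnested), looking beyond the first row."""
    nested_columns, unnested_columns = [], []
    for col in dataframe.columns:
        # any() stops scanning the column at the first nested value.
        is_nested = any(isinstance(v, NESTED_TYPES) for v in dataframe[col])
        (nested_columns if is_nested else unnested_columns).append(col)
    return nested_columns, unnested_columns


demo = pd.DataFrame({
    "A": [None, {"k": [1, 2]}],  # nested value appears only in the second row
    "D": [15, 16],
    "E": ["A", "B"],
})
nested, unnested = find_types_any_row(demo)
print(nested)    # ['A']
print(unnested)  # ['D', 'E']
```

Unlike the set difference used in find_types, the two lists here keep the original column order.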

answered Nov 09 '22 03:11

Umar.H

Tags:

python

python-3.x

pandas

pyarrow
