
finding nested columns in pandas dataframe

I have a large dataset with many columns in (compressed) JSON format. I'm trying to convert it to parquet for subsequent processing. Some columns have a nested structure. For now I want to ignore this structure and just write those columns out as a (JSON) string.

So for the columns I've identified I am doing:

df[column] = df[column].astype(str)

However, I'm not sure which columns are nested and which are not. When I write to parquet, I see this error:

<stack trace redacted> 

  File "pyarrow/_parquet.pyx", line 1375, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children: struct<coordinates: list<item: double>, type: string>

This signals that I failed to convert one of my columns from a nested object to a string. But which column is to blame? How do I find out?

When I print the .dtypes of my pandas dataframe, I can't differentiate between string and nested values because both show up as object.

EDIT: the error hints at the nested column by showing the struct details, but tracking it down this way is quite time-consuming. It also reports only the first error, so if you have multiple nested columns this gets quite annoying.

Daniel Kats asked Apr 13 '20 20:04




3 Answers

Casting nested structure to string

If I understand your question correctly, you want to serialize those nested Python objects (list, dict) within df to JSON strings and leave other elements unchanged. It is better to write your own casting method:

import json

def json_serializer(obj):
    # isinstance needs a tuple of types; add any other types you consider nested
    if isinstance(obj, (list, dict)):
        return json.dumps(obj)
    return obj

# in pandas 2.1+, applymap is deprecated in favor of df.map
df = df.applymap(json_serializer)

If the dataframe is huge, it is faster to find the nested columns first and convert only those with astype(str):

nested_cols = []
for c in df:
    # isinstance needs a tuple of types, not a list
    if any(isinstance(obj, (list, dict)) for obj in df[c]):
        nested_cols.append(c)

for c in nested_cols:
    df[c] = df[c].astype(str) # this converts every element in the column, regardless of its type

This approach benefits from the short-circuit evaluation in the call to any(...): it returns as soon as it hits the first nested object in a column and does not waste time checking the rest. If any of the "Dtype Introspection" methods fits your data, using that would be even faster.
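One caveat with astype(str) is that it writes the Python repr of each object (single-quoted keys, for example), which is not valid JSON. A minimal sketch that combines the column scan above with json.dumps, so that only the nested columns are touched, might look like this. The names NESTED_TYPES, find_nested_columns and nested_to_json are made up for illustration, and the sample data is invented.

```python
import json

import pandas as pd

# Assumption: these are the types treated as "nested"; extend as needed.
NESTED_TYPES = (list, dict, tuple)


def find_nested_columns(df):
    """Return names of object columns holding at least one nested value."""
    return [
        col
        for col in df.columns
        if df[col].dtype == object
        and any(isinstance(value, NESTED_TYPES) for value in df[col])
    ]


def nested_to_json(df):
    """Return a copy where nested values are replaced by JSON strings."""
    out = df.copy()
    for col in find_nested_columns(out):
        # Non-nested values (including missing ones) pass through unchanged.
        out[col] = out[col].map(
            lambda v: json.dumps(v) if isinstance(v, NESTED_TYPES) else v
        )
    return out


# Invented sample data for illustration only.
sample = pd.DataFrame({
    "geo": [{"type": "Point", "coordinates": [1.0, 2.0]}, None],
    "tags": [["a", "b"], []],
    "name": ["x", "y"],
})
converted = nested_to_json(sample)
print(find_nested_columns(sample))
print(converted["geo"][0])
```

Because the scan runs once per column and json.dumps is applied only where needed, plain string and numeric columns are left alone.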

Check the latest version of pyarrow

I assume those nested structures need to be converted to strings only because they would otherwise cause an error in pyarrow.parquet.write_table. Maybe you don't need to convert them at all, because the issue of handling nested columns in pyarrow has reportedly been solved recently (29th Mar 2020, ver 0.17.0). However, the support may still be problematic and is under active discussion.

gdlmx answered Nov 09 '22 03:11



Using a general utility function like infer_dtype() in pandas you can determine if the column is nested or not.

from pandas.api.types import infer_dtype

for col in df.columns:
    if infer_dtype(df[col]) == 'mixed':
        # 'mixed' is the catchall for anything that is not otherwise specialized
        df[col] = df[col].astype('str')

If you are targeting specific data types, then see Dtype Introspection
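Because 'mixed' is a catchall, a column that mixes plain scalars (strings and floats, say) may be reported the same way as a column of dicts, and a column mixing nested values with integers may be reported as 'mixed-integer' instead. A hedged sketch that treats infer_dtype only as a cheap first filter and then confirms with isinstance could look like this. NESTED_TYPES and nested_columns are illustrative names, not part of pandas.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Assumption: these are the types treated as "nested".
NESTED_TYPES = (list, dict, tuple)


def nested_columns(df):
    """Return columns that contain at least one nested value."""
    cols = []
    for col in df.columns:
        # Cheap first pass: skip columns pandas recognizes as one scalar kind.
        # startswith also lets through 'mixed-integer' style results.
        if not infer_dtype(df[col], skipna=True).startswith("mixed"):
            continue
        # Confirm: 'mixed' alone does not prove the values are nested.
        if any(isinstance(v, NESTED_TYPES) for v in df[col]):
            cols.append(col)
    return cols


# Invented sample data for illustration only.
sample = pd.DataFrame({
    "payload": [{"a": 1}, {"b": 2}],
    "loose": ["text", 1.5],
    "name": ["x", "y"],
})
print(nested_columns(sample))
```

Only the columns that pass both checks would then be converted to strings.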

Saurabh P Bhandari answered Nov 09 '22 03:11



I had a similar problem when working with PySpark and a streaming Dataset: some columns were nested and some were not.

Given that your dataframe may look like this:

df = pd.DataFrame({'A' : [{1 : [1,5], 2 : [15,25], 3 : ['A','B']}],
                   'B' : [[[15,25,61],[44,22,87],['A','B',44]]],
                   'C' : [((15,25,87),(22,91))],
                   'D' : 15,
                   'E' : 'A'
                  })


print(df)

                                         A  \
0  {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}   

                                          B                         C   D  E  
0  [[15, 25, 61], [44, 22, 87], [A, B, 44]]  ((15, 25, 87), (22, 91))  15  A  

We can stack your dataframe and use apply with type to get the type of each column and pass it to a dictionary.

df.head(1).stack().apply(type).reset_index(0,drop=True).to_dict()
out:
{'A': dict, 'B': list, 'C': tuple, 'D': int, 'E': str}

With this we can write a function that returns a tuple of the nested and unnested columns. Note that it only inspects the first row of the dataframe.


Function

def find_types(dataframe):

    col_dict = dataframe.head(1).stack().apply(type).reset_index(0,drop=True).to_dict()
    unnested_columns = [k for (k,v) in col_dict.items() if v not in (dict,set,list,tuple)]
    nested_columns = list(set(col_dict.keys()) - set(unnested_columns))
    return nested_columns,unnested_columns
    

In Action.

nested,unnested = find_types(df)

df[unnested]

    D  E
0  15  A

print(df[nested])

                          C                                        A  \
0  ((15, 25, 87), (22, 91))  {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}   

                                          B  
0  [[15, 25, 61], [44, 22, 87], [A, B, 44]]  
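Since find_types looks only at the first row, a column whose first value is missing would be classified as unnested. A possible variation, sketched here under the assumption that the first non-null value is representative of the column, takes the type of that value instead. The function name first_value_types and the sample data are made up for illustration.

```python
import pandas as pd

# Same set of types that find_types treats as nested.
NESTED_TYPES = (dict, set, list, tuple)


def first_value_types(dataframe):
    """Map each column to the type of its first non-null value (or None)."""
    types = {}
    for col in dataframe.columns:
        non_null = dataframe[col].dropna()
        types[col] = type(non_null.iloc[0]) if len(non_null) else None
    return types


# Invented sample data: column A starts with a missing value.
sample = pd.DataFrame({
    "A": [None, {"k": 1}],
    "B": [[1, 2], [3]],
    "D": [15, 16],
})
types = first_value_types(sample)
# Column order is preserved because the dict is built in column order.
nested = [col for col, t in types.items() if t in NESTED_TYPES]
unnested = [col for col in sample.columns if col not in nested]
print(nested, unnested)
```

This still samples a single value per column, so a column that mixes nested and plain values in later rows needs a full scan.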
Umar.H answered Nov 09 '22 03:11
