I have a large dataset with many columns in (compressed) JSON format. I'm trying to convert it to parquet for subsequent processing. Some columns have a nested structure. For now I want to ignore this structure and just write those columns out as a (JSON) string.
So for the columns I've identified I am doing:
df[column] = df[column].astype(str)
However, I'm not sure which columns are nested and which are not. When I write with parquet, I see this message:
<stack trace redacted>
File "pyarrow/_parquet.pyx", line 1375, in pyarrow._parquet.ParquetWriter.write_table
File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children: struct<coordinates: list<item: double>, type: string>
This signals that I failed to convert one of my columns from a nested object to a string. But which column is to blame? How do I find out?
When I print the .dtypes of my pandas dataframe, I can't differentiate between string and nested values because both show up as object.
EDIT: the error hints at the nested column by showing the struct details, but this is quite time-consuming to debug. It also only reports the first error, so if you have multiple nested columns this gets quite annoying.
If I understand your question correctly, you want to serialize the nested Python objects (list, dict) within df to JSON strings and leave the other elements unchanged. It is better to write your own casting function:
import json
def json_serializer(obj):
    # isinstance() takes a tuple of types, not a list; add other nested types here as needed
    if isinstance(obj, (list, dict)):
        return json.dumps(obj)
    return obj
df = df.applymap(json_serializer)  # in pandas 2.1+, DataFrame.map is the non-deprecated name
In case the dataframe is huge, using astype(str) would be faster.
nested_cols = []
for c in df:
    # isinstance() takes a tuple of types, not a list
    if any(isinstance(obj, (list, dict)) for obj in df[c]):
        nested_cols.append(c)
for c in nested_cols:
    df[c] = df[c].astype(str)  # this converts every element in the column, regardless of its type
This approach has a performance benefit thanks to the short-circuit evaluation in the call to any(...). It will return immediately once hitting the first nested object in the column and will not waste time checking the rest. If any of the "Dtype Introspection" methods fits your data, using that would be even faster.
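One caveat with astype(str) is worth keeping in mind, sketched here with a plain Python value rather than a DataFrame. str() on a dict or list gives the Python repr (single quotes, None), which is generally not valid JSON, whereas json.dumps gives a string that json.loads can read back. The sample record below is made up for illustration.

```python
import json

# Hypothetical nested value, similar in shape to the struct in the error message.
record = {"type": "Point", "coordinates": [1.5, 2.5], "label": None}

as_repr = str(record)         # what astype(str) stores for this element
as_json = json.dumps(record)  # what json_serializer stores for this element

print(as_repr)
print(as_json)

# The JSON string round-trips; check whether the repr string parses as JSON.
round_tripped = json.loads(as_json)
try:
    json.loads(as_repr)
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False
print(repr_is_valid_json)
```

If the strings need to be parsed again later, applying json.dumps only to the columns in nested_cols may be a reasonable middle ground between the two approaches.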
I assume that those nested structures need to be converted to strings only because they would otherwise cause an error in pyarrow.parquet.write_table.
Maybe you don't need to convert it at all, because the issue of handling nested columns in pyarrow has been reportedly solved recently (29th Mar 2020, ver 0.17.0).
But the support may be problematic and under active discussion.
Using a general utility function like infer_dtype() in pandas you can determine if the column is nested or not.
from pandas.api.types import infer_dtype
for col in df.columns:
    if infer_dtype(df[col]) == 'mixed':
        # 'mixed' is the catchall for anything that is not otherwise specialized
        df[col] = df[col].astype('str')
If you are targeting specific data types, then see Dtype Introspection
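A small sketch of what infer_dtype() reports for different kinds of columns may help here. The column names and values are made up. Note that 'mixed' is a catchall, so a column that mixes strings and floats is reported the same way as a column of dicts; a 'mixed' result is a hint that a column may be nested, not proof.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Made-up columns: one nested, two plain, one heterogeneous but not nested.
df = pd.DataFrame({
    "nested": [{"a": 1}, {"a": 2}],
    "text": ["x", "y"],
    "numbers": [1, 2],
    "text_and_float": ["x", 1.5],
})

# Collect the inferred label for each column.
labels = {col: infer_dtype(df[col]) for col in df.columns}
print(labels)
```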
I had a similar problem when working with PySpark and a streaming Dataset: some columns were nested and some were not.
Given that your dataframe may look like this:
df = pd.DataFrame({'A' : [{1 : [1,5], 2 : [15,25], 3 : ['A','B']}],
'B' : [[[15,25,61],[44,22,87],['A','B',44]]],
'C' : [((15,25,87),(22,91))],
'D' : 15,
'E' : 'A'
})
print(df)
A \
0 {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}
B C D E
0 [[15, 25, 61], [44, 22, 87], [A, B, 44]] ((15, 25, 87), (22, 91)) 15 A
We can stack your dataframe and use apply with type to get the type of each column and pass it to a dictionary.
df.head(1).stack().apply(type).reset_index(0,drop=True).to_dict()
out:
{'A': dict, 'B': list, 'C': tuple, 'D': int, 'E': str}
With this we can use a function to return a tuple of the nested and unnested columns.
def find_types(dataframe):
    col_dict = dataframe.head(1).stack().apply(type).reset_index(0, drop=True).to_dict()
    unnested_columns = [k for (k, v) in col_dict.items() if v not in (dict, set, list, tuple)]
    nested_columns = list(set(col_dict.keys()) - set(unnested_columns))
    return nested_columns, unnested_columns
nested, unnested = find_types(df)
print(df[unnested])
D E
0 15 A
print(df[nested])
C A \
0 ((15, 25, 87), (22, 91)) {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}
B
0 [[15, 25, 61], [44, 22, 87], [A, B, 44]]
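Because the approach above only looks at the first row, a column whose first value is missing can be missed or misclassified. A possible variant, sketched below under that assumption, takes the type of the first non-null value in each column instead. The helper name first_non_null_types and the sample data are hypothetical.

```python
import pandas as pd

def first_non_null_types(dataframe):
    """Map each column name to the type of its first non-null value (None if all null)."""
    types = {}
    for col in dataframe.columns:
        non_null = dataframe[col].dropna()
        types[col] = type(non_null.iloc[0]) if len(non_null) else None
    return types

# Made-up data: column "A" has a missing value in its first row.
df = pd.DataFrame({
    "A": [None, {"k": [1, 5]}],
    "B": [[15, 25], [44, 22]],
    "E": ["A", "B"],
})

types = first_non_null_types(df)
nested = [c for c, t in types.items() if t in (dict, list, tuple, set)]
print(nested)
```

This still inspects a single value per column, so a column that only becomes nested further down would need a full scan such as the any(...) loop.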