
Solution for AssertionError: "invalid dtype determination in get_concat_dtype" when concatenating a list of DataFrames

Tags: python, pandas, csv

I have a list of DataFrames that I am attempting to combine with pd.concat:

dataframe_lists = [df1, df2, df3]

result = pd.concat(dataframe_lists, keys=['one', 'two', 'three'], ignore_index=True)

The full traceback is:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-198-a30c57d465d0> in <module>()
----> 1 result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
      2 check(dataframe_lists)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    753                        verify_integrity=verify_integrity,
    754                        copy=copy)
--> 755     return op.get_result()
    756 
    757 

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in get_result(self)
    924 
    925             new_data = concatenate_block_managers(
--> 926                 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
    927             if not self.copy:
    928                 new_data._consolidate_inplace()

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064 
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in <listcomp>(.0)
   4061                                                 copy=copy),
   4062                          placement=placement)
-> 4063               for placement, join_units in concat_plan]
   4064 
   4065     return BlockManager(blocks, axes)

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_join_units(join_units, concat_axis, copy)
   4150         raise AssertionError("Concatenating join units along axis0")
   4151 
-> 4152     empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
   4153 
   4154     to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,

C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in get_empty_dtype_and_na(join_units)
   4139         return np.dtype('m8[ns]'), tslib.iNaT
   4140     else:  # pragma
-> 4141         raise AssertionError("invalid dtype determination in get_concat_dtype")
   4142 
   4143 

AssertionError: invalid dtype determination in get_concat_dtype

I believe the error lies in the fact that one of the DataFrames is empty. I used the simple function check below to verify this and return just the headers of the empty DataFrame:

def check(list_of_df):
    headers = []
    for df in list_of_df:
        # collect the column headers of any empty DataFrame
        if df.empty:
            headers.append(df.columns)
    return headers

I am wondering whether it is possible to use this function so that, when a DataFrame is empty, just that empty DataFrame's headers are returned and appended to the concatenated DataFrame. The output would be a single header row, with any repeated column name appearing only once (as the concatenation function already does). I have two sample data sources (two non-empty data sets) and an empty DataFrame.

I would like the resulting concatenation to have these column headers...

'AT', 'AccountNum', 'AcctType', 'Amount', 'City', 'Comment', 'Country', 'DuplicateAddressFlag', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress', 'PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1', 'Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA', 'WC', 'Zip'

and to have an empty DataFrame's headers appended in line with this row (if they are new), giving:

'A', 'AT', 'AccountNum', 'AcctType', 'Amount', 'B', 'C', 'City', 'Comment', 'Country', 'D', 'DuplicateAddressFlag', 'E', 'F', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'G', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress', 'PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1', 'Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA', 'WC', 'Zip'

I welcome feedback on the best method to do this.
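One possible approach (just a sketch, assuming dataframe_lists as defined above) is to concatenate only the non-empty frames and then reindex the result against the union of all column names, so that columns that appear only in the empty frames show up once, filled with NaN:

non_empty = [df for df in dataframe_lists if not df.empty]
empty_headers = [df.columns for df in dataframe_lists if df.empty]

result = pd.concat(non_empty, ignore_index=True)

# add any column names that appear only in the empty frames,
# keeping a single instance of each repeated name
all_columns = result.columns
for cols in empty_headers:
    all_columns = all_columns.union(cols)
result = result.reindex(columns=all_columns)

Note that Index.union returns the names in sorted order, so if the original column order matters it would have to be restored separately.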

As the answer below details, this is a rather unexpected result:

Unfortunately, due to the sensitivity of this material, I cannot share the actual data. Leading up to what is presented in the gist is the following:

A = data[data['RRT'] == 'A']  # select just the rows where RRT == 'A' from the DataFrame "data"
B = data[data['RRT'] == 'B']
C = data[data['RRT'] == 'C']
D = data[data['RRT'] == 'D']

For each of the new data frames I then apply this logic:

for column_name, column in A.transpose().iterrows():
    AColumns = A[['ANum', 'RTID', 'Description', 'Type', 'Status', 'AD', 'CD', 'OD', 'RCD']]  # select these columns from the DataFrame "A"

When I access the bound count method on the empty DataFrame AColumns:

AColumns.count

This is the output:

<bound method DataFrame.count of Empty DataFrame
Columns: [ANum,RTID, Description,Type,Status, AD, CD, OD, RCD]
Index: []>

Finally, I imported the CSV with the following:

data = pd.read_csv('Merged_Success2.csv', dtype=str, error_bad_lines=False, iterator=True, chunksize=1000)
data = pd.concat([chunk for chunk in data], ignore_index=True)

I am not certain what else I can provide. The concatenation works with all the other DataFrames that are needed to meet the requirement. I have also looked at pandas' internals.py and the full trace. I suspect I either have too many columns with NaN, duplicate column names, or mixed dtypes (the last being the least likely culprit).
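For what it is worth, a quick diagnostic pass over the frames (again just a sketch, assuming dataframe_lists as above) can narrow down which of those three it is:

for i, df in enumerate(dataframe_lists):
    # report emptiness, duplicate column names, and the mix of dtypes per frame
    print(i,
          'empty:', df.empty,
          'duplicate columns:', not df.columns.is_unique,
          'dtypes:', df.dtypes.value_counts().to_dict())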

Thank you again for your guidance.

asked Sep 09 '15 by ahlusar1989


2 Answers

During one of our projects we ran into the same error. After debugging we found the problem: one of our DataFrames had two columns with the same name. After renaming one of the columns, the problem was solved.
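For reference, a minimal sketch of that kind of renaming; dedupe_columns is just a helper written here for illustration, not a pandas function:

def dedupe_columns(df):
    # append a numeric suffix to the second and later occurrences of a column name
    seen = {}
    new_cols = []
    for col in df.columns:
        if col in seen:
            seen[col] += 1
            new_cols.append('{}_{}'.format(col, seen[col]))
        else:
            seen[col] = 0
            new_cols.append(col)
    out = df.copy()
    out.columns = new_cols
    return out

dataframe_lists = [dedupe_columns(df) for df in dataframe_lists]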

answered Sep 28 '22 by remi


This often means that you have two columns with the same name in one of the DataFrames.

You can check if this is the case by looking at the output of

import numpy as np

len(df.columns) > len(np.unique(df.columns))

for each dataframe df that you are trying to concatenate.

You can identify the culprit columns using Counter; see for example:

from collections import Counter
duplicates = [c for c in Counter(df.columns).items() if c[1] > 1]
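
As a quick illustration, with a small made-up frame (the column names below are not from the question):

import pandas as pd
from collections import Counter

df = pd.DataFrame([[1, 2, 3]], columns=['RTID', 'Amount', 'RTID'])
duplicates = [c for c in Counter(df.columns).items() if c[1] > 1]
print(duplicates)  # [('RTID', 2)]

# one way to keep only the first occurrence of each duplicated column
df = df.loc[:, ~df.columns.duplicated()]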
answered Sep 28 '22 by Abramodj