I have a list of DataFrames that I am attempting to combine using pd.concat:
dataframe_lists = [df1, df2, df3]
result = pd.concat(dataframe_lists, keys=['one', 'two', 'three'], ignore_index=True)
The full traceback is:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-198-a30c57d465d0> in <module>()
----> 1 result = pd.concat(dataframe_lists, keys = ['one', 'two','three'], ignore_index=True)
2 check(dataframe_lists)
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
753 verify_integrity=verify_integrity,
754 copy=copy)
--> 755 return op.get_result()
756
757
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\tools\merge.py in get_result(self)
924
925 new_data = concatenate_block_managers(
--> 926 mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
927 if not self.copy:
928 new_data._consolidate_inplace()
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy)
4061 copy=copy),
4062 placement=placement)
--> 4063 for placement, join_units in concat_plan]
4064
4065 return BlockManager(blocks, axes)
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in <listcomp>(.0)
4061 copy=copy),
4062 placement=placement)
--> 4063 for placement, join_units in concat_plan]
4064
4065 return BlockManager(blocks, axes)
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in concatenate_join_units(join_units, concat_axis, copy)
4150 raise AssertionError("Concatenating join units along axis0")
4151
--> 4152 empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
4153
4154 to_concat = [ju.get_reindexed_values(empty_dtype=empty_dtype,
C:\WinPython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\pandas\core\internals.py in get_empty_dtype_and_na(join_units)
4139 return np.dtype('m8[ns]'), tslib.iNaT
4140 else: # pragma
--> 4141 raise AssertionError("invalid dtype determination in get_concat_dtype")
4142
4143
AssertionError: invalid dtype determination in get_concat_dtype
I believe the error stems from the fact that one of the data frames is empty. I used the simple function check below to verify this and to return just the headers of the empty dataframe:
def check(list_of_df):
    headers = []
    for df in list_of_df:  # iterate over the argument, not the global list
        if df.empty:
            headers.append(df.columns)
    return headers
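Called on the list above, it should return the column Index of each empty frame, for example:

empty_headers = check(dataframe_lists)  # e.g. [Index([...], dtype='object')] if one frame is empty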
I am wondering whether it is possible to use this function so that, when one of the dataframes is empty, just that empty dataframe's headers are appended to the concatenated dataframe. The output would be a single row of headers, with any repeated column name appearing only once (as pd.concat already does). I have two sample data sources, one and two, both non-empty. Here is an empty dataframe.
I would like the resulting concatenation to have the column headers
'AT', 'AccountNum', 'AcctType', 'Amount', 'City', 'Comment', 'Country', 'DuplicateAddressFlag', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress', 'PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1', 'Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA', 'WC', 'Zip'
and to have an empty dataframe's headers appended in line with this row (if they are new):
'A', 'AT', 'AccountNum', 'AcctType', 'Amount', 'B', 'C', 'City', 'Comment', 'Country', 'D', 'DuplicateAddressFlag', 'E', 'F', 'FromAccount', 'FromAccountNum', 'FromAccountT', 'G', 'PN', 'PriorCity', 'PriorCountry', 'PriorState', 'PriorStreetAddress', 'PriorStreetAddress2', 'PriorZip', 'RTID', 'State', 'Street1', 'Street2', 'Timestamp', 'ToAccount', 'ToAccountNum', 'ToAccountT', 'TransferAmount', 'TransferMade', 'TransferTimestamp', 'Ttype', 'WA', 'WC', 'Zip'
I welcome feedback on the best method to do this.
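Something like the following sketch is what I have in mind (assuming the empty frame is what trips up pd.concat, and reusing dataframe_lists from above): concatenate only the non-empty frames, then reindex the result against the union of every frame's columns, so headers that appear only in the empty frame still show up:

import pandas as pd

# Union of every frame's column labels, including the empty frame's.
all_columns = pd.Index([])
for df in dataframe_lists:
    all_columns = all_columns.union(df.columns)

# Concatenate only the non-empty frames, then add any missing
# columns (filled with NaN) via reindex.
non_empty = [df for df in dataframe_lists if not df.empty]
result = pd.concat(non_empty, ignore_index=True).reindex(columns=all_columns)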
As the answer below details, this is a rather unexpected result:
Unfortunately, due to the sensitivity of this material, I cannot share the actual data. Leading up to what is presented in the gist is the following:
A = data[data['RRT'] == 'A']  # select just the rows where RRT == 'A' from the dataframe "data"
B = data[data['RRT'] == 'B']
C = data[data['RRT'] == 'C']
D = data[data['RRT'] == 'D']
For each of the new data frames I then apply this logic:
for column_name, column in A.transpose().iterrows():
    AColumns = A[['ANum', 'RTID', 'Description', 'Type', 'Status', 'AD', 'CD', 'OD', 'RCD']]  # select a subset of columns from the dataframe "A"
When I access the bound method count on the empty dataframe A (note that AColumns.count without parentheses returns the bound method itself rather than calling it):
AColumns.count
This is the output:
<bound method DataFrame.count of Empty DataFrame
Columns: [ANum,RTID, Description,Type,Status, AD, CD, OD, RCD]
Index: []>
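Calling the method with parentheses, rather than just referencing it, would return the actual per-column counts, all zero for this empty frame:

AColumns.count()  # a Series of non-null counts per column, all 0 here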
Finally, I imported the CSV with the following:
data = pd.read_csv('Merged_Success2.csv', dtype=str, error_bad_lines=False, iterator=True, chunksize=1000)
data = pd.concat([chunk for chunk in data], ignore_index=True)
I am not certain what else I can provide. The concatenation works with all the other data frames needed to meet the requirement. I have also looked at the pandas internals.py and the full trace. Either I have too many columns with NaN, duplicate column names, or mixed dtypes (the last being the least likely culprit).
Thank you again for your guidance.
During one of our projects we experienced the same error. After debugging we found the problem: one of our dataframes had two columns with the same name. After renaming one of the columns, our problem was solved.
This often means that you have two columns with the same name in one of the dataframes. You can check whether this is the case by looking at the output of

len(df.columns) > len(np.unique(df.columns))

for each dataframe df that you are trying to concatenate (this assumes import numpy as np).
You can identify the culprit columns using Counter, for example:

from collections import Counter

duplicates = [col for col, count in Counter(df.columns).items() if count > 1]
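To run this check across every frame in the list from the question (a small sketch assuming the dataframe_lists variable defined there):

from collections import Counter

for i, df in enumerate(dataframe_lists):
    dupes = [col for col, count in Counter(df.columns).items() if count > 1]
    if dupes:
        print('dataframe {} has duplicated columns: {}'.format(i, dupes))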