I have created a large DataFrame by pulling data from an Azure database. Constructing the DataFrame wasn't simple, as I had to do it in parts, using the concat function to add new columns to the data set as they were pulled from the database.
This worked fine; however, I am indexing by entry date, and when concatenating I sometimes get two data rows with the same index. Is it possible to merge rows with the same index? I have searched online for solutions, but I always come across examples that merge two separate DataFrames rather than rows within the same DataFrame.
For example, I currently have:
                     Col1  Col2
2015-10-27 22:22:31  1400
2015-10-27 22:22:31        50.5
and I would like:
                     Col1  Col2
2015-10-27 22:22:31  1400  50.5
I have tried using the groupby function on the index, but that just made a mess: most of the data columns disappeared and a few very large numbers were spat out.
The data is in this sort of format, except with many more columns, and is generally quite sparse!
Col1 Col2 ... Col_n-1 Col_n
2015-10-27 21:15:60+0 1220
2015-10-27 21:25:4+0 1420
2015-10-27 21:28:8+0 1410
2015-10-27 21:37:10+0 51.5
2015-10-27 21:37:11+0 1500
2015-10-27 21:46:14+0 51
2015-10-27 21:46:15+0 1390
2015-10-27 21:55:19+0 1370
2015-10-27 22:04:24+0 1450
2015-10-27 22:13:28+0 1350
2015-10-27 22:22:31+0 1400
2015-10-27 22:22:31+0 50.5
2015-10-27 22:25:33+0 1300
2015-10-27 22:29:42+0 ... 1900
2015-10-27 22:29:42+0 63
2015-10-27 22:34:36+0 1280
concat() to Merge Two DataFrames by Index. You can join two DataFrames side by side, aligned on their index, by calling pandas.concat() with axis=1. By default, pd.concat stacks row-wise (axis=0) and uses an outer join (join='outer').
The concat() function can be used to concatenate two DataFrames by adding the rows of one to the other. The merge() function is comparable to the SQL JOIN clause: 'left', 'right', 'inner' and 'outer' joins are all possible.
We can use the concat function in pandas to append either columns or rows from one DataFrame to another.
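To make the difference between the two concat directions concrete, here is a minimal sketch. The frames a and b are hypothetical stand-ins for two separate pulls from the database; they are not taken from the question's actual data.

```python
import pandas as pd

# Hypothetical frames standing in for two separate pulls from the database.
ts = pd.to_datetime(["2015-10-27 22:22:31"])
a = pd.DataFrame({"Col1": [1400.0]}, index=ts)
b = pd.DataFrame({"Col2": [50.5]}, index=ts)

# Default axis=0: rows are stacked, so the shared timestamp appears twice
# and each row holds NaN in the column it did not come from.
stacked = pd.concat([a, b])
print(stacked)

# axis=1: columns are placed side by side, aligned on the index,
# so the shared timestamp ends up on a single row.
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side)
```

Stacking row-wise is one way to end up with the duplicated index labels described in the question, whereas axis=1 aligns on the index as long as each input frame's own index is unique.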
Building on @EdChum's answer, it is also possible to use the min_count parameter of GroupBy.sum to manage NaN values in different ways. Let's say we add a row to the example:
Col1 Col2
2015-10-27 22:22:31 1400 NaN
2015-10-27 22:22:31 NaN 50.5
2022-08-02 16:00:00 1600 NaN
then,
In [184]:
df.groupby('index').sum(min_count=1)
Out[184]:
Col1 Col2
index
2015-10-27 22:22:31 1400 50.5
2022-08-02 16:00:00 1600 NaN
Using min_count=0 (the default) will output 0 instead of NaN values.
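The two settings can be compared side by side with a self-contained sketch. It rebuilds the three-row example above by hand and groups with level=0 rather than by the index name, which should be equivalent here since the frame has a single index level.

```python
import numpy as np
import pandas as pd

# Rebuild the three-row example: two rows share a timestamp, one is unique.
idx = pd.to_datetime(
    ["2015-10-27 22:22:31", "2015-10-27 22:22:31", "2022-08-02 16:00:00"]
)
df = pd.DataFrame(
    {"Col1": [1400.0, np.nan, 1600.0], "Col2": [np.nan, 50.5, np.nan]},
    index=idx,
)
df.index.name = "index"

# min_count=1: a group needs at least one non-NaN value, otherwise the result is NaN.
keep_nan = df.groupby(level=0).sum(min_count=1)

# min_count=0: an all-NaN group sums to 0.
zero_fill = df.groupby(level=0).sum(min_count=0)

print(keep_nan)
print(zero_fill)
```

For sparse data like the question's, min_count=1 is probably the safer choice, since it keeps "no reading" distinguishable from a genuine reading of 0.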
You can groupby on your index and call sum:
In [184]:
df.groupby(level=0).sum()
Out[184]:
Col1 Col2
index
2015-10-27 22:22:31 1400 50.5
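For anyone who wants to run this end to end, here is a minimal sketch that builds the two-row example from the question and collapses it. The helper name collapse_duplicate_index is hypothetical and only wraps the groupby call shown above.

```python
import numpy as np
import pandas as pd


def collapse_duplicate_index(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: merge rows sharing an index label by summing each column."""
    return df.groupby(level=0).sum()


# Two rows with the same timestamp, each filling a different column.
ts = pd.Timestamp("2015-10-27 22:22:31")
df = pd.DataFrame(
    {"Col1": [1400.0, np.nan], "Col2": [np.nan, 50.5]},
    index=[ts, ts],
)

merged = collapse_duplicate_index(df)
print(merged)
```

Note that sum only acts as a pure merge when each column has at most one non-NaN value per index label; if two rows carry real values in the same column for the same timestamp, those values are added together.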