I am fetching data from a big DB, and I am trying to represent. I want to use pandas to trim it however, as the SQL DB query uses several joins and the returned table actually has a lot of duplicate cells. I found the following question:
Python Pandas Remove Duplicate Cells - Keep the rows
This does the job for me, however with some bugs. I think I might be using it wrong. Can someone explain to me a bit how it works?
I my case (see the spinet) I use a column called Product as the main axis of the loc (if this is correct to be said like that). Afterwards I am trying to filter the duplicate cells in ALL the other columns from the dataframe and replace with empty string:
df.loc[df.Product.duplicated(), ['Type', 'ID', 'Device', 'MID',
'Technical Description', 'Zx', 'Package', 'Connector',
'Wavelength', 'Chip', 'Code', 'FType',
'Orientation']] = ' '
This removes cells from the Device column which actually are not duplicates. I really can't get it. I apologize if this is stupid!
Thanks a lot!
EDIT: Here is the code for the dataframe and DB connections:
with connect(database='measurements') as connSQL: # Use a context handler for the connection
pdQuery = pd.read_sql_query(''' SQL CODE ''', connSQL)
df = pd.DataFrame(pdQuery, columns=['Product', 'Type', 'ID', 'Device', 'MID',
'Technical Description', 'Zx', 'Package', 'Connector',
'Wavelength', 'Chip', 'Code', 'FType',
'Orientation'])
Output for only 1 product selected:
Without cell filtering:

With cell filtering: The code described above.

Is there a way to selectively remove cells? Should I apply this to several columns in order to leave only the unique values per column?
If need replace duplicated values to empty strings for each column separately use DataFrame.mask with Series.duplicated per each column by DataFrame.apply:
df = df.mask(df.apply(lambda x: x.duplicated()), '')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With