I am trying to use pandas to read an unformatted Excel spreadsheet. There are multiple tables within a single sheet, and I want to convert these tables into dataframes. Since the sheet is not "indexed" in the traditional way, there are no meaningful column or row indices. Is there a way to search for a specific value and get the row and column where it occurs? For example, say I want the row and column numbers of all cells that contain the string "Title".
I have already tried things like DataFrame.filter but that only works if there are row and column indices.
To tell pandas to start reading an Excel sheet from a specific row, pass header=<0-indexed row number> to pd.read_excel. By default header=0, so the first row of the sheet supplies the column names of the data frame. To skip rows at the end of a sheet, pass skipfooter=<number of rows to skip>.
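When a sheet holds several tables, another option is to read everything with header=None, so that no row is promoted to column names, and then cut each table out in memory. The sketch below is only an illustration: load_raw_sheet and cut_table are hypothetical helper names, and the small raw frame stands in for what read_excel would return for a real file.

```python
import pandas as pd

def load_raw_sheet(path, sheet_name=0):
    # header=None keeps every cell as data; rows and columns are labelled 0..n-1.
    # (Hypothetical helper, not called here because no file is assumed.)
    return pd.read_excel(path, sheet_name=sheet_name, header=None)

def cut_table(raw, header_row, first_col, last_row, last_col):
    # iloc end bounds are exclusive, hence the +1 on both axes.
    block = raw.iloc[header_row:last_row + 1, first_col:last_col + 1]
    # Everything below the header row becomes the table body.
    table = block.iloc[1:].reset_index(drop=True)
    # The first row of the block supplies the column names.
    table.columns = block.iloc[0].tolist()
    return table

# Stand-in for a raw sheet: a title row followed by a two-column table.
raw = pd.DataFrame([
    ["Title", None],
    ["name", "qty"],
    ["apple", 3],
    ["pear", 5],
])

table = cut_table(raw, header_row=1, first_col=0, last_row=3, last_col=1)
print(table)
```

The positions passed to cut_table are assumed to be known already; in practice they would come from searching the raw frame for a marker such as "Title".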
Create a df with NaN where your_value is not found.
Drop all rows that don't contain the value.
Drop all columns that don't contain the value.
a = df.where(df == 'your_value').dropna(how='all').dropna(axis=1, how='all')
To get the row(s)
a.index
To get the column(s)
a.columns
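Note that a.index and a.columns list the matching rows and the matching columns separately, not as (row, column) pairs. If pairs are needed, one possible sketch is to build a boolean mask and pass it to numpy.argwhere. The example frame here is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for a sheet read with header=None.
df = pd.DataFrame([
    ["Title", "x", None],
    ["a", "b", "Title"],
    ["c", "Title", "d"],
])

# True wherever a cell equals the search value; missing cells compare as False.
mask = df.eq("Title").to_numpy()

# argwhere returns one (row position, column position) pair per True cell,
# in row-major order.
positions = [(int(r), int(c)) for r, c in np.argwhere(mask)]
print(positions)  # [(0, 0), (1, 2), (2, 1)]

# Positions can be mapped back to labels if the frame has meaningful ones.
labels = [(df.index[r], df.columns[c]) for r, c in positions]
```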
You can use a long, hard-to-read list comprehension:
# assume this df and that we are looking for 'abc'
import pandas as pd
df = pd.DataFrame({'col':['abc', 'def','wert','abc'], 'col2':['asdf', 'abc', 'sdfg', 'def']})
[(df[col][df[col].eq('abc')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].eq('abc')].index))]
out:
[(0, 0), (3, 0), (1, 1)]
Note that each tuple is (index value, column location).
You can also change .eq() to .str.contains() if you are looking for any string that contains a certain value:
[(df[col][df[col].str.contains('ab')].index[i], df.columns.get_loc(col)) for col in df.columns for i in range(len(df[col][df[col].str.contains('ab')].index))]
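The same search can be written as an explicit loop, which may be easier to read and also copes with columns that are not strings (where .str.contains would fail). This is a sketch, not a drop-in replacement: find_substring is a hypothetical helper name, and the extra numeric column n is added only to show that mixed dtypes are handled.

```python
import pandas as pd

df = pd.DataFrame({
    'col': ['abc', 'def', 'wert', 'abc'],
    'col2': ['asdf', 'abc', 'sdfg', 'def'],
    'n': [1, 2, 3, 4],  # non-string column, added for illustration
})

def find_substring(frame, text):
    """Return (index value, column location) for cells whose text contains `text`."""
    hits = []
    for col_pos in range(frame.shape[1]):
        s = frame.iloc[:, col_pos]
        # Convert to str so numeric cells can be searched; regex=False makes
        # this a plain substring test, and notna() excludes missing cells.
        found = s.astype(str).str.contains(text, regex=False, na=False) & s.notna()
        hits.extend((row, col_pos) for row in s.index[found.to_numpy(dtype=bool)])
    return hits

result = find_substring(df, 'ab')
print(result)
```

As in the list comprehension, results are grouped by column, so matches in the first column come before matches in the second.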