The pandas DataFrame.duplicated() function is used to find duplicate rows, either across all columns or across a selected subset of columns. A duplicate row is one that repeats the values of another row on the compared columns.
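For example, a minimal sketch with a made-up frame (not from the original text):
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [3, 3, 4, 5]})
# Mark rows that repeat an earlier row across all columns
print(df.duplicated())              # only row 1 is True
# Select the duplicate rows themselves
print(df[df.duplicated()])
# Compare on selected columns only
print(df.duplicated(subset=['a']))  # rows 1 and 3 are True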
Use DataFrame.drop_duplicates() to drop duplicate rows and keep the first occurrence. Called without any arguments, it drops rows that have the same values on all columns.
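And the matching drop_duplicates() sketch, reusing the same hypothetical frame from above:
# Keep the first occurrence of each fully duplicated row (default keep='first')
print(df.drop_duplicates())
# Compare on column 'a' only and keep the last occurrence instead
print(df.drop_duplicates(subset=['a'], keep='last'))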
You can use the Series.to_list method.
For example:
import pandas as pd
df = pd.DataFrame({'a': [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9],
                   'b': [3, 5, 6, 2, 4, 6, 7, 8, 7, 8, 9]})
print(df['a'].to_list())
Output:
[1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9]
To drop duplicates you can do one of the following:
>>> df['a'].drop_duplicates().to_list()
[1, 3, 5, 7, 4, 6, 8, 9]
>>> list(set(df['a'])) # as pointed out by EdChum
[1, 3, 4, 5, 6, 7, 8, 9]
Note that set() does not preserve the original order, while drop_duplicates() keeps the first occurrence of each value in order.
I'd like to clarify a few things:

1. As other answers point out, the simplest way to get a list is pandas.Series.tolist(). I'm not sure why the top-voted answer leads off with pandas.Series.values.tolist(), since as far as I can tell it adds syntax/confusion with no added benefit.

2. tst[lookupValue][['SomeCol']] is a dataframe (as stated in the question), not a series (as stated in a comment to the question). This is because tst[lookupValue] is a dataframe, and slicing it with [['SomeCol']] asks for a list of columns (a list that happens to have a length of 1), so a dataframe comes back. If you remove the extra set of brackets, as in tst[lookupValue]['SomeCol'], you are asking for just that one column rather than a list of columns, and you get a series back. You need a series to use pandas.Series.tolist(), so you should definitely skip the second set of brackets in this case. FYI, if you ever end up with a one-column dataframe that isn't easily avoidable like this, you can use pandas.DataFrame.squeeze() to convert it to a series.

3. tst[lookupValue]['SomeCol'] gets a subset of a particular column via chained slicing: it slices once to get a dataframe with only certain rows left, and then slices again to get a certain column. You can get away with that here because you are only reading, not writing, but the proper way is tst.loc[lookupValue, 'SomeCol'], which returns a series (a short sketch after this list illustrates the writing case).

4. Using the .loc syntax from the previous point, you can do it all in one step: ID = tst.loc[tst['SomeCol'] == 'SomeValue', 'SomeCol'].tolist()
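To illustrate the reading-vs-writing point, here is a small hedged sketch with made-up data (tst and lookupValue are placeholder names mirroring the question, not code from the original answer):
import pandas as pd

tst = pd.DataFrame({'SomeCol': ['SomeValue', 'Other', 'SomeValue'],
                    'ID': [10, 20, 30]})
lookupValue = tst['SomeCol'] == 'SomeValue'

# Reading through chained slicing works...
print(tst[lookupValue]['SomeCol'].tolist())   # ['SomeValue', 'SomeValue']

# ...but writing through it usually hits a temporary copy and warns
# (SettingWithCopyWarning); the original frame may be left unchanged.
tst[lookupValue]['SomeCol'] = 'NewValue'

# Writing through .loc targets the original frame directly.
tst.loc[lookupValue, 'SomeCol'] = 'NewValue'
print(tst['SomeCol'].tolist())                # ['NewValue', 'Other', 'NewValue']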
Demo Code:
import pandas as pd
df = pd.DataFrame({'colA': [1, 2, 1],
                   'colB': [4, 5, 6]})
filter_value = 1

print("df")
print(df)
print(type(df))

rows_to_keep = df['colA'] == filter_value
print("\ndf['colA'] == filter_value")
print(rows_to_keep)
print(type(rows_to_keep))

result = df[rows_to_keep]['colB']
print("\ndf[rows_to_keep]['colB']")
print(result)
print(type(result))

result = df[rows_to_keep][['colB']]
print("\ndf[rows_to_keep][['colB']]")
print(result)
print(type(result))

result = df[rows_to_keep][['colB']].squeeze()
print("\ndf[rows_to_keep][['colB']].squeeze()")
print(result)
print(type(result))

result = df.loc[rows_to_keep, 'colB']
print("\ndf.loc[rows_to_keep, 'colB']")
print(result)
print(type(result))

result = df.loc[df['colA'] == filter_value, 'colB']
print("\ndf.loc[df['colA'] == filter_value, 'colB']")
print(result)
print(type(result))

ID = df.loc[rows_to_keep, 'colB'].tolist()
print("\ndf.loc[rows_to_keep, 'colB'].tolist()")
print(ID)
print(type(ID))

ID = df.loc[df['colA'] == filter_value, 'colB'].tolist()
print("\ndf.loc[df['colA'] == filter_value, 'colB'].tolist()")
print(ID)
print(type(ID))
Result:
df
   colA  colB
0     1     4
1     2     5
2     1     6
<class 'pandas.core.frame.DataFrame'>

df['colA'] == filter_value
0     True
1    False
2     True
Name: colA, dtype: bool
<class 'pandas.core.series.Series'>

df[rows_to_keep]['colB']
0    4
2    6
Name: colB, dtype: int64
<class 'pandas.core.series.Series'>

df[rows_to_keep][['colB']]
   colB
0     4
2     6
<class 'pandas.core.frame.DataFrame'>

df[rows_to_keep][['colB']].squeeze()
0    4
2    6
Name: colB, dtype: int64
<class 'pandas.core.series.Series'>

df.loc[rows_to_keep, 'colB']
0    4
2    6
Name: colB, dtype: int64
<class 'pandas.core.series.Series'>

df.loc[df['colA'] == filter_value, 'colB']
0    4
2    6
Name: colB, dtype: int64
<class 'pandas.core.series.Series'>

df.loc[rows_to_keep, 'colB'].tolist()
[4, 6]
<class 'list'>

df.loc[df['colA'] == filter_value, 'colB'].tolist()
[4, 6]
<class 'list'>
You can use pandas.Series.tolist
e.g.:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
Run:
>>> df['a'].tolist()
You will get
[1, 2, 3]
The above solution is good if all the data has the same dtype. NumPy arrays are homogeneous containers, so when you call df.values the output is a single NumPy array. If the data contains both int and float columns, everything is cast to one common dtype and the columns lose their original dtypes.
Consider df
     a  b
0  1.0  4
1  2.0  5
2  3.0  6
with dtypes
a    float64
b      int64
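To see the dtype loss concretely, here is a small hedged sketch (not in the original answer) that builds that frame explicitly:
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4, 5, 6]})
# .values must pick one common NumPy dtype for the whole array,
# so the int64 column 'b' is upcast to float64
print(df.values.dtype)      # float64
print(df.values.tolist())   # [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]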
So if you want to keep the original dtypes, you can do something like
row_list = df.to_csv(None, header=False, index=False).split('\n')
This will return each row as a string:
['1.0,4', '2.0,5', '3.0,6', '']
Then split each row to get a list of lists. Each element after splitting is a string, so we need to convert it to the required datatype:
def f(row_str):
    row_list = row_str.split(',')
    return [float(row_list[0]), int(row_list[1])]

df_list_of_list = list(map(f, row_list[:-1]))  # [:-1] drops the trailing empty string
[[1.0, 4], [2.0, 5], [3.0, 6]]
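As an aside (not part of the original answer), the same per-column dtypes can also be kept without the CSV round trip, since itertuples pulls values column by column:
# Each value keeps its own column's dtype: 'a' entries stay float, 'b' entries stay int
df_list_of_list = [list(row) for row in df.itertuples(index=False, name=None)]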