I am creating a dataframe
from a CSV file. I have gone through the docs, multiple SO posts, links as I have just started Pandas but didn't get it. The CSV file has multiple columns with same names say a
.
So after forming dataframe
and when I do df['a']
which value will it return? It does not return all values.
Also only one of the values will have a string rest will be None
. How can I get that column?
Pandas, however, can be tricked into allowing duplicate column names. Duplicate column names are a problem if you plan to transfer your data set to another statistical language. They're also a problem because it will cause unanticipated and sometimes difficult to debug problems in Python.
To drop duplicate columns from pandas DataFrame use df. T. drop_duplicates(). T , this removes all columns that have the same data regardless of column names.
In Pandas, we have the freedom to add columns in the data frame whenever needed. There are multiple ways to add columns to pandas dataframe.
the relevant parameter is mangle_dupe_cols
from the docs
mangle_dupe_cols : boolean, default True Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
by default, all of your 'a'
columns get named 'a.0'...'a.N'
as specified above.
if you used mangle_dupe_cols=False
, importing this csv
would produce an error.
you can get all of your columns with
df.filter(like='a')
demonstration
from StringIO import StringIO
import pandas as pd
txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df
df.filter(like='a')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With