I have input data in a flattened file. I want to normalize this data by splitting it into tables. Can I do that neatly with pandas - that is, by reading the flattened data into a DataFrame instance and then applying some functions to obtain the resulting DataFrame instances?
Example:
Data is given to me on disk in the form of a CSV file like this:
ItemId  ClientId  PriceQuoted  ItemDescription
1       1         10           scroll of Sneak
1       2         12           scroll of Sneak
1       3         13           scroll of Sneak
2       2         2500         scroll of Invisible
2       4         2200         scroll of Invisible
I want to create two DataFrames:
ItemId  ItemDescription
1       scroll of Sneak
2       scroll of Invisible
and
ItemId  ClientId  PriceQuoted
1       1         10
1       2         12
1       3         13
2       2         2500
2       4         2200
If pandas only has a good solution for the simplest case (normalization resulting in two tables with a many-to-one relationship - just like in the above example), that might be enough for my current needs. I may need a more general solution in the future, however.
In [29]: import pandas
In [30]: df = pandas.read_csv('foo1.csv', sep=r'\s{2,}', engine='python')  # regex separator requires the python engine
In [31]: df
Out[31]:
   ItemId  ClientId  PriceQuoted      ItemDescription
0       1         1           10      scroll of Sneak
1       1         2           12      scroll of Sneak
2       1         3           13      scroll of Sneak
3       2         2         2500  scroll of Invisible
4       2         4         2200  scroll of Invisible
In [32]: df1 = df[['ItemId', 'ItemDescription']].drop_duplicates().set_index('ItemId')
In [33]: df1
Out[33]:
            ItemDescription
ItemId
1           scroll of Sneak
2       scroll of Invisible
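One thing worth checking before treating df1 as a lookup table (a small sketch, not required for the split itself) is that every ItemId really carries a single ItemDescription; otherwise drop_duplicates() would silently keep conflicting rows:

# Sanity check: each ItemId should map to exactly one distinct description.
assert df.groupby('ItemId')['ItemDescription'].nunique().eq(1).all()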
In [34]: df2 = df[['ItemId', 'ClientId', 'PriceQuoted']]
In [35]: df2
Out[35]:
   ItemId  ClientId  PriceQuoted
0       1         1           10
1       1         2           12
2       1         3           13
3       2         2         2500
4       2         4         2200
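For the more general case hinted at in the question (several lookup tables split off one flat frame), the same two-step pattern can be wrapped in a small helper. The sketch below is only an illustration: the name split_table and its signature are made up here, it continues from the df read above, and it assumes each child table is keyed by a single column already present in the flat data.

def split_table(df, key, attribute_columns):
    # Pull attribute_columns out of the flat frame into a lookup table
    # indexed by key, and keep everything else in a second frame.
    lookup = df[[key] + attribute_columns].drop_duplicates().set_index(key)
    remaining = df.drop(columns=attribute_columns)
    return lookup, remaining

items, quotes = split_table(df, 'ItemId', ['ItemDescription'])

# Joining the pieces back together on ItemId is a handy round-trip check:
# on the example data it should reproduce the original rows, with
# ItemDescription re-attached as the last column.
roundtrip = quotes.join(items, on='ItemId')

Repeated calls with different keys would peel off further lookup tables, which is roughly what a more general normalization would look like.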