I have a list of items and want to clean the data under certain conditions, with a DataFrame as the output. Here's the list:
[
"Onion per Pack|500 g|Rp18,100|Rp3,700 / 100 g|Add to cart",
"Shallot per Pack|250 g|-|49%|Rp22,300|Rp11,300|Rp4,600 / 100 g|Add to cart",
"Spring Onion per Pack|250 g|Rp7,000|Rp2,800 / 100 g|Add to cart",
"Green Beans per Pack|250 g|Rp5,900|Rp2,400 / 100 g|Add to cart",
]
into
| name | unit | discount | price | unit price |
|---|---|---|---|---|
| Onion per Pack | 500 g | | Rp18,100 | Rp3,700 / 100 g |
| Shallot per Pack | 250 g | 49% | Rp22,300 | Rp11,300 |
| Spring Onion per Pack | 250 g | | Rp7,000 | Rp2,800 / 100 g |
| Green Beans per Pack | 250 g | | Rp5,900 | Rp2,400 / 100 g |
Currently my code is:
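import pandas as pd

# `item` is the list of strings shown above.
rows = []
for i in item:
    parts = i.split("|")  # split once and reuse; indices are 0-based
    if len(parts) == 5:
        # No discount: name, unit, price, unit price, button text
        data = {"name": parts[0],
                "unit": parts[1],
                "discount": "",
                "price": parts[2],
                "unit price": parts[3]}
    else:
        # Discounted: name, unit, "-", discount, old price, new price, unit price, button text
        data = {"name": parts[0],
                "unit": parts[1],
                "discount": parts[3],
                "price": parts[5],  # discounted price; parts[4] is the pre-discount price
                "unit price": parts[6]}
    rows.append(data)
# Build the frame once; DataFrame.append returned a copy and was removed in pandas 2.0.
datas = pd.DataFrame(rows)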
Is there a more efficient way? A shorter way to achieve this?
Once the source data has been cleaned (preferably by the provider) and each field is defined, so that every row has the same number of fields, the following very simple approach can be used to populate the DataFrame:
Data:
cols = ['name', 'unit', 'discount', 'price', 'unit_price', 'other']
# Fields are defined by placing a 'double delimiter' indicating empty fields.
items = ["Onion per Pack|500 g||Rp18,100|Rp3,700 / 100 g|Add to cart",
"Shallot per Pack|250 g|49%|Rp22,300|Rp4,600 / 100 g|Add to cart",
"Spring Onion per Pack|250 g||Rp7,000|Rp2,800 / 100 g|Add to cart",
"Green Beans per Pack|250 g||Rp5,900|Rp2,400 / 100 g|Add to cart"]
Population:
The cleaned source data can be passed directly to the DataFrame constructor via the data parameter. In the case below, a generator expression iterates over the dataset and splits each item on the field delimiter.
The next statement removes the extra 'other' column, which is not wanted in the output.
import pandas as pd

df = pd.DataFrame(data=(i.split('|') for i in items), columns=cols)
df.drop('other', axis=1, inplace=True)
Output:
                    name   unit  discount     price       unit_price
0         Onion per Pack  500 g            Rp18,100  Rp3,700 / 100 g
1       Shallot per Pack  250 g       49%  Rp22,300  Rp4,600 / 100 g
2  Spring Onion per Pack  250 g             Rp7,000  Rp2,800 / 100 g
3   Green Beans per Pack  250 g             Rp5,900  Rp2,400 / 100 g
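If the source cannot be cleaned upstream, the raw strings from the question could be normalised in code before loading. The following is only a sketch, and it rests on assumptions inferred from the four sample rows: an undiscounted row has 5 fields, a discounted row has 8 (a "-" placeholder, the discount, the original price, then the discounted price), and the original price is the one kept so that the result matches the cleaned data above. The function name `normalise` and the variable `raw_items` are hypothetical.

```python
raw_items = [
    "Onion per Pack|500 g|Rp18,100|Rp3,700 / 100 g|Add to cart",
    "Shallot per Pack|250 g|-|49%|Rp22,300|Rp11,300|Rp4,600 / 100 g|Add to cart",
    "Spring Onion per Pack|250 g|Rp7,000|Rp2,800 / 100 g|Add to cart",
    "Green Beans per Pack|250 g|Rp5,900|Rp2,400 / 100 g|Add to cart",
]


def normalise(item):
    """Map one raw string onto [name, unit, discount, price, unit_price].

    Only the two layouts seen in the sample data are handled (an assumption).
    """
    parts = item.split('|')
    if len(parts) == 5:
        # name, unit, price, unit price, button text -> no discount
        name, unit, price, unit_price, _ = parts
        return [name, unit, '', price, unit_price]
    if len(parts) == 8:
        # name, unit, '-', discount, original price, discounted price,
        # unit price, button text -> keep the original price
        name, unit, _, discount, price, _, unit_price, _ = parts
        return [name, unit, discount, price, unit_price]
    # Fail loudly rather than guess at an unknown layout.
    raise ValueError(f"unexpected number of fields: {item!r}")


rows = [normalise(i) for i in raw_items]
for row in rows:
    print(row)
```

The resulting `rows` can then be handed to `pd.DataFrame` together with the five wanted column names, in the same way as the generator expression above, with no 'other' column left to drop. To keep the discounted price instead, bind `price` to the sixth field rather than the fifth in the 8-field branch.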