I'm trying to unpivot my data using pd.melt
but no success so far. Each row is a business, and the data contains information about the business and multiple reviews. I want my data to have every review as a row.
My first 150 columns are in groups of 15, each group column name shares the same pattern reviews/n/
for 0 < n < 9
. (reviews/0/text
, reviews/0/date
, ... , reviews/9/date
).
The next 65 columns in the dataframe include more data about the business (e.g. business_id
, address
) that should remain as id_variables.
My current data looks like this:
business_id | address | reviews/0/date | reviews/0/text | reviews/1/date | reviews/1/text |
---|---|---|---|---|---|
12345 | 01 street | 1/1/1990 | "abc" | 2/2/1995 | "def" |
and my new dataframe should have every review as a row instead of every business, and look like this:
business_id | address | review_number | review_date | review_text |
---|---|---|---|---|
12345 | 01 street | 0 | 1/1/1990 | "abc" |
12345 | 01 street | 1 | 2/2/1995 | "def" |
I tried using pd.melt
but could not succeed in making code that produced something valuable to me.
You can use pandas.wide_to_long()
to do what you want.
However, you will need to rename your columns from the pattern reviews/N/COL
to reviews/COL/N
(or something similar) first, as wide_to_long()
can only unpivot based on prefixes, whereas in your column names, you have a prefix and a suffix.
You could do this manually or e.g. using the re
module and an appropriate regex:
df = df.rename(columns=lambda x: re.sub('reviews/(\d)/(.*)', r'review_\2\1', x))
After that, your data should look like this (note the changed colnames):
business_id | address | review_date0 | review_text0 | review_date1 | review_text1 |
---|---|---|---|---|---|
12345 | 01 street | 1/1/1990 | abc | 2/2/1995 | def |
Now you can use pandas.wide_to_long()
and use the stubnames
parameter to specify the prefix of the columns that should be grouped when you unpivot.
df = pd.wide_to_long(df,
stubnames=['review_date','review_text'],
i=['business_id', 'address'],
j='review_number')
Finally, call .reset_index()
to achieve the result you asked for.
Full example:
import re
import pandas as pd
df = pd.DataFrame({'business_id': 12345,
'address': '01 street',
'reviews/0/date': '1/1/1990',
'reviews/0/text': 'abc',
'reviews/1/date': '2/2/1995',
'reviews/1/text': 'def'}, index = [0])
df = df.rename(columns=lambda x: re.sub('reviews/(\d)/(.*)', r'review_\2\1', x))
df = pd.wide_to_long(df,
stubnames=['review_date','review_text'],
i=['business_id', 'address'],
j='review_number').reset_index()
Result:
business_id | address | review_number | review_date | review_text |
---|---|---|---|---|
12345 | 01 street | 0 | 1/1/1990 | abc |
12345 | 01 street | 1 | 2/2/1995 | def |
You can get the names of all the non-review columns.
columns = df.columns[~df.columns.str.match(r'reviews/\d+/')]
>>> columns
Index(['address', 'business_id'], dtype='object')
And use those to .melt()
df = df.melt(columns)
df['review_number'] = df['variable'].str.extract(r'reviews/(\d+)')
df['variable'] = df['variable'].str.replace(r'reviews/\d+/', 'review_')
>>> df
address business_id variable value review_number
0 street 1 review_date 1990-01-01 00:00:00 0
1 street 1 review_text "abc" 0
2 street 1 review_date 1995-02-02 00:00:00 1
3 street 1 review_text "def" 1
From there you can .pivot()
>>> df.pivot(index=columns.union(['review_number']).to_list(), columns='variable')
value
variable review_date review_text
address business_id review_number
street 1 0 1990-01-01 00:00:00 "abc"
1 1995-02-02 00:00:00 "def"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With