
pandas or python equivalent of tidyr complete

I have data that looks like this:

library("tidyverse")

df <- tibble(user = c(1, 1, 2, 3, 3, 3), x = c("a", "b", "a", "a", "c", "d"), y = 1)
df

#    user     x     y
# 1     1     a     1
# 2     1     b     1
# 3     2     a     1
# 4     3     a     1
# 5     3     c     1
# 6     3     d     1

Python format:

import pandas as pd
df = pd.DataFrame({'user':[1, 1, 2, 3, 3, 3], 'x':['a', 'b', 'a', 'a', 'c', 'd'], 'y':1})

I'd like to "complete" the data frame so that every user has a row for every possible x, with y set to 0 wherever a user/x combination is missing.

This is somewhat trivial in R (tidyverse/tidyr):

df %>% 
    complete(nesting(user), x = c("a", "b", "c", "d"), fill = list(y = 0))

#    user     x     y
# 1     1     a     1
# 2     1     b     1
# 3     1     c     0
# 4     1     d     0
# 5     2     a     1
# 6     2     b     0
# 7     2     c     0
# 8     2     d     0
# 9     3     a     1
# 10    3     b     0
# 11    3     c     1
# 12    3     d     1

Is there an equivalent of complete in pandas / Python that yields the same result?

asked May 31 '17 14:05 by emehex



1 Answer

You can use reindex with a MultiIndex built by MultiIndex.from_product:

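# Index by the key columns so each (user, x) pair becomes an index entry
df = df.set_index(['user', 'x'])
# Cartesian product of the observed users and the observed x values
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['user', 'x'])
# Add the missing pairs with y = 0, then turn the index back into columns
df = df.reindex(mux, fill_value=0).reset_index()
print(df)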
    user  x  y
0      1  a  1
1      1  b  1
2      1  c  0
3      1  d  0
4      2  a  1
5      2  b  0
6      2  c  0
7      2  d  0
8      3  a  1
9      3  b  0
10     3  c  1
11     3  d  1
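
Note that the index levels only contain values that actually occur in the data, whereas the R call in the question lists the x values explicitly. A minimal sketch of that variant follows, assuming the full set of x values is known up front; the names all_x and completed, and the extra value 'e', are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'user': [1, 1, 2, 3, 3, 3],
                   'x': ['a', 'b', 'a', 'a', 'c', 'd'],
                   'y': 1})

# Assumption: the complete set of x values is known in advance.
# 'e' is a hypothetical value that never occurs in the data.
all_x = ['a', 'b', 'c', 'd', 'e']

# Every observed user paired with every listed x value
mux = pd.MultiIndex.from_product([df['user'].unique(), all_x],
                                 names=['user', 'x'])

# Missing (user, x) pairs are added with y = 0
completed = (df.set_index(['user', 'x'])
               .reindex(mux, fill_value=0)
               .reset_index())
print(completed)
```

Every user then gets a row for 'e' with y = 0, even though 'e' is absent from the input.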

Or set_index + unstack + stack:

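# Starting from the original df: pivot x into columns (gaps filled with 0), then stack back to long form
df = df.set_index(['user','x'])['y'].unstack(fill_value=0).stack().reset_index(name='y')
print(df)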
    user  x  y
0      1  a  1
1      1  b  1
2      1  c  0
3      1  d  0
4      2  a  1
5      2  b  0
6      2  c  0
7      2  d  0
8      3  a  1
9      3  b  0
10     3  c  1
11     3  d  1
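
The unstack approach needs each (user, x) pair to appear at most once; with duplicate pairs, unstack raises a ValueError. If the data may contain duplicates, one possible workaround is to aggregate while pivoting. The sketch below assumes that summing y is the right way to combine duplicates, which may not hold for your data; the frame dup is a hypothetical variant of the question's data:

```python
import pandas as pd

# Hypothetical variant of the question's data: (user=1, x='a') occurs twice.
dup = pd.DataFrame({'user': [1, 1, 1, 2, 3, 3, 3],
                    'x': ['a', 'a', 'b', 'a', 'a', 'c', 'd'],
                    'y': 1})

# Sum duplicates while pivoting x into columns, fill absent pairs with 0,
# then stack back to one row per (user, x).
completed_dup = (dup.pivot_table(index='user', columns='x', values='y',
                                 aggfunc='sum', fill_value=0)
                    .stack()
                    .reset_index(name='y'))
print(completed_dup)
```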

answered Oct 10 '22 01:10 by jezrael