
dtypes muck things up when shifting on axis one (columns)

Tags: python, pandas

Consider the dataframe df

df = pd.DataFrame(dict(A=[1, 2], B=['X', 'Y']))

df

   A  B
0  1  X
1  2  Y

If I shift along axis=0 (the default)

df.shift()

     A    B
0  NaN  NaN
1  1.0    X

It pushes all rows downwards one row as expected.

But when I shift along axis=1

df.shift(axis=1)

    A    B
0 NaN  NaN
1 NaN  NaN

Everything is null, whereas I expected

     A  B
0  NaN  1
1  NaN  2

I understand why this happened. For axis=0, Pandas operates column by column, where each column has a single dtype, and when shifting there is a clear protocol for handling the NaN values introduced at the beginning or end. But shifting along axis=1 introduces potential ambiguity of dtype from one column to the next. In this case, I'm trying to force int64 values into an object column, and Pandas decides to just null the values.

This becomes more problematic when the dtypes are int64 and float64

df = pd.DataFrame(dict(A=[1, 2], B=[1., 2.]))

df

   A    B
0  1  1.0
1  2  2.0

And the same thing happens

df.shift(axis=1)

    A   B
0 NaN NaN
1 NaN NaN

My Question

What are good options for creating a dataframe that is shifted along axis=1 in which the result has shifted values and dtypes?

For the int64/float64 case the result would look like:

df_shifted

     A  B
0  NaN  1
1  NaN  2

and

df_shifted.dtypes

A    object
B     int64
dtype: object

A more comprehensive example

df = pd.DataFrame(dict(A=[1, 2], B=[1., 2.], C=['X', 'Y'], D=[4., 5.], E=[4, 5]))

df

   A    B  C    D  E
0  1  1.0  X  4.0  4
1  2  2.0  Y  5.0  5

Should look like this

df_shifted

     A  B    C  D    E
0  NaN  1  1.0  X  4.0
1  NaN  2  2.0  Y  5.0

df_shifted.dtypes

A     object
B      int64
C    float64
D     object
E    float64
dtype: object
asked Nov 05 '19 16:11 by piRSquared



2 Answers

It turns out that Pandas is shifting over blocks of similar dtypes

Define df as

df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))

df

#  i    f  o    f  i  o
#  n    l  b    l  n  b
#  t    t  j    t  t  j
#
   A    B  C    D  E  F
0  1  3.0  X  5.0  7  W
1  2  4.0  Y  6.0  8  Z

It shifts the integers to the next integer column, the floats to the next float column, and the objects to the next object column. The first column of each dtype has nothing to receive, so it becomes NaN.

df.shift(axis=1)

    A   B    C    D    E  F
0 NaN NaN  NaN  3.0  1.0  X
1 NaN NaN  NaN  4.0  2.0  Y

I don't know if that's a good idea, but that is what is happening.
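To see which columns fall into the same dtype group for a given frame, a minimal sketch like the following can help. It only inspects df.dtypes and does not call shift, so it does not depend on how a particular pandas version implements shifting; the name columns_by_dtype is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))

# Map each dtype name to the columns that hold it, preserving column order.
# columns_by_dtype is an illustrative name, not a pandas API.
columns_by_dtype = {}
for col, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(col)

for name, cols in columns_by_dtype.items():
    print(name, cols)
```

Within each group, the shift output above moves values from one listed column to the next one in the same list.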


Approaches

astype(object) first

dtypes = df.dtypes.shift(fill_value=object)
df_shifted = df.astype(object).shift(1, axis=1).astype(dtypes)

df_shifted

     A  B    C  D    E  F
0  NaN  1  3.0  X  5.0  7
1  NaN  2  4.0  Y  6.0  8

transpose

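Transposing a mixed-dtype frame makes every column object, so the shift then works and the dtypes can be restored afterwards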

dtypes = df.dtypes.shift(fill_value=object)
df_shifted = df.T.shift().T.astype(dtypes)

df_shifted

     A  B    C  D    E  F
0  NaN  1  3.0  X  5.0  7
1  NaN  2  4.0  Y  6.0  8
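The claim that transposing yields object columns can be checked directly. This is a small sketch on the same six-column frame; transposed_dtypes is just an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))

# After the transpose, each column holds one original row, which mixes
# ints, floats and strings, so pandas falls back to object dtype.
transposed_dtypes = df.T.dtypes
print(transposed_dtypes)
```

Because everything is object at that point, the final astype(dtypes) step is what brings the numeric dtypes back.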

itertuples

pd.DataFrame([(np.nan, *t[1:-1]) for t in df.itertuples()], columns=[*df])

     A  B    C  D    E  F
0  NaN  1  3.0  X  5.0  7
1  NaN  2  4.0  Y  6.0  8

Though I'd probably do this

pd.DataFrame([
    (np.nan, *t[:-1]) for t in
    df.itertuples(index=False, name=None)
], columns=[*df])
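Another option, not from the approaches above, is to rebuild the frame column by column so that each target column simply reuses its source column and dtype. The sketch below assumes unique column labels; shift_columns is a hypothetical helper name, and the NaN fill is stored in an object column to match the dtypes asked for in the question.

```python
import numpy as np
import pandas as pd

def shift_columns(df, periods=1):
    """Hypothetical helper: shift columns sideways, keeping each source dtype."""
    n_cols = df.shape[1]
    pieces = {}
    for target in range(n_cols):
        source = target - periods
        if 0 <= source < n_cols:
            # Reuse the source column unchanged so its dtype carries over.
            pieces[df.columns[target]] = df.iloc[:, source]
        else:
            # No source column exists: fill with NaN in an object column.
            pieces[df.columns[target]] = pd.Series(
                np.nan, index=df.index, dtype=object
            )
    return pd.DataFrame(pieces, index=df.index)

df = pd.DataFrame(dict(
    A=[1, 2], B=[3., 4.], C=['X', 'Y'],
    D=[5., 6.], E=[7, 8], F=['W', 'Z']
))
df_shifted = shift_columns(df)
print(df_shifted)
print(df_shifted.dtypes)
```

A negative periods value shifts to the left in the same way, with the NaN columns appearing on the right.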
answered Sep 26 '22 02:09 by piRSquared


I tried using a numpy method. The method works as long as you keep your data in a numpy array:

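import numpy as np

def shift_df(data, n):
    # Roll along the column axis; np.roll converts the DataFrame to an object ndarray
    shifted = np.roll(data, n, axis=1)
    # The first n columns now hold wrapped-around values, so blank them out
    shifted[:, :n] = np.nan

    return shifted

shift_df(df, 1)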

array([[nan, 1, 1.0, 'X', 4.0],
       [nan, 2, 2.0, 'Y', 5.0]], dtype=object)

But when you call the DataFrame constructor, all columns are converted to object, even though the values in the array are float, int, and object:

def shift_df(data, n):
    # Roll along the column axis, then blank out the wrapped-around columns
    shifted = np.roll(data, n, axis=1)
    shifted[:, :n] = np.nan
    shifted = pd.DataFrame(shifted)

    return shifted

print(shift_df(df, 1),'\n')
print(shift_df(df, 1).dtypes)

     0  1  2  3  4
0  NaN  1  1  X  4
1  NaN  2  2  Y  5 

0    object
1    object
2    object
3    object
4    object
dtype: object
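One possible way around the all-object result is to let pandas re-infer the dtypes with DataFrame.infer_objects() after building the frame. This is a sketch rather than part of the answer above: shift_df_infer is a hypothetical name, n is assumed to be a positive integer, and the all-NaN leading column may come back as a float column rather than object.

```python
import numpy as np
import pandas as pd

def shift_df_infer(data, n):
    """Hypothetical variant of shift_df that re-infers dtypes after the roll."""
    values = data.to_numpy(dtype=object)
    # np.roll returns a new array, so the assignment below does not touch data.
    shifted = np.roll(values, n, axis=1)
    shifted[:, :n] = np.nan
    frame = pd.DataFrame(shifted, index=data.index, columns=data.columns)
    # Soft-convert object columns to a more specific dtype where possible.
    return frame.infer_objects()

df = pd.DataFrame(dict(A=[1, 2], B=[1., 2.], C=['X', 'Y'], D=[4., 5.], E=[4, 5]))
result = shift_df_infer(df, 1)
print(result)
print(result.dtypes)
```

Unlike the astype(dtypes) approaches, this relies on inference from the values rather than on the original dtypes, so a column of whole-number floats stays float only because the values themselves are still floats.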
answered Sep 22 '22 02:09 by Erfan