 

Python - Transform dataframe and slicing

Tags: python, pandas

I have attached a screenshot to help explain. I have a dataframe read from the Cleveland heart dataset. The file takes 76 columns, puts them into 7 columns per line, and wraps the remaining columns onto the following rows. I am trying to figure out how to get that dataframe into a readable format, as shown in the dataframe on the right-hand side of the screenshot.

[Screenshot: the wrapped dataframe and, on the right-hand side, the desired layout]

The variable xyz will always be the same, but the other letter variables I have listed will differ. I thought I could start with data.loc[:, :'xyz'], but I'm not sure where to go from there:

import pandas as pd

data = pd.read_csv("../resources/cleveland.data")
data.loc[:, :'xyz']

I will then have to go from there and assign column names to these variables. Surprisingly, the train, test, validate portion of this will be much easier once I get this sorted out. Thanks in advance for the help. (I'm a rookie)

asked Oct 17 '22 06:10 by bross


2 Answers

Input data

1   a   b   c
d   xyz 2   e
f   g   h   xyz
3   i   j   k

Code

import pandas as pd
import numpy as np

# The raw file has no header row, so pass header=None
df = pd.read_csv("../resources/cleveland.data", header=None)
cols = df.columns.tolist()

# Reset the index to keep each row's line number from the raw file
df = df.reset_index()

# Melt the frame so that every value ends up in a single column,
# then sort by original line and column so the values are in reading order
df = pd.melt(df, id_vars=['index'], value_vars=cols)
df = df.sort_values(by=['index', 'variable'])

# Then you can set the line number
df['line'] = np.where(df.value == 'xyz', 1, np.nan)
df.line = df.line.cumsum()
df.line = df.line.bfill()

# If the file doesn't end with 'xyz', the trailing values get line number df.line.max() + 1
df.loc[df.line.isna(), 'line'] = df.line.max() + 1
df.line = df.line.ffill()

# Number the columns within each line with a groupby cumsum
df['one'] = 1
df['col_name'] = df.groupby(['line'])['one'].cumsum()
df['col_name'] = "col_" + df['col_name'].astype('str')

# Then we can pivot the table
df = df[['value', 'line', 'col_name']]
df = df.pivot(index='line', columns='col_name', values='value')
print(df)

Output Data

col_name col_1 col_2 col_3 col_4 col_5 col_6
line
1.0          1     a     b     c     d   xyz
2.0          2     e     f     g     h   xyz
3.0          3     i     j     k   NaN   NaN
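
The line-numbering step is the core of the code above. As a minimal sketch, assuming a small in-memory Series whose values are made up for illustration (they are not from the Cleveland file), this shows how cumsum followed by bfill assigns every value to the record that ends at the next xyz:

```python
import numpy as np
import pandas as pd

# Hypothetical flattened values in reading order; 'xyz' ends each record.
values = pd.Series(["1", "a", "xyz", "2", "b", "xyz", "3"])

# Mark each terminator with 1 and everything else with NaN.
marks = pd.Series(np.where(values == "xyz", 1.0, np.nan))

# cumsum skips NaN, so only the terminators receive a running count (1, 2, ...).
line = marks.cumsum()

# bfill copies each terminator's count back onto the values that precede it.
line = line.bfill()

# Values after the last terminator are still NaN: give them a new record number.
line = line.fillna(line.max() + 1)

print(line.tolist())
```

The last value has no closing xyz, so it lands in a record of its own, which is what the `df.line.max() + 1` assignment in the answer handles.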

answered Nov 03 '22 06:11 by Charles R


Use numpy for this. Flatten all the values into one big 1-D array, then combine np.where (to find the position of each xyz) with np.array_split (to cut the array at the index right after each xyz):

Sample Data: test.csv

1,a,b,c,d,e,f,g
h,i,j,k,xyz,2,a,b
c,d,e,f,g,h,i,j
k,xyz

Code

import numpy as np
import pandas as pd

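# Flatten every value in the file into one 1-D array, in reading order
arr = pd.read_csv('test.csv', header=None).values.ravel()

# Cut right after each 'xyz', build one row per chunk, and drop the all-NaN leftover
result = pd.DataFrame(np.array_split(arr, np.where(arr == 'xyz')[0]+1)).dropna(how='all')
print(result)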

Output:

  0  1  2  3  4  5  6  7  8  9  10 11   12
0  1  a  b  c  d  e  f  g  h  i  j  k  xyz
1  2  a  b  c  d  e  f  g  h  i  j  k  xyz

With @CharlesR's sample data:

   0  1  2  3     4     5
0  1  a  b  c     d   xyz
1  2  e  f  g     h   xyz
2  3  i  j  k  None  None
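
To see what the split indices look like without reading a file, here is a minimal sketch on an in-memory array. The values are illustrative only, and the trailing NaN entries stand in for the padding that short rows would produce; np.where gives the positions of xyz, and adding 1 makes np.array_split cut after each one.

```python
import numpy as np
import pandas as pd

# Hypothetical flattened values; NaN stands for padding from short rows.
arr = np.array(
    ["1", "a", "b", "xyz", "2", "c", "d", "xyz", np.nan, np.nan],
    dtype=object,
)

# Positions of each terminator, shifted by one so the cut falls after it.
cut_points = np.where(arr == "xyz")[0] + 1

# Given a list of indices, array_split cuts the array at exactly those positions.
chunks = np.array_split(arr, cut_points)

# One row per chunk; the trailing chunk holds only padding and is dropped.
result = pd.DataFrame(chunks).dropna(how="all")

print(cut_points)
print(result)
```

Because the cut falls after each xyz, every row ends with the terminator, and anything left over after the last xyz becomes a shorter final row.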

answered Nov 03 '22 05:11 by ALollz