 

Pandas Column mathematical operations No error no answer

I am trying to perform some simple mathematical operations on the files below.

The columns in master_ids.csv are dynamic in nature: the number of columns will increase from time to time, so we cannot assume a fixed last column.

master_ids.csv : Before any pre-processing

Ids,ref0 #the columns increase dynamically
1234,1000
8435,5243
2341,563
7352,345

master_count.csv : Before any processing

Ids,Name,lat,lon,ref1
1234,London,40.4,10.1,500
8435,Paris,50.5,20.2,400
2341,NewYork,60.6,30.3,700
7352,Japan,70.7,80.8,500
1234,Prague,40.4,10.1,100
8435,Berlin,50.5,20.2,200
2341,Austria,60.6,30.3,500
7352,China,70.7,80.8,300

master_ids.csv : after one pre-processing

Ids,ref,00:30:00
1234,1000,500
8435,5243,300
2341,563,400
7352,345,500

master_count.csv: expected Output (Append/merge)

Ids,Name,lat,lon,ref1,00:30:00
1234,London,40.4,10.1,500,750
8435,Paris,50.5,20.2,400,550
2341,NewYork,60.6,30.3,700,900
7352,Japan,70.7,80.8,500,750
1234,Prague,40.4,10.1,100,350
8435,Berlin,50.5,20.2,200,350
2341,Austria,60.6,30.3,500,700
7352,China,70.7,80.8,300,750

E.g. Ids 1234 appears 2 times, so its value at the current time (00:30:00), which is 500, is divided by the count of its occurrences and then added to the corresponding ref1 values, creating a new column named after the current time.
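The arithmetic for one Ids can be spelled out as a quick sketch (values taken from the tables above):

```python
# worked example for Ids 1234 at time 00:30:00
current_value = 500   # 00:30:00 value for Ids 1234 in master_ids.csv
occurrences = 2       # Ids 1234 appears twice in master_count.csv
share = current_value / occurrences  # 250.0

print(500 + share)  # London (ref1=500) -> 750.0
print(100 + share)  # Prague (ref1=100) -> 350.0
```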

master_ids.csv : After another pre-processing

Ids,ref,00:30:00,00:45:00
1234,1000,500,100
8435,5243,300,200
2341,563,400,400
7352,345,500,600

master_count.csv: expected output after another execution (Merge/append)

Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750,550
8435,Paris,50.5,20.2,400,550,500
2341,NewYork,60.6,30.3,700,900,900
7352,Japan,70.7,80.8,500,750,800
1234,Prague,40.4,10.1,100,350,150
8435,Berlin,50.5,20.2,200,350,300
2341,Austria,60.6,30.3,500,700,700
7352,China,70.7,80.8,300,750,600

So here the current time is 00:45:00; we divide the current-time value by the count of Ids occurrences and add it to the corresponding ref1 values, creating a new column named after the new current time.

Program: By Jianxun Li

import pandas as pd
import numpy as np

csv_file1 = '/Data_repository/master_ids.csv'
csv_file2 = '/Data_repository/master_count.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

# df1 and df2 have a duplicated column, so use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])

# do the division by the number of occurrences of each Ids
# and add the result to the base column for every time-series column
def my_func(group):
    num_obs = len(group)
    # process with column name after next timeseries (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)

The program executes with no errors but produces no output. I need some suggestions for fixing it, please.
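One common reason for "no errors, no output" is that the script computes `result` but never prints or saves it; unlike the interactive shell, a script does not echo expressions. A minimal, self-contained sketch (the DataFrame and output filename here are stand-ins, not from the original files):

```python
import pandas as pd

# stand-in for the `result` DataFrame produced by the groupby above
result = pd.DataFrame({'Ids': [1234, 8435],
                       '00:30:00': [750, 550]}).set_index('Ids')

# in a script, nothing is displayed unless you explicitly print or write it
print(result)
result.to_csv('master_count_updated.csv')  # hypothetical output filename
```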

Sitz Blogz, asked Jul 01 '15


3 Answers

This program assumes that both master_counts.csv and master_ids.csv are updated over time, and it should be robust to the timing of those updates: it produces correct results if run multiple times on the same update or if an update is missed.

import pandas as pd

# this program updates (and replaces) the original master_counts.csv with data
# in master_ids.csv, so we only want the first 5 columns when we read it in
master_counts = pd.read_csv('master_counts.csv').iloc[:,:5]

# this file is assumed to be periodically updated with the addition of new columns
master_ids = pd.read_csv('master_ids.csv')

for i in range( 2, len(master_ids.columns) ):
    master_counts = master_counts.merge( master_ids.iloc[:,[0,i]], on='Ids' )
    count = master_counts.groupby('Ids')['ref1'].transform('count')
    master_counts.iloc[:,-1] = master_counts['ref1'] + master_counts.iloc[:,-1]/count

master_counts.to_csv('master_counts.csv',index=False)

%more master_counts.csv
Ids,Name,lat,lon,ref1,00:30:00,00:45:00
1234,London,40.4,10.1,500,750.0,550.0
1234,Prague,40.4,10.1,100,350.0,150.0
8435,Paris,50.5,20.2,400,550.0,500.0
8435,Berlin,50.5,20.2,200,350.0,300.0
2341,NewYork,60.6,30.3,700,900.0,900.0
2341,Austria,60.6,30.3,500,700.0,700.0
7352,Japan,70.7,80.8,500,750.0,800.0
7352,China,70.7,80.8,300,550.0,600.0
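The key step in this answer is `transform('count')`, which, unlike a plain `count()`, broadcasts each group's size back onto every row of that group so it can be used in row-wise arithmetic. A small illustration with uneven group sizes (toy data, not the question's):

```python
import pandas as pd

# toy frame: Ids 1234 occurs three times, Ids 8435 once
df = pd.DataFrame({'Ids': [1234, 1234, 1234, 8435],
                   'ref1': [500, 100, 200, 400]})

# each row receives the size of its own group
count = df.groupby('Ids')['ref1'].transform('count')
print(count.tolist())  # [3, 3, 3, 1]
```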
JohnE, answered Oct 25 '22


import pandas as pd
import numpy as np

csv_file1 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/master_lac_Test.csv'
csv_file2 = '/home/Jian/Downloads/stack_flow_bundle/Data_repository/lat_lon_master.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

Out[53]: 
      00:00:00  00:30:00  00:45:00
Ids                               
1234      1000       500       100
8435      5243       300       200
2341       563       400       400
7352       345       500       600

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

Out[81]: 
         Name   lat   lon  00:00:00
Ids                                
1234   London  40.4  10.1       500
1234   Prague  40.4  10.1       500
2341  NewYork  60.6  30.3       700
2341  Austria  60.6  30.3       700
7352    Japan  70.7  80.8       500
7352    China  70.7  80.8       500
8435    Paris  50.5  20.2       400
8435   Berlin  50.5  20.2       400

# df1 and df2 have a duplicated column 00:00:00, so use df1 without its 1st column
temp = df2.join(df1.iloc[:, 1:])



Out[55]: 
         Name   lat   lon  00:00:00  00:30:00  00:45:00
Ids                                                    
1234   London  40.4  10.1       500       500       100
1234   Prague  40.4  10.1       500       500       100
2341  NewYork  60.6  30.3       700       400       400
2341  Austria  60.6  30.3       700       400       400
7352    Japan  70.7  80.8       500       500       600
7352    China  70.7  80.8       500       500       600
8435    Paris  50.5  20.2       400       300       200
8435   Berlin  50.5  20.2       400       300       200

# do the division by the number of occurrences of each Ids
# and add column 00:00:00
def my_func(group):
    num_obs = len(group)
    # process with column name after 00:30:00 (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group



result = temp.groupby(level='Ids').apply(my_func)

Out[104]: 
         Name   lat   lon  00:00:00  00:30:00  00:45:00
Ids                                                    
1234   London  40.4  10.1       500       750       550
1234   Prague  40.4  10.1       500       750       550
2341  NewYork  60.6  30.3       700       900       900
2341  Austria  60.6  30.3       700       900       900
7352    Japan  70.7  80.8       500       750       800
7352    China  70.7  80.8       500       750       800
8435    Paris  50.5  20.2       400       550       500
8435   Berlin  50.5  20.2       400       550       500
Jianxun Li, answered Oct 25 '22


My suggestion is to reformat your data so that it's like this:

Ids,ref0,time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None

Then after your "first preprocess" it will become like this:

Ids,ref0,time,ref1
1234,1000,None,None
8435,5243,None,None
2341,563,None,None
7352,345,None,None
1234,1000,00:30:00,500
8435,5243,00:30:00,300
2341,563,00:30:00,400
7352,345,00:30:00,500

. . . and so on. The idea is that you should make a single column to hold the time information, and then for each preprocess, insert the new data into new rows, and give those rows a value in the time column indicating what time period they come from. You may or may not want to keep the initial rows with "None" in this table; maybe you just want to start with the "00:30:00" values and keep the "master ids" in a separate file.

I haven't totally followed exactly how you're computing the new ref1 values, but the point is that this approach is likely to greatly simplify your life. In general, instead of adding an unbounded number of new columns, it is much nicer to add a single new column whose values are what you would otherwise have used as headers for the open-ended new columns.
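The wide-to-long reshape this answer recommends can be done directly with pandas `melt`; a minimal sketch on toy data (column and value names here are assumptions, not from the original files):

```python
import pandas as pd

# toy wide frame: one column per time period
wide = pd.DataFrame({'Ids': [1234, 8435],
                     '00:30:00': [500, 300],
                     '00:45:00': [100, 200]})

# reshape to one row per (Ids, time) pair, with a single value column,
# so new time periods become new rows instead of new columns
long = wide.melt(id_vars='Ids', var_name='time', value_name='value')
print(long)
```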

BrenBarn, answered Oct 25 '22