Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Importing financial data into Python Pandas using read_csv

I have a .csv with the following structure:

date_begin,date_end,name,name_code,active_accounts,transaction_amount,transaction_count
1/1/2008,1/31/2008,Name_1,1001,"123,456","$7,890,123.45","67,890"
2/1/2008,2/29/2008,Name_1,1001,"43,210","$987,654.32","109,876"
3/1/2008,3/31/2008,Name_1,1001,"485,079","$1,265,789,433.98","777,888"
...
12/1/2008,12/31/2008,Name_1,1001,"87,543","$432,098,987","87,987"
1/1/2008,1/31/2008,Name_2,1002,"268,456","$890,123.45","97,890"
2/1/2008,2/29/2008,Name_2,1002,"53,210","$987,654.32","109,876"
...
etc

I am trying to read them into into a pandas dataframe by using the following code:

import pandas as pd

data = pd.read_csv('my_awesome_csv.csv'),parse_dates=[[0,1]],
                   infer_datetime_format=True)

This works just fine except that I would like to control the data types in each column. When I run the following code in the interpreter I discover that the numbers in quotes do not get recognized as numbers, either dollars or otherwise.

In [10]: data.dtypes
Out[10]: 
date_begin_date_end       object
name                      object
name_code                  int64
active_accounts           object  # Problem, I want this to be a number
transaction_amount        object  # Ditto, I want this to be a number (it's a dollar amount)
transaction_count         object  # Still a number!
dtype: object

I have done some snooping around in the Pandas csv documentation but haven't found what I'm looking for about declaring types that are amounts when they are saved as strings with commas and dollar signs in the csv. My ultimate goal here is to be able to do some arithmetic operations on the values in these columns.

Any thoughts?

like image 999
invoker Avatar asked Dec 17 '14 20:12

invoker


People also ask

What does read_csv do in pandas?

The pandas. read_csv is used to load a CSV file as a pandas dataframe. In this article, you will learn the different features of the read_csv function of pandas apart from loading the CSV file and the parameters which can be customized to get better output from the read_csv function.

Can pandas read CSV?

The pandas function read_csv() reads in values, where the delimiter is a comma character. You can export a file into a csv file in any modern office suite including Google Sheets.


1 Answers

You could use vectorized string methods to parse those columns after the call to read_csv:

import pandas as pd
import decimal
D = decimal.Decimal

data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True)

for col in ('active_accounts', 'transaction_count'):
    data[col] = data[col].str.replace(r',', '').astype(int)

data['transaction_amount'] = (data['transaction_amount']
                              .str.replace(r'[^-+\d.]', '').astype(D))


print(data.dtypes)
# date_begin_date_end    object
# name                   object
# name_code               int64
# active_accounts         int64
# transaction_amount     object
# transaction_count       int64
# dtype: object

print(data)

yields

    date_begin_date_end    name  name_code  active_accounts  \
0    1/1/2008 1/31/2008  Name_1       1001           123456   
1    2/1/2008 2/29/2008  Name_1       1001            43210   
2    3/1/2008 3/31/2008  Name_1       1001           485079   
3  12/1/2008 12/31/2008  Name_1       1001            87543   
4    1/1/2008 1/31/2008  Name_2       1002           268456   
5    2/1/2008 2/29/2008  Name_2       1002            53210   

  transaction_amount  transaction_count  
0         7890123.45              67890  
1          987654.32             109876  
2      1265789433.98             777888  
3          432098987              87987  
4          890123.45              97890  
5          987654.32             109876  

PS. read_csv does have a converters parameter with which you could supply a function to parse the problematic columns. These functions are called once for each string. If you have a lot of rows, that could be require a lot of Python function calls. Handling the columns using vectorized string methods, as shown above should be far quicker.

import pandas as pd
import re
import decimal
D = decimal.Decimal

def make_parser(cls):
    def parse_commas(text):
        return cls(re.sub(r'[^-+\d.]', '', text))
    return parse_commas

to_int = make_parser(int)
to_decimal = make_parser(D)

data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True
                   , converters={4: to_int, 5: to_decimal, 6: to_int})

print(data)

yields

    date_begin_date_end    name  name_code  active_accounts  \
0    1/1/2008 1/31/2008  Name_1       1001           123456   
1    2/1/2008 2/29/2008  Name_1       1001            43210   
2    3/1/2008 3/31/2008  Name_1       1001           485079   
3  12/1/2008 12/31/2008  Name_1       1001            87543   
4    1/1/2008 1/31/2008  Name_2       1002           268456   
5    2/1/2008 2/29/2008  Name_2       1002            53210   

  transaction_amount  transaction_count  
0         7890123.45              67890  
1          987654.32             109876  
2      1265789433.98             777888  
3          432098987              87987  
4          890123.45              97890  
5          987654.32             109876  

and the values in the transaction_amount column are decimal.Decimals:

In [64]: data.loc[0, 'transaction_amount']
Out[64]: Decimal('7890123.45')
like image 54
unutbu Avatar answered Oct 12 '22 00:10

unutbu