Importing financial data into Python Pandas using read_csv

Tags:

I have a .csv with the following structure:

date_begin,date_end,name,name_code,active_accounts,transaction_amount,transaction_count
1/1/2008,1/31/2008,Name_1,1001,"123,456","$7,890,123.45","67,890"
2/1/2008,2/29/2008,Name_1,1001,"43,210","$987,654.32","109,876"
3/1/2008,3/31/2008,Name_1,1001,"485,079","$1,265,789,433.98","777,888"
...
12/1/2008,12/31/2008,Name_1,1001,"87,543","$432,098,987","87,987"
1/1/2008,1/31/2008,Name_2,1002,"268,456","$890,123.45","97,890"
2/1/2008,2/29/2008,Name_2,1002,"53,210","$987,654.32","109,876"
...
etc

I am trying to read them into into a pandas dataframe by using the following code:

import pandas as pd

data = pd.read_csv('my_awesome_csv.csv'),parse_dates=[[0,1]],
                   infer_datetime_format=True)

This works just fine except that I would like to control the data types in each column. When I run the following code in the interpreter I discover that the numbers in quotes do not get recognized as numbers, either dollars or otherwise.

In [10]: data.dtypes
Out[10]: 
date_begin_date_end       object
name                      object
name_code                  int64
active_accounts           object  # Problem, I want this to be a number
transaction_amount        object  # Ditto, I want this to be a number (it's a dollar amount)
transaction_count         object  # Still a number!
dtype: object

I have done some snooping around in the Pandas csv documentation but haven't found what I'm looking for about declaring types that are amounts when they are saved as strings with commas and dollar signs in the csv. My ultimate goal here is to be able to do some arithmetic operations on the values in these columns.

Any thoughts?

999

asked Dec 17 '14 20:12

invoker

1 Answers

You could use vectorized string methods to parse those columns after the call to read_csv:

import pandas as pd
import decimal
D = decimal.Decimal

data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True)

for col in ('active_accounts', 'transaction_count'):
    data[col] = data[col].str.replace(r',', '').astype(int)

data['transaction_amount'] = (data['transaction_amount']
                              .str.replace(r'[^-+\d.]', '').astype(D))


print(data.dtypes)
# date_begin_date_end    object
# name                   object
# name_code               int64
# active_accounts         int64
# transaction_amount     object
# transaction_count       int64
# dtype: object

print(data)

yields

    date_begin_date_end    name  name_code  active_accounts  \
0    1/1/2008 1/31/2008  Name_1       1001           123456   
1    2/1/2008 2/29/2008  Name_1       1001            43210   
2    3/1/2008 3/31/2008  Name_1       1001           485079   
3  12/1/2008 12/31/2008  Name_1       1001            87543   
4    1/1/2008 1/31/2008  Name_2       1002           268456   
5    2/1/2008 2/29/2008  Name_2       1002            53210   

  transaction_amount  transaction_count  
0         7890123.45              67890  
1          987654.32             109876  
2      1265789433.98             777888  
3          432098987              87987  
4          890123.45              97890  
5          987654.32             109876

PS. read_csv does have a converters parameter with which you could supply a function to parse the problematic columns. These functions are called once for each string. If you have a lot of rows, that could be require a lot of Python function calls. Handling the columns using vectorized string methods, as shown above should be far quicker.

import pandas as pd
import re
import decimal
D = decimal.Decimal

def make_parser(cls):
    def parse_commas(text):
        return cls(re.sub(r'[^-+\d.]', '', text))
    return parse_commas

to_int = make_parser(int)
to_decimal = make_parser(D)

data = pd.read_csv('data', parse_dates=[[0,1]], infer_datetime_format=True
                   , converters={4: to_int, 5: to_decimal, 6: to_int})

print(data)

yields

    date_begin_date_end    name  name_code  active_accounts  \
0    1/1/2008 1/31/2008  Name_1       1001           123456   
1    2/1/2008 2/29/2008  Name_1       1001            43210   
2    3/1/2008 3/31/2008  Name_1       1001           485079   
3  12/1/2008 12/31/2008  Name_1       1001            87543   
4    1/1/2008 1/31/2008  Name_2       1002           268456   
5    2/1/2008 2/29/2008  Name_2       1002            53210   

  transaction_amount  transaction_count  
0         7890123.45              67890  
1          987654.32             109876  
2      1265789433.98             777888  
3          432098987              87987  
4          890123.45              97890  
5          987654.32             109876

and the values in the transaction_amount column are decimal.Decimals:

In [64]: data.loc[0, 'transaction_amount']
Out[64]: Decimal('7890123.45')

answered Oct 12 '22 00:10

unutbu

Related questions
                            
                                Why doesn't join() automatically convert its arguments to strings? When would you ever not want them to be strings?
                            
                                Python: How to import from an __init__.py file?
                            
                                How to retrieve Facebook friend's information with Python-Social-auth and Django
                            
                                Pandas dataset into an array for modelling in Scikit-Learn
                            
                                converting from numpy array of one type to another by re-interpreting raw bytes
                            
                                Django - Serving MEDIA/uploaded files in production
                            
                                Random int without importing 'random'
                            
                                What's the difference between Series.sort() and Series.order()?
                            
                                Using pyDub to chop up a long audio file
                            
                                How can I return python's "import this" as a string?
                            
                                Cython says buffer types only allowed as function local variables even for ndarray.copy()
                            
                                Python Requests HTTPConnectionPool and Max retries exceeded with url
                            
                                Iterate over both text and elements in lxml etree
                            
                                multiprocessing breaks in interactive mode
                            
                                itertools.product slower than nested for loops
                            
                                Matplotlib pyplot axes formatter
                            
                                How to join different tables in sqlalchemy query having same column names?
                            
                                redis.exceptions.ConnectionError: Error -2 connecting to localhost:6379. Name or service not known
                            
                                Combining inplace filtering and the setting of encoding in the fileinput module
                            
                                How to catch the clipboard event (onChangeClipboard equivalent) from any application in Python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Importing financial data into Python Pandas using read_csv

Tags:

python

import

pandas

csv

invoker

People also ask

1 Answers

unutbu

Recent Activity

Donate For Us