I have a data file (CSV) with Nilsimsa hash values. Some of them are up to 80 characters long. I want to read them in Python for data analysis tasks. Is there a way to import the data into Python without information loss?
EDIT: I have tried the implementations proposed in the comments, but they do not work for me.
Example data in csv file would be: 77241756221441762028881402092817125017724447303212139981668021711613168152184106
Start with a simple text file to read in, just one variable and one row.
%more foo.txt
x
77241756221441762028881402092817125017724447303212139981668021711613168152184106
In [268]: df=pd.read_csv('foo.txt')
Pandas will read it in as a string because the value is too big to store in a core number type like int64 or float64. The information is all there, though; you didn't lose anything.
In [269]: df.x
Out[269]:
0 7724175622144176202888140209281712501772444730...
Name: x, dtype: object
In [270]: type(df.x[0])
Out[270]: str
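If you would rather not rely on type inference, here is a minimal sketch that asks for strings explicitly. It assumes the same one-column layout as foo.txt above and uses an in-memory buffer (io.StringIO) in place of the file so it runs on its own; the names raw and csv_text are just for illustration.

```python
import io

import pandas as pd

# Same content as foo.txt above, held in memory so the example is self-contained.
raw = "77241756221441762028881402092817125017724447303212139981668021711613168152184106"
csv_text = "x\n" + raw + "\n"

# dtype={"x": str} tells read_csv to keep column x as text instead of
# trying to parse it as a number.
df = pd.read_csv(io.StringIO(csv_text), dtype={"x": str})

value = df["x"].iloc[0]
print(type(value))
print(value == raw)  # every digit survives the round trip
```

If the hashes are only used as identifiers (comparing, grouping, joining), leaving them as strings like this is probably enough and no conversion is needed.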
You can also use plain Python to treat it as a number. Recall the caveats from the links in the comments: this isn't going to be as fast as numpy and pandas operations on a whole column stored as int64, because it uses the more flexible but slower object mode.
You can change a column to be stored as arbitrary-precision Python integers (long in Python 2, plain int in Python 3) like this. Note that the dtype is still object, because everything except the core numpy types (int32, int64, float64, etc.) is stored as objects.
In [271]: df.x = df.x.map(int)
You can then more or less treat it like a number.
In [272]: df.x * 2
Out[272]:
0 1544835124428835240577628041856342500354488946...
Name: x, dtype: object
You'll have to do some formatting to see the whole number, or go the numpy route, which shows the whole number by default.
In [273]: df.x.values * 2
Out[273]: array([ 154483512442883524057762804185634250035448894606424279963336043423226336304368212L], dtype=object)
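For reference, the steps above can be sketched end to end in Python 3, where the built-in int is already arbitrary precision (so there is no trailing L in the output). This is only a sketch under the same one-column assumption as foo.txt; the in-memory buffer and the names raw, doubled and full_text are illustrative.

```python
import io

import pandas as pd

raw = "77241756221441762028881402092817125017724447303212139981668021711613168152184106"
csv_text = "x\n" + raw + "\n"

# Read as text first, then convert each value to a Python int.
df = pd.read_csv(io.StringIO(csv_text), dtype={"x": str})
df["x"] = df["x"].map(int)  # the column stays dtype object; elements are Python ints

# Arithmetic runs element by element on the Python ints, so no precision is lost.
doubled = df["x"] * 2

# Option 1: format a single value yourself.
full_text = str(doubled.iloc[0])
print(full_text)

# Option 2: lift the column-width limit just for this print.
with pd.option_context("display.max_colwidth", None):
    print(doubled)
```

Either option avoids the truncated "..." display; the underlying values are exact in both cases.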