I have a tab-separated file with 1 billion lines like these (imagine 200 columns instead of 3):
abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232
I want to create a dictionary where the string in the first column is the key and the rest are the values. I've been doing it like this, but it's computationally expensive:
import io

dictionary = {}
with io.open('bigfile', 'r') as fin:
    for line in fin:
        kv = line.strip().split()
        k, v = kv[0], kv[1:]
        dictionary[k] = list(map(float, v))
How else can I get the desired dictionary? Actually, a numpy array would be more appropriate than a list of floats for the value.
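For example, something like this, with numpy arrays as the values (just to illustrate what I mean):

import numpy as np

dictionary = {}
with open('bigfile') as fin:
    for line in fin:
        parts = line.split()
        # first field is the key; the remaining fields become a float array
        dictionary[parts[0]] = np.array(parts[1:], dtype=float)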
You can use pandas to load the data into a df, construct a new df as desired, and then call to_dict:
In [99]:
t="""abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None)
df = pd.DataFrame(columns = df[0], data = df.ix[:,1:].values)
df.to_dict()
Out[99]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}
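Since you want numpy arrays as the values, you can also skip the nested dicts that to_dict produces and zip the key column with the value rows directly. A minimal sketch (raw and d are names introduced here, not from the code above):

import io
import pandas as pd

t = """abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""

raw = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None)
# pair first-column keys with the remaining columns, row by row
d = dict(zip(raw[0], raw.iloc[:, 1:].to_numpy()))
# d['abc'] is then a 1-D float array: array([-0.123, 0.6524, 0.325])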
EDIT
A more dynamic method, and one which reduces the need to construct a temporary df:
In [121]:
t="""abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""
# determine the number of cols; we'll use this in usecols
col_len = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, nrows=1).shape[1]
col_len
# read the first col; we'll use this in names
cols = pd.read_csv(io.StringIO(t), sep=r'\s+', usecols=[0], header=None)[0].values
# now read and construct the df using the determined usecols and names from above
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, usecols=list(range(1, col_len)), names=cols)
df.to_dict()
Out[121]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}
Further update
Actually, you don't need the first read here: len(cols) is the number of rows, and in this example the data happens to be square (as many value columns as rows), so it can stand in for the column count. For non-square data (e.g. the OP's 1 billion rows by 200 columns) you would still need the col_len read above:
In [128]:
t="""abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232"""
cols = pd.read_csv(io.StringIO(t), sep=r'\s+', usecols=[0], header=None)[0].values
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, usecols=list(range(1, len(cols)+1)), names=cols)
df.to_dict()
Out[128]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}
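For a file with a billion lines, a single read_csv call may not fit in memory. A sketch using read_csv's chunksize parameter to stream the file and build the dict of arrays incrementally (the chunk size of 1,000,000 is an arbitrary choice):

import pandas as pd

d = {}
# read a million rows at a time instead of loading the whole file
for chunk in pd.read_csv('bigfile', sep='\t', header=None, chunksize=1000000):
    d.update(zip(chunk[0], chunk.iloc[:, 1:].to_numpy()))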
You could make use of the numpy.genfromtxt() function if you specify the number of columns:
import numpy as np

a = np.genfromtxt('bigfile.csv', dtype=str, usecols=0)
b = np.genfromtxt('bigfile.csv', dtype=float, delimiter='\t', usecols=range(1, 4))
#                                                  enter the # of cols here ^
d = dict(zip(a, b.tolist()))  # if you want numpy arrays, just remove .tolist()
print(d)
Output:
{'abc': [-0.123, 0.6524, 0.325], 'bar': [0.23123, -0.123124, -0.1232], 'foo': [-0.9808, 0.874, -0.2341]}
Note: to programmatically find the number of cols you could do:

with open('bigfile.csv', 'r') as f:
    num_cols = len(f.readline().split())

And then use num_cols for the usecols parameter.
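Putting the two together, a sketch that derives the column count from the first line and feeds it to both genfromtxt calls (assuming every row has the same number of fields):

import numpy as np

with open('bigfile.csv', 'r') as f:
    num_cols = len(f.readline().split())

keys = np.genfromtxt('bigfile.csv', dtype=str, usecols=0)
vals = np.genfromtxt('bigfile.csv', dtype=float, delimiter='\t', usecols=range(1, num_cols))
d = dict(zip(keys, vals))  # values stay as numpy arrays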