 

Read a tab separated file with first column as key and the rest as values

I have a tab separated file with 1 billion lines like these (imagine 200 columns instead of 3):

abc -0.123  0.6524  0.325
foo -0.9808 0.874   -0.2341 
bar 0.23123 -0.123124   -0.1232

I want to create a dictionary where the string in the first column is the key and the rest are the values. I've been doing it like this but it's computationally expensive:

import io

dictionary = {}

# stream the file line by line: the first field is the key, the remaining fields are float values
with io.open('bigfile', 'r') as fin:
    for line in fin:
        kv = line.strip().split()
        k, v = kv[0], kv[1:]
        dictionary[k] = list(map(float, v))

How else can I get the desired dictionary? Actually, a numpy array would be more appropriate than a list of floats for the values.
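
For reference, a minimal sketch of the same loop storing a numpy array per key instead of a list (this only illustrates the storage, not a speedup):

import numpy as np

dictionary = {}

# same loop as above, but keep the values as a float numpy array per key
with open('bigfile', 'r') as fin:
    for line in fin:
        kv = line.strip().split()
        dictionary[kv[0]] = np.array(kv[1:], dtype=float)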

asked Apr 28 '15 by alvas


2 Answers

You can use pandas to load the data into a DataFrame, construct a new DataFrame in the desired shape, and then call to_dict:

In [99]:

t="""abc -0.123  0.6524  0.325
foo -0.9808 0.874   -0.2341 
bar 0.23123 -0.123124   -0.1232"""
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None)
df = pd.DataFrame(columns = df[0], data = df.ix[:,1:].values)
df.to_dict()
Out[99]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}

EDIT

A more dynamic method, and one which avoids constructing a temporary DataFrame:

In [121]:

t="""abc -0.123  0.6524  0.325
foo -0.9808 0.874   -0.2341 
bar 0.23123 -0.123124   -0.1232"""
# determine the number of cols, we'll use this in usecols
col_len = pd.read_csv(io.StringIO(t), sep=r'\s+', nrows=1).shape[1]
col_len
# read the first col, we'll use this in names
cols = pd.read_csv(io.StringIO(t), sep=r'\s+', usecols=[0], header=None)[0].values
# now read and construct the df using the determined usecols and names from above
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, usecols=list(range(1, col_len)), names=cols)
df.to_dict()
Out[121]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}

Further update

Actually you don't need the first read; the number of value columns can be derived from the length of the first column (the two happen to match in this sample):

In [128]:

t="""abc -0.123  0.6524  0.325
foo -0.9808 0.874   -0.2341 
bar 0.23123 -0.123124   -0.1232"""
cols = pd.read_csv(io.StringIO(t), sep=r'\s+', usecols=[0], header=None)[0].values
df = pd.read_csv(io.StringIO(t), sep=r'\s+', header=None, usecols=list(range(1, len(cols) + 1)), names=cols)
df.to_dict()
Out[128]:
{'abc': {0: -0.12300000000000001,
  1: -0.98080000000000001,
  2: 0.23123000000000002},
 'bar': {0: 0.32500000000000001, 1: -0.2341, 2: -0.1232},
 'foo': {0: 0.65239999999999998, 1: 0.87400000000000011, 2: -0.123124}}
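
Note that to_dict() above returns nested {column: {row index: value}} dicts. If you want the shape from the question, where each first-column string maps to its own row of numbers, a minimal sketch (assuming pandas; 'bigfile' and the tab delimiter come from the question):

import pandas as pd

# read the file with the first column as the index (the file has no header row)
df = pd.read_csv('bigfile', sep='\t', header=None, index_col=0)

# map each first-column string to its own row of values as a numpy array
dictionary = dict(zip(df.index, df.to_numpy()))

# or, for plain lists of floats:
# dictionary = df.T.to_dict('list')
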
answered by EdChum


You could make use of the numpy.genfromtxt() function if you specify the number of columns:

import numpy as np

a = np.genfromtxt('bigfile.csv', dtype=str, usecols=(0,))
b = np.genfromtxt('bigfile.csv', dtype=float, delimiter='\t',
                  usecols=range(1, 4))  # enter the number of columns here

d = dict(zip(a, b.tolist()))  # if you want numpy arrays as values, just remove .tolist()

print(d)

Output:

{'abc': [-0.123, 0.6524, 0.325], 'bar': [0.23123, -0.123124, -0.1232], 'foo': [-0.9808, 0.874, -0.2341]}

Note: To programmatically find the number of columns you could do:

with open('bigfile.csv', 'r') as f:
    num_cols = len(f.readline().split())

And then use num_cols for the usecols parameter.
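
Putting it together, a sketch that feeds the detected num_cols into the genfromtxt calls above (same file name and delimiter as in the answer; the rest follows from it):

import numpy as np

# detect the number of columns from the first line
with open('bigfile.csv', 'r') as f:
    num_cols = len(f.readline().split())

a = np.genfromtxt('bigfile.csv', dtype=str, usecols=(0,))
b = np.genfromtxt('bigfile.csv', dtype=float, delimiter='\t', usecols=range(1, num_cols))

d = dict(zip(a, b))  # values are numpy array rows; use b.tolist() for plain lists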

answered by logic