Numpy: loadtxt() with variable number of columns



I have a file of tab-separated values where the first half of the file has 3 columns and N rows and the second half has 2 columns and M rows. I need to convert such a file into two separate arrays: one with N rows and 3 columns, and one with M rows and 2 columns.


   6.7900209022264466       -3.8259897286289504        13.563976248832137     
   1.5334543760683907        12.723711617874176        1.5148291755004299     
   2.4282763900233522        9.1305022788201136       -3.1003673775485394     
  -6.5344717544805586E-002  -12.487743380186622        2.6928902187606480     
   8.9067951331740804        13.403331728374390      -0.58045132774289632     
  -11.842481592786449       -5.7083783211328551        1.9526760053685255     
  -10.240286781275808        13.204312088815593        4.4856524683466175     
  -4.6690658488407504       -6.2809313597959449        7.4378900284937082     
  -9.5874077836478282       -8.6799071183782903       -1.8203838010218165     
  0.62588896716878051       -5.4614995295716540        11.166650096421838     
           0        4173
           0        1998
           0         611
           0        8606
           1        6912
           1        9671
           1        7993
           1        8513
           2        5556
           2        4422
           2        3047

I cannot simply use loadtxt() to read such a file because this would result in the error ValueError: Wrong number of columns at line ...

Is there a way to use loadtxt() or some similar function to read such a file?

I would like to avoid using readlines() and split() and then convert to float, because this would make the code slower (I think...) and longer. I have also tried pandas.read_csv(), but I need an array as output.


For now, following hpaulj's suggestion, I'm doing it like this using readlines() and split():

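    import numpy as np

    with open(filename, "r") as f:
        all_data = [line.split() for line in f]
    # Build explicit lists: in Python 3, map() returns an iterator.
    a = np.array([[float(v) for v in row] for row in all_data[:N]])
    # The second block starts at row N (N+1 would skip its first row).
    b = np.array([[int(v) for v in row] for row in all_data[N:]])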

It is actually pretty fast, but I would still like to know if someone knows a faster (and maybe simpler) method.

1 Answer

I would recommend using pandas.read_csv() and then getting the NumPy array from the DataFrame, either with the .values attribute or with .to_numpy():

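    import pandas as pd

    # Whitespace-separated, no header row; naming three columns pads the
    # two-column rows with NaN in the last column.
    df = pd.read_csv("filename.txt", sep=r"\s+", header=None, names=[0, 1, 2])
    array_values = df.values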

If you just use .values, the rows from the second half of the file will have nan in the third column, where a value is missing. You can determine N and M by checking which rows contain nan.
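
As a minimal sketch of that nan check, something like the following could split the result into the two arrays. It assumes the three-column rows never contain a genuine nan in their last column. The helper name split_ragged and the shortened sample values are only illustrative and are not taken from the original file.

```python
import io

import numpy as np
import pandas as pd


def split_ragged(source):
    """Split a 3-column block followed by a 2-column block into two arrays.

    Hypothetical helper: `source` is a path or file-like object.
    """
    # Whitespace-delimited, no header row. Naming three columns means the
    # two-column rows are padded with NaN in the last column.
    df = pd.read_csv(source, sep=r"\s+", header=None, names=[0, 1, 2])

    # Assumption: a NaN in the last column only ever marks a two-column row.
    short = df[2].isna().to_numpy()

    a = df.loc[~short].to_numpy(dtype=float)           # N rows, 3 columns
    b = df.loc[short, [0, 1]].to_numpy().astype(int)   # M rows, 2 columns
    return a, b


# Shortened, made-up sample in the same layout as the file in the question.
sample = io.StringIO(
    "   6.79   -3.82   13.56\n"
    "   1.53   12.72    1.51\n"
    "      0    4173\n"
    "      1    6912\n"
    "      2    5556\n"
)
a, b = split_ragged(sample)
print(a.shape, b.shape)
```

Here N and M are simply a.shape[0] and b.shape[0], so neither has to be known in advance. Using a boolean mask rather than a single split index also means the result does not depend on the two blocks being contiguous.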

