
Trying to strip b' ' from my Numpy array's savetxt() representation

So I have what I feel is a very dumb problem.

I create an array from a file:

A1=np.loadtxt(file, dtype='a100')

I want to write that array after it's done processing to another file:

np.savetxt("Test.txt", A1, fmt='%s', delimiter=',')

Why is it writing out b'string'? I think I understand it's writing it out as bytes, but for the life of me I can't figure out how to write it out without the b''.

I know this is probably something incredibly easy I'm overlooking!

asked Dec 16 '14 18:12 by user2624599


1 Answer

A1 is loaded as an array of bytestrings. Python 3 uses unicode strings by default, so it displays bytestrings with a b'' prefix to mark them. That's normal with print. I'm a little surprised that it also does so during the file write.

In any case, this seems to do the trick:

A2=np.array([x.decode() for x in A1])
np.savetxt("Test.txt", A2, fmt='%s', delimiter=',')

A2 will have a dtype like dtype='<U100'.


My test array is:

array([b'one.com', b'two.url', b'three.four'], dtype='|S10')

loaded from a simple text file:

one.com
two.url
three.four

.decode is a bytestring method. [x.decode() for x in A1] works for a simple 1d array of bytestrings. If A1 is 2d, the iteration has to be done over all elements, not just the rows. And if A1 is a structured array, it has to be applied to the strings within the elements.
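
For the 2d case, one way to reach every element is to flatten the array, decode, and restore the shape. This is only a sketch; the array A2d is made-up sample data, not taken from the question.

```python
import numpy as np

# Hypothetical 2-D array of bytestrings (sample data for illustration).
A2d = np.array([[b'one.com', b'two.url'],
                [b'three.four', b'five.six']])

# ravel() visits every element, not just the rows;
# reshape() restores the original layout afterwards.
decoded = np.array([x.decode() for x in A2d.ravel()]).reshape(A2d.shape)

print(decoded.dtype.kind)  # 'U' marks a unicode string dtype
print(decoded)
```

The result can then be passed to savetxt with fmt='%s' like the 1d array above.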


Another possibility is to use a converter during load, so you get an array of (unicode) strings

In [508]: A1=np.loadtxt('urls.txt', dtype='U',
    converters={0:lambda x:x.decode()})
In [509]: A1
Out[509]: 
array(['one.com', 'two.url', 'three.four'], dtype='<U10')
In [510]: np.savetxt('test0.txt',A1,fmt='%s')
In [511]: cat test0.txt
one.com
two.url
three.four

The lib that contains loadtxt has a couple of converter functions, asbytes, asbytes_nested, and asstr. So converters could also be: converters={0:np.lib.npyio.asstr}.

genfromtxt handles this without converters:

 A1=np.genfromtxt('urls.txt', dtype='U')
 # array(['one.com', 'two.url', 'three.four'], dtype='<U10')
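
On NumPy 1.14 and later, loadtxt and genfromtxt also accept an encoding argument, which can stand in for the converter. A minimal sketch, assuming the file is UTF-8; the temporary file here is just a stand-in for urls.txt.

```python
import os
import tempfile
import numpy as np

# Write a small sample file (stand-in for urls.txt).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as f:
    f.write('one.com\ntwo.url\nthree.four\n')
    path = f.name

# With an explicit encoding, loadtxt decodes the text itself,
# so no per-column converter is needed.
A1 = np.loadtxt(path, dtype=str, encoding='utf-8')
os.remove(path)

print(A1.dtype.kind)  # 'U' marks a unicode string dtype
print(A1.tolist())
```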

To understand why savetxt saves unicode strings as we want, but prepends the b for bytestrings, we have to look at its code.

np.savetxt (running on py3) is essentially:

fh = open(fname, 'wb')
X = np.atleast_2d(X).T
# make a 'fmt' that matches the columns of X (with delimiters)
for row in X:
    fh.write(asbytes(format % tuple(row) + newline))

Looking at two sample strings (str and bytestr):

In [617]: asbytes('%s'%tuple(['one.two']))
Out[617]: b'one.two'

In [618]: asbytes('%s'%tuple([b'one.two']))
Out[618]: b"b'one.two'"

Writing to a 'wb' file removes that outer layer of b'', leaving the inner for the bytestring. It also explains why strings ('plain' py3 unicode) are written as 'latin1' strings to the file.
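
The same effect can be reproduced without NumPy, since it comes from plain %s formatting. A small sketch that imitates the format-then-encode step; the latin1 encoding mirrors the remark above and is not a copy of the savetxt source.

```python
# Formatting with %s calls str() on the value.
as_text = '%s' % ('one.two',)     # a str is inserted unchanged
as_bytes = '%s' % (b'one.two',)   # str() of a bytes object keeps the b'' wrapper

print(as_text)    # one.two
print(as_bytes)   # b'one.two'

# Encoding the formatted line for a binary file does not undo that wrapper.
line = (as_bytes + '\n').encode('latin1')
print(line)       # b"b'one.two'\n"
```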


You could write a bytestrings array directly, without savetxt. For example:

A0 = np.array([b'one.com', b'two.url', b'three.four'], dtype='|S10')
with open('test0.txt','wb') as f:
    for x in A0:
        f.write(x+b'\n')

cat test0.txt
    one.com
    two.url
    three.four

Unicode strings can also be written directly, producing the same file:

A1 = np.array(['one.com', 'two.url', 'three.four'], dtype='<U10')
with open('test1.txt','w') as f:
    for x in A1:
        f.write(x+'\n')

The default encoding for such a file is typically UTF-8 (it depends on the platform locale), the same as used by 'one.com'.encode(). For ASCII text like this, the effect is the same as what savetxt does:

with open('test1.txt','wb') as f:
    for x in A1:
        f.write(x.encode()+b'\n')
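
The choice between UTF-8 and latin1 only matters once non-ASCII characters appear. A quick check in plain Python; the accented name is a made-up sample.

```python
ascii_text = 'one.com'
accented = 'caf\u00e9.com'  # hypothetical non-ASCII sample

# ASCII characters map to the same single byte in both encodings.
same = ascii_text.encode('utf-8') == ascii_text.encode('latin1')

# A non-ASCII character takes two bytes in UTF-8 but one in latin1.
differs = accented.encode('utf-8') != accented.encode('latin1')

print(same, differs)  # True True
```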

np.char has .encode and .decode methods, which appear to operate iteratively on the elements of an array.

Thus

 np.char.decode(A0)   # convert |S10 to <U10, like [x.decode() for x in A0]
 np.char.encode(A1)   # convert <U10 to |S10

This works with multidimensional arrays

 np.savetxt('testm.txt',np.char.decode(A_bytes[:,None][:,[0,0]]),
     fmt='%s',delimiter=',  ')
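
A self-contained version of that multidimensional call, as a sketch: A_bytes is sample data, and the file is a temporary one that is read back to check the result.

```python
import os
import tempfile
import numpy as np

A_bytes = np.array([b'one.com', b'two.url', b'three.four'])
two_cols = A_bytes[:, None][:, [0, 0]]  # duplicate the column -> shape (3, 2)

# Temporary output file (stand-in for testm.txt).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)

# np.char.decode converts every element, whatever the array's shape.
np.savetxt(path, np.char.decode(two_cols), fmt='%s', delimiter=',')

with open(path) as f:
    lines = f.read().splitlines()
os.remove(path)

print(lines)
```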

With a structured array, np.char.decode has to be applied individually to each of the char fields.
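
One way to do that field by field is sketched below. The field names and sample records are hypothetical; the idea is to build a matching dtype with U fields in place of S fields and copy each field across.

```python
import numpy as np

# Hypothetical structured array with one bytestring field.
rec = np.array([(b'one.com', 1), (b'two.url', 2)],
               dtype=[('name', 'S10'), ('count', 'i4')])

# Same layout, but each S field becomes a U field of equal length.
new_descr = []
for name in rec.dtype.names:
    field = rec.dtype[name]
    if field.kind == 'S':
        new_descr.append((name, 'U%d' % field.itemsize))
    else:
        new_descr.append((name, field))

out = np.empty(rec.shape, dtype=new_descr)
for name in rec.dtype.names:
    col = rec[name]
    # Decode only the bytestring fields; copy the others unchanged.
    out[name] = np.char.decode(col) if col.dtype.kind == 'S' else col

print(out.dtype)
print(out)
```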

answered Oct 19 '22 12:10 by hpaulj