
How do I import a text file with no separators in Python, using NumPy?

How do I import a file with no separators?

I have a file named text.txt which contains 2 lines of text:

00000000011100000000000000000000
00000000011111110000000000000000

When I use

f = open("text.txt")
data = np.loadtxt(f)

I get

[ 1.11000000e+22 1.11111100e+22]

Using sep="" changes nothing.

I would like to get this result, in the form of many single digit integers:

[ [00000000011100000000000000000000]
[00000000011111110000000000000000] ]

Any help is appreciated.

UPDATE: Thank you all for the great answers and the many valid solutions to an awkward question.

tumultous_rooster asked Jan 17 '26 14:01



2 Answers

I'll take the statement "I would like to get this result, in the form of many single digit integers:" literally, and ignore the format of the sample that follows it (which appears to be just two integers, rather than many single digit integers). You can do that with genfromtxt by using the arguments delimiter=1 and dtype=int. When delimiter is an integer or a sequence of integers, the values are interpreted as the field widths of a file containing fixed-width fields of data.

For example:

In [15]: np.genfromtxt('text.txt', delimiter=1, dtype=int)
Out[15]: 
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Warren Weckesser answered Jan 20 '26 05:01



If you don't give NumPy any guidance, loadtxt falls back on its default dtype, which is float.

Your data look like decimal-format integers, but 00000000011100000000000000000000 (which is obviously equal to 11100000000000000000000) takes 74 bits, so it wouldn't fit in an int32 or an int64 even if you asked for one. Read as a float64, each line becomes a single number of about 1.11 * 10**22.

If you didn't realize that 1.11E22 means the same thing as 11100000000000000000000, you need to read up on scientific notation. 1.11E22 is the shorthand that Python (and C, and many other programming languages) uses for 1.11 * 10**22. Anyway, the reason you're getting scientific notation is that the default printout for an array of float64 is roughly %g-style, meaning something like "simple notation if -4 <= exponent < precision, otherwise exponential".

So, that's why you get [1.11000000e+22 1.11111100e+22].
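To check those numbers yourself, here is a minimal sketch in plain Python. The variable names are only for illustration, and it parses one line of the file directly instead of going through NumPy:

```python
line = "00000000011100000000000000000000"

# int() accepts leading zeros in a string, so this is the decimal value.
as_int = int(line)
print(as_int)               # 11100000000000000000000
print(as_int.bit_length())  # 74, too wide for int32 or int64

# Parsed as a float, the same text becomes the value NumPy printed.
as_float = float(line)
print(as_float)             # 1.11e+22
```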


The reason you get an array of shape (2,) instead of (2, 1) is that by default, loadtxt squeezes out axes of length 1. Add ndmin=2 if that's what you want.
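A quick way to see the difference, as a sketch that uses an in-memory io.StringIO as a stand-in for text.txt (that stand-in and the variable names are assumptions made so the snippet runs on its own):

```python
import io
import numpy as np

text = (
    "00000000011100000000000000000000\n"
    "00000000011111110000000000000000\n"
)

# Default: the single column is squeezed away, leaving a 1-D array.
flat = np.loadtxt(io.StringIO(text))
print(flat.shape)    # (2,)

# ndmin=2 keeps both axes: 2 rows, 1 column.
column = np.loadtxt(io.StringIO(text), ndmin=2)
print(column.shape)  # (2, 1)
```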


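If you ask NumPy to treat the data as strings, it will guess the right length, and read them as strings (the dtype='|S32' in the output is what Python 2 reports; on Python 3 the same call reports dtype='<U32'):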

>>> np.loadtxt('text.txt', dtype=str, ndmin=2)
array([['00000000011100000000000000000000'],
       ['00000000011111110000000000000000']],
      dtype='|S32')

Or, if you ask it to treat the data as Python objects, it'll leave them as Python str objects:

>>> np.loadtxt('text.txt', dtype=object, ndmin=2)
array([['00000000011100000000000000000000'],
       ['00000000011111110000000000000000']],
      dtype=object)

If you want them to be 128-bit integers… well, you probably don't have int128 support in your build, so you can't have that.

If you were hoping for them to be interpreted as bit strings and stored in 32-bit ints, you have to do that in two steps. I don't think NumPy can vectorize parsing bit strings usefully, so you might as well do that part in Python:

>>> np.fromiter((int(line, 2) for line in f), dtype=int)
array([7340032, 8323072])

If you want them interpreted as single-digit integers, loadtxt has no way to do that directly, but you can do that in two steps as well (e.g., read it as an array of 2 strings, treat each string as a sequence of characters, broadcast np.vectorize(int) over it).
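Here is one way those two steps might look. It is only a sketch: io.StringIO stands in for the open text.txt, the names lines, chars and digits are made up for the example, and chars.astype(int) would do the same job as np.vectorize(int) here.

```python
import io
import numpy as np

# Stand-in for open("text.txt") so the sketch is self-contained.
f = io.StringIO(
    "00000000011100000000000000000000\n"
    "00000000011111110000000000000000\n"
)

# Step 1: read each non-empty line as a string, dropping the newline.
lines = [line.strip() for line in f if line.strip()]

# Step 2: split each string into characters, then convert every character to an int.
chars = np.array([list(line) for line in lines])  # 2 rows of 32 one-character strings
digits = np.vectorize(int)(chars)

print(digits.shape)     # (2, 32)
print(digits[0, 8:13])  # [0 1 1 1 0]
```

Because every line has the same length, the list of lists turns into a regular 2-D array; ragged lines would need separate handling.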

Almost anything you want to do is doable, but you have to actually know what you want to do and be able to explain it to a human before you'll be able to explain it to numpy.

abarnert answered Jan 20 '26 06:01
