
Excel worksheet to Numpy array

I'm trying to do an unbelievably simple thing: load parts of an Excel worksheet into a Numpy array. I've found a kludge that works, but it is embarrassingly unpythonic. Say my worksheet was loaded as "ws"; the code

A = np.zeros((37,3))
for i in range(2,39):
   for j in range(1,4):
      A[i-2,j-1]= ws.cell(row = i, column = j).value

loads the contents of "ws" into array A.

There MUST be a more elegant way to do this. For instance, csvread lets you do this much more naturally, and while I could well convert the .xlsx file into a csv one, the whole purpose of working with openpyxl was to avoid that conversion. So there we are, Collective Wisdom of the Mighty Intertubes: what's a more pythonic way to perform this conceptually trivial operation?

Thank you in advance for your answers.

PS: I operate Python 2.7.5 on a Mac via Spyder, and yes, I did read the openpyxl tutorial, which is the only reason I got this far.

El Niño asked Jun 08 '15 05:06




2 Answers

You could do

A = np.array([[i.value for i in j] for j in ws['C1':'E38']])

EDIT - further explanation. (firstly thanks for introducing me to openpyxl, I suspect I will use it quite a bit from time to time)

  1. Getting multiple cells from the worksheet object produces a generator. This is probably much more efficient if you want to work your way through a large sheet, as you can start straight away without waiting for it all to load into a list.
  2. To force a generator to make a list, you can use either list(ws['C1':'E38']) or a list comprehension as above.
  3. Each row is a tuple (even if only one column wide) of Cell objects. These carry a lot more than just a number, but if you want the number for your array you can use the .value attribute. This is really the crux of your question: csv files don't contain the structured info of an Excel spreadsheet.
  4. There isn't (as far as I can tell) a built-in method for extracting values from a range of cells, so you will have to do something effectively as you have sketched out.
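The points above can be sketched with a small self-contained example. The workbook here is built in memory purely for illustration (the sample values and the A2:C4 range are assumptions, not the asker's data); with a real file, ws would come from a loaded workbook instead.

```python
import numpy as np
from openpyxl import Workbook

# Build a tiny in-memory worksheet so the sketch runs on its own
# (illustrative data only; a real sheet would be loaded from a file).
wb = Workbook()
ws = wb.active
for r in range(1, 5):
    ws.append([r, r * 10, r * 100])  # fills columns A..C of row r

# Slicing the worksheet yields rows; list() materialises them whether
# the library hands back a generator or a tuple.
rows = list(ws['A2':'C4'])

# Each row is a tuple of Cell objects, not bare numbers.
first_cell = rows[0][0]

# .value pulls the stored number out of each Cell.
A = np.array([[cell.value for cell in row] for row in rows])
print(A.shape)
```

Note that the array shape falls out of the range itself, so nothing has to be preallocated or re-indexed.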

The advantages of doing it my way are: no need to work out the dimensions of the array and make an empty one to start with, no need to work out the corrected index numbers of the np array, and list comprehensions are faster. The disadvantage is that it needs the "corners" defined in "A1" format. If the range isn't known then you would have to use iter_rows, rows or columns:

A = np.array([[i.value for i in j[2:5]] for j in ws.rows])

If you don't know how many columns there are, then you will have to loop and check values, more like your original idea.
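For the case where the number of rows is not known in advance, a possible sketch uses iter_rows with numeric bounds. This assumes a reasonably recent openpyxl (the values_only flag is, to my knowledge, available from version 2.6 onward; older versions need the .value comprehension shown earlier). The header row and sample numbers are made up for illustration.

```python
import numpy as np
from openpyxl import Workbook

# Illustrative in-memory sheet: one header row plus five data rows.
wb = Workbook()
ws = wb.active
ws.append(["x", "y", "z"])  # header, skipped below
for r in range(1, 6):
    ws.append([r, r + 0.5, r * 2])

# min_row=2 skips the header; max_row is left out, so iteration runs
# to the last used row. values_only=True yields plain values instead
# of Cell objects (assumes a recent openpyxl release).
B = np.array(
    list(ws.iter_rows(min_row=2, min_col=1, max_col=3, values_only=True)),
    dtype=float,
)
print(B.shape)
```

Using numeric row and column bounds also avoids having to spell out the corners in "A1" notation.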

paddyg answered Oct 17 '22 02:10


If you don't need to load data from multiple files in an automated manner, the package tableconvert I recently wrote may help. Just copy and paste the relevant cells from the excel file into a multiline string and use the convert() function.

import numpy as np
from tableconvert.converter import convert

array = convert("""
123    456    3.14159
SOMETEXT    2,71828    0
""")

print(type(array))
print(array)

Output:

<class 'numpy.ndarray'>
[[ 123.       456.         3.14159]
 [       nan    2.71828    0.     ]]

Padix Key answered Oct 17 '22 04:10