Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

OleDb connection to Excel; how do I select fixed width, unbounded height?

Tags:

c#

excel

oledb

I'm using OleDb to select data from excel spreadsheets. Each spreadsheet can contain many small tables, and possibly furniture like titles and labels. So it might look like this, where we have two tables and some titles;

            A           B         C          D
    1    .           .         .          .
    2    .           .         .          .
    3    Table1      .         .          .
    4    Header1     HEADER2   .          .
    5    h           huey      .          .
    6    d           dewey     .          .
    7    l           loius     .          .
    8    s           scrooge   .          .
    9    .           .         .          .
    10   .           .         .          .
    11   .           .         .          .
    12   .           .         .          .
    13   .           Table 2   .          .
    14   .           HEADER1   HEADER2    HEADER3
    15   .           1         foo        x
    16   .           2         bar        y
    17   .           3         baz        z
    18   .           .         .          .
    19   .           .         .          .

In a previous step, the user has selected the headers of the table they are interested in; in this case, looking at table 2 they will have selected the range B14:D14.

These settings are saved, and then I need to query that table. It may happen over and over, as the spreadsheet data is updated; more rows may be added at any time, but the headers are always fixed. There is a sentinel (blank row) marking the end of data

To select the data in the table, I'm writing a query like this;

SELECT * FROM [Sheet1$B14:D65535]

to select the data in table 2, and then manually checking for the sentinel row, but this seems unsatisfying. Excel 2003 can only read 65,535 rows (uint16), but excel 2007 can read many more (uint32), so I have to write code which gives a different query for Excel 2003 and 2007 based on the extension of the file (.xls vs .xls?).

Does anyone know of a way to write a query that says either;

  • 'select everything down and right of B14'?
  • 'select everything in columns B->D'
  • 'select B12:D*' where * means 'everything you can'
like image 313
Steve Cooper Avatar asked Jul 16 '09 14:07

Steve Cooper


2 Answers

Pre-requisite: you can easily determine in your code what the maximum number number of rows is.

Assuming (1) there's a big overhead per SELECT, so SELECTing a row at a time is slow (2) SELECTing 64K or 8M rows (even if blank) is slow ... so you want to see if somewhere in the middle can be faster. Try this:

Select CHUNKSIZE (e.g. 100 or 1000) rows at a time (less when you would otherwise over-run MAX_ROWS). Scan each chunk for the blank row that marks end-of-data.

UPDATE: Actually answering the explicit questions:

Q: Does anyone know of a way to write a query that says either;

Q1: 'select everything down and right of B14'?

A1: select * from [Sheet1$B12:] doesn't work. You would have to do ...B12:IV in Excel 2003 and whatever it is in Excel 2007. However you don't need that because you know what your rightmost column is; see below.

Q2: 'select everything in columns B->D'

A2: select * from [Sheet1$B:D]

Q3: 'select B12:D*' where * means 'everything you can'

A3: select * from [Sheet1$B12:D]

Tested with Python 2.5 using the following code:

import win32com.client
import sys
filename, sheetname, range = sys.argv[1:4]
DSN= """
    PROVIDER=Microsoft.Jet.OLEDB.4.0;
    DATA SOURCE=%s;
    Extended Properties='Excel 8.0;READONLY=true;IMEX=1';
    """ % filename
conn = win32com.client.Dispatch("ADODB.Connection")
conn.Open(DSN)
rs = win32com.client.Dispatch("ADODB.Recordset")
sql = (
    "SELECT * FROM [Excel 8.0;HDR=NO;IMEX=1;Database=%s;].[%s$%s]"
    % (filename, sheetname, range)
    )
rs.Open(sql, conn)
nrows = 0
while not rs.EOF:
    nrows += 1
    nf = rs.Fields.Count
    values = [rs.Fields.Item(i).Value for i in xrange(nf)]
    print nrows, values
    if not any(value is not None for value in values):
        print "sentinel found"
        break
    rs.MoveNext()
rs.Close()
conn.Close()
like image 180
John Machin Avatar answered Sep 18 '22 18:09

John Machin


Couple possible solutions:

  1. Put your tables on separate worksheets, then simply query the whole worksheet.
  2. Give each table in Excel a name (in Excel 2007, select the table, right-click, and choose Name a range...), then in your query, use this name instead of "Sheet1$B14:D65535".

Hope that helps.

EDIT

Here's a third idea:

I'm not sure what you're using to query your database, but if your query engine supports variables (like Sql Server, for example) you could store the result of...

SELECT COUNT(*) FROM NameOfServer...Sheet1$

...in a variable called @UsedRowCount, that will give you the number of rows actually used in the worksheet. So, @UsedRowCount = LastRowUsed - InitialBlankRows.

You might then be able to use string concatenation to replace "65535" with @UsedRowCount + @InitialBlankRows. You would have to set @InitialBlankRows to a constant (in your example, it would be 3, since the heading row of the first table is located at Row 4).

like image 20
devuxer Avatar answered Sep 20 '22 18:09

devuxer