Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

importing a spreadsheet with cell-anchored images into an SQLite database

In brief: how can I export a Google Sheets spreadsheet to an SQLite database without losing the cell-anchored images?

In long: Google Sheets, Excel, and SQLite all allow cell-anchored images. Furthermore, Sheets supports exporting to Excel without loss of such images; and companion programs such as "DB Browser for SQLite", and LibreOffice also support cell-anchored images. However, I have not been able to export a Sheet (or an Excel spreadsheet) to SQLite, though I have tried all the obvious possibilities, and some less obvious ones as well. In the latter category, two attempts are noteworthy:

a) The Python package openpyxl explicitly says "All other workbook / worksheet attributes are not copied - e.g. Images, Charts."

b) Python's pandas is more promising, because of the dtype parameter of read_excel. Supposedly, specifying this as object should allow preservation of objects such as cell-anchored images.

Here then is one of my (failed) attempts to use pandas to achieve the desired result:

import sqlite3
import pandas as pd

filename="Test"
con=sqlite3.connect(filename+".db")

wb = pd.read_excel('Test.xlsx',sheet_name = None, header = None, dtype = object)

for sheet in wb:
    print(sheet)  # Sheet1
    # print( wb[sheet].columns )
    wb[sheet].to_sql(sheet, con, index=False)
con.commit()
con.close()

Any solution, whether Python-based or not, would be gladly accepted.

Clarification

I'm aware of several techniques for extracting all the images into separate files but am looking for a fully automated technique (presumably some kind of script) for performing the conversion. Whether or not such a technique extracts the images as an intermediate step is immaterial.

I've also tried adding dtype specifications in the call to to_sql, but to no avail.

Addendum

@Stef's original program requires that the images to be copied are all in named columns, and that these names are either known or can be determined. The first assumption is acceptable, and the second can be relaxed by simply writing:

dtype = object

in the call to read_excel.

like image 345
peak Avatar asked Oct 15 '22 11:10

peak


1 Answers

There's no direct way, but you can use openpyxl version 2.5.5 or later to read images and manually put them into the dataframe.

In the following minimal example I use pandas read_excel to first get all the data, except the images. The crucial point is to import the image column as object type in order to be able to assign the images later. Otherwise this empty column will get all NaNs and a float data type.
Then we read the images from Excel using openpyxl and import them into the dataframe. The ref attribute of the image holds an _io.BytesIO stream. Its pointer points to the end (EOF) after loading the workbook, so we'll have to rewind it first (img.ref.seek(0)). (btw, there seems to be a bug in the img.path names in openpyxl: I get the same path /xl/media/image1.png for all three images whereas it is image{1,2,3}.png in the xlsx).
Anchor row/column values are zero based (img.anchor.idx_base == 0), so we have to account for the header row when computing the iat position in the dataframe (and possible index columns if any). Finally we export the dataframe to SQL using to_sql.

import pandas as pd
import openpyxl
import sqlite3

file_name = 'so58068593.xlsx'
sheet_name = 'Tabelle1'

# read data into dataframe
df = pd.read_excel(file_name, sheet_name=sheet_name, dtype=object)

# read images and add them to dataframe
wb = openpyxl.load_workbook(file_name)
ws = wb[sheet_name]
for img in ws._images:
    img.ref.seek(0)
    df.iat[img.anchor.to.row-1, img.anchor.to.col] = img.ref.read()

# export to sqlite
with sqlite3.connect(file_name + ".db") as con:
    df.to_sql(sheet_name, con=con)

Excel file (images taken from Wikipedia):

enter image description here

SQLite database viewed in DB Browser for SQLite:

enter image description here

This is just a minimal example. If you don't know in advance where the images are in your xlsx file, you could first iterate over the images collection of the worksheet and check which columns/rows you need for the images in your dataframe, then append them to the dataframe (if not already there) and only then assign the images. Please note, however, that in xlsx you can have data in a cell and at the same time an image achored to this cell, which of course can't be mapped to a database table or pandas dataframe. The reason is that images are not the content of a cell but just anchored to this cell (you could even have several images anchored to the same cell).

like image 139
Stef Avatar answered Oct 18 '22 11:10

Stef