In brief: how can I export a Google Sheets spreadsheet to an SQLite database without losing the cell-anchored images?
In long: Google Sheets, Excel, and SQLite all allow cell-anchored images. Furthermore, Sheets supports exporting to Excel without loss of such images; and companion programs such as "DB Browser for SQLite", and LibreOffice also support cell-anchored images. However, I have not been able to export a Sheet (or an Excel spreadsheet) to SQLite, though I have tried all the obvious possibilities, and some less obvious ones as well. In the latter category, two attempts are noteworthy:
a) The Python package openpyxl explicitly says "All other workbook / worksheet attributes are not copied - e.g. Images, Charts."
b) Python's pandas
is more promising, because of the dtype
parameter of read_excel
. Supposedly,
specifying this as object
should allow preservation of objects such as cell-anchored images.
Here then is one of my (failed) attempts to use pandas to achieve the desired result:
import sqlite3
import pandas as pd
filename="Test"
con=sqlite3.connect(filename+".db")
wb = pd.read_excel('Test.xlsx',sheet_name = None, header = None, dtype = object)
for sheet in wb:
print(sheet) # Sheet1
# print( wb[sheet].columns )
wb[sheet].to_sql(sheet, con, index=False)
con.commit()
con.close()
Any solution, whether Python-based or not, would be gladly accepted.
I'm aware of several techniques for extracting all the images into separate files but am looking for a fully automated technique (presumably some kind of script) for performing the conversion. Whether or not such a technique extracts the images as an intermediate step is immaterial.
I've also tried adding dtype
specifications in the call to to_sql
, but to no avail.
@Stef's original program requires that the images to be copied are all in named columns, and that these names are either known or can be determined. The first assumption is acceptable, and the second can be relaxed by simply writing:
dtype = object
in the call to read_excel
.
There's no direct way, but you can use openpyxl
version 2.5.5 or later to read images and manually put them into the dataframe.
In the following minimal example I use pandas read_excel
to first get all the data, except the images. The crucial point is to import the image column as object
type in order to be able to assign the images later. Otherwise this empty column will get all NaN
s and a float data type.
Then we read the images from Excel using openpyxl
and import them into the dataframe. The ref
attribute of the image holds an _io.BytesIO
stream. Its pointer points to the end (EOF) after loading the workbook, so we'll have to rewind it first (img.ref.seek(0)
). (btw, there seems to be a bug in the img.path
names in openpyxl
: I get the same path /xl/media/image1.png
for all three images whereas it is image{1,2,3}.png
in the xlsx).
Anchor row/column values are zero based (img.anchor.idx_base == 0
), so we have to account for the header row when computing the iat
position in the dataframe (and possible index columns if any). Finally we export the dataframe to SQL using to_sql
.
import pandas as pd
import openpyxl
import sqlite3
file_name = 'so58068593.xlsx'
sheet_name = 'Tabelle1'
# read data into dataframe
df = pd.read_excel(file_name, sheet_name=sheet_name, dtype=object)
# read images and add them to dataframe
wb = openpyxl.load_workbook(file_name)
ws = wb[sheet_name]
for img in ws._images:
img.ref.seek(0)
df.iat[img.anchor.to.row-1, img.anchor.to.col] = img.ref.read()
# export to sqlite
with sqlite3.connect(file_name + ".db") as con:
df.to_sql(sheet_name, con=con)
Excel file (images taken from Wikipedia):
SQLite database viewed in DB Browser for SQLite:
This is just a minimal example. If you don't know in advance where the images are in your xlsx
file, you could first iterate over the images
collection of the worksheet and check which columns/rows you need for the images in your dataframe, then append them to the dataframe (if not already there) and only then assign the images. Please note, however, that in xlsx
you can have data in a cell and at the same time an image achored to this cell, which of course can't be mapped to a database table or pandas dataframe. The reason is that images are not the content of a cell but just anchored to this cell (you could even have several images anchored to the same cell).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With