Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas read_excel with Hyperlink

Tags:

pandas

excel

I have an Excel spreadsheet that I am reading into a Pandas DataFrame:

df = pd.read_excel("file.xls")

However, one of the columns of the spreadsheet contains text which have a hyperlink associated with it. How do I access the underlying hyperlink in Pandas?

like image 818
slaw Avatar asked Feb 10 '16 21:02

slaw


2 Answers

Quick monkey patching, without converters or anything like this, if you would like to treat ALL cells with hyperlinks as hyperlinks, more sophisticated way, I suppose, at least be able to choose, what columns treat as hyperlinked or gather data, or save somehow both data and hyperlink in same cell at dataframe. And using converters, dunno. (BTW I played also with data_only, keep_links, did not helped, only changing read_only resulted ok, I suppose it can slow down your code speed).

P.S.: Works only with xlsx, i.e., engine is openpyxl

P.P.S.: If you reading this comment in the future and issue https://github.com/pandas-dev/pandas/issues/13439 still Open, don't forget to see changes in _convert_cell and load_workbook at pandas.io.excel._openpyxl and update them accordingly.

import pandas
from pandas.io.excel._openpyxl import OpenpyxlReader
import numpy as np
from pandas._typing import FilePathOrBuffer, Scalar


def _convert_cell(self, cell, convert_float: bool) -> Scalar:
    from openpyxl.cell.cell import TYPE_BOOL, TYPE_ERROR, TYPE_NUMERIC
    # here we adding this hyperlink support:
    if cell.hyperlink and cell.hyperlink.target:
        return cell.hyperlink.target
        # just for example, you able to return both value and hyperlink,
        # comment return above and uncomment return below
        # btw this may hurt you on parsing values, if symbols "|||" in value or hyperlink.
        # return f'{cell.value}|||{cell.hyperlink.target}'
    # here starts original code, except for "if" became "elif"
    elif cell.is_date:
        return cell.value
    elif cell.data_type == TYPE_ERROR:
        return np.nan
    elif cell.data_type == TYPE_BOOL:
        return bool(cell.value)
    elif cell.value is None:
        return ""  # compat with xlrd
    elif cell.data_type == TYPE_NUMERIC:
        # GH5394
        if convert_float:
            val = int(cell.value)
            if val == cell.value:
                return val
        else:
            return float(cell.value)

    return cell.value


def load_workbook(self, filepath_or_buffer: FilePathOrBuffer):
    from openpyxl import load_workbook
    # had to change read_only to False:
    return load_workbook(
        filepath_or_buffer, read_only=False, data_only=True, keep_links=False
    )


OpenpyxlReader._convert_cell = _convert_cell
OpenpyxlReader.load_workbook = load_workbook

And after adding this above in your python file, you will be able to call df = pandas.read_excel(input_file)

After writing all this stuff it came to me, that maybe it would be easier and cleaner just use openpyxl by itself ^_^

like image 66
Hellohowdododo Avatar answered Sep 24 '22 00:09

Hellohowdododo


This can be done with openpyxl, I'm not sure its possible with Pandas at all. Here's how I've done it:

import openpyxl

wb = openpyxl.load_workbook('yourfile.xlsm')
ws = wb.get_sheet_by_name('Sheet1')
print(ws.cell(row=2, column=1).hyperlink.target)

You can also use iPython, and set a variable equal to the hyperlink object:

t = ws.cell(row=2, column=1).hyperlink

then do t. and press tab to see all the options for what you can do with or access from the object.

like image 41
wordsforthewise Avatar answered Sep 23 '22 00:09

wordsforthewise