I am given the an excel file which contains some text formatting. Some can be bold, some italic, some are supercase1, and some other formats (but not as many as the three mentioned).
Examples:
Now, since this cell is to be made as dictionary (real, human, dictionary) database entry, I would like to retain the format of the cell, as it will be beneficial to tell the usage of the word (such as bold in the above case indicating the word type: v (verb) and italic indicating new section).
But it is all in the excel cell.
When I try to simply read the excel file directly using database tool like Toad for Oracle, the format is gone!
<b>v</b>
and that will be my work. I only want to know how we retain or detect the excel cell text format in Python. (in particular are these three formats: bold, italic, and supercase)Edit:
I try to get the text format with xlrd package, but I can't seem to find a way to get the text format style as the cell
object only consists of: ctype
, value
, and xf_index
. It has no info about the text format, and when I create the instance with the formatting_info=True
:
book = xlrd.open_workbook("HuluHalaDict.xlsx", sys.stdout, 0, xlrd.USE_MMAP, None, None, \
formatting_info=True, on_demand=False, ragged_rows=False)
I got the following error:
NotImplementedError: formatting_info=True not yet implemented
Raised by this line in the xlsx.py
file of the xlrd
package:
if formatting_info:
raise NotImplementedError("formatting_info=True not yet implemented")
Which I found it strange, since I use version 0.9.4 xlrd (latest) and the documentation says that since version above 0.6.1, the formatting info is included:
Default Formatting
Default formatting is applied to all empty cells (those not described by a cell record). Firstly row default information (ROW record, Rowinfo class) is used if available. Failing that, column default information (COLINFO record, Colinfo class) is used if available. As a last resort the worksheet/workbook default cell format will be used; this should always be present in an Excel file, described by the XF record with the fixed index 15 (0-based). By default, it uses the worksheet/workbook default cell style, described by the very first XF record (index 0). Formatting features not included in xlrd version 0.6.1
Rich text i.e. strings containing partial bold italic and underlined text, change of font inside a string, etc. See OOo docs s3.4 and s3.2 Asian phonetic text (known as "ruby"), used for Japanese furigana. See OOo docs s3.4.2 (p15) Conditional formatting. See OOo docs s5.12, s6.21 (CONDFMT record), s6.16 (CF record) Miscellaneous sheet-level and book-level items e.g. printing layout, screen panes. Modern Excel file versions don't keep most of the built-in "number formats" in the file; Excel loads formats according to the user's locale. Currently xlrd's emulation of this is limited to a hard-wired table that applies to the US English locale. This may mean that currency symbols, date order, thousands separator, decimals separator, etc are inappropriate. Note that this does not affect users who are copying XLS files, only those who are visually rendering cells.
Did I make any mistake here? My code is simply as shown:
book = xlrd.open_workbook("HuluHalaDict.xlsx", sys.stdout, 0, xlrd.USE_MMAP, None, None, \
formatting_info=True, on_demand=False, ragged_rows=False)
Edit 2:
The example shown in the post shows that it creates the class instance (book
) with formatting_info=True
. But I check it in my implementation. It raises the error above. Any idea?
To read an excel file as a DataFrame, use the pandas read_excel() method. You can read the first sheet, specific sheets, multiple sheets or all sheets. Pandas converts this to the DataFrame structure, which is a tabular like structure.
Openpyxl is a Python library for reading and writing Excel (with extension xlsx/xlsm/xltx/xltm) files. The openpyxl module allows Python program to read and modify Excel files.
I suggest you the library xlrd https://secure.simplistix.co.uk/svn/xlrd/trunk/xlrd/doc/xlrd.html?p=4966
On GitHub here https://github.com/python-excel/xlrd
You can find an easy example on how to use xlrd to determine the font style here Using XLRD module and Python to determine cell font style (italics or not)
Here a practical example:
from xlrd import open_workbook
path = '/Users/.../Desktop/Workbook1.xls'
wb = open_workbook(path, formatting_info=True)
sheet = wb.sheet_by_name("Sheet1")
cell = sheet.cell(0, 0) # The first cell
print("cell.xf_index is", cell.xf_index)
fmt = wb.xf_list[cell.xf_index]
print("type(fmt) is", type(fmt))
print("Dumped Info:")
fmt.dump()
It outputs the following:
cell.xf_index is 62
type(fmt) is <class 'xlrd.formatting.XF'>
Dumped Info:
_alignment_flag: 0
_background_flag: 0
_border_flag: 0
_font_flag: 1
_format_flag: 0
_protection_flag: 0
alignment (XFAlignment object):
hor_align: 0
indent_level: 0
rotation: 0
shrink_to_fit: 0
text_direction: 0
text_wrapped: 0
vert_align: 2
background (XFBackground object):
background_colour_index: 65
fill_pattern: 0
pattern_colour_index: 64
border (XFBorder object):
bottom_colour_index: 0
bottom_line_style: 0
diag_colour_index: 0
diag_down: 0
diag_line_style: 0
diag_up: 0
left_colour_index: 0
left_line_style: 0
right_colour_index: 0
right_line_style: 0
top_colour_index: 0
top_line_style: 0
font_index: 6
format_key: 0
is_style: 0
lotus_123_prefix: 0
parent_style_index: 0
protection (XFProtection object):
cell_locked: 1
formula_hidden: 0
xf_index: 62
Where _font_flag: 1
indicates that is Bold
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With