Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to convert OpenDocument spreadsheets to a pandas DataFrame?

The Python library pandas can read Excel spreadsheets and convert them to a pandas.DataFrame with pandas.read_excel(file) command. Under the hood, it uses xlrd library which does not support ods files.

Is there an equivalent of pandas.read_excel for ods files? If not, how can I do the same for an Open Document Formatted spreadsheet (ods file)? ODF is used by LibreOffice and OpenOffice.

like image 494
Lamps1829 Avatar asked Jul 24 '13 13:07

Lamps1829


People also ask

Can Python read ODS file?

You can read ODF (Open Document Format . ods ) documents in Python using the following modules: odfpy / read-ods-with-odfpy. ezodf.


3 Answers

This is available natively in pandas 0.25. So long as you have odfpy installed (conda install odfpy OR pip install odfpy) you can do

pd.read_excel("the_document.ods", engine="odf")
like image 123
Will Ayd Avatar answered Oct 21 '22 21:10

Will Ayd


You can read ODF (Open Document Format .ods) documents in Python using the following modules:

  • odfpy / read-ods-with-odfpy
  • ezodf
  • pyexcel / pyexcel-ods
  • py-odftools
  • simpleodspy

Using ezodf, a simple ODS-to-DataFrame converter could look like this:

import pandas as pd
import ezodf

doc = ezodf.opendoc('some_odf_spreadsheet.ods')

print("Spreadsheet contains %d sheet(s)." % len(doc.sheets))
for sheet in doc.sheets:
    print("-"*40)
    print("   Sheet name : '%s'" % sheet.name)
    print("Size of Sheet : (rows=%d, cols=%d)" % (sheet.nrows(), sheet.ncols()) )

# convert the first sheet to a pandas.DataFrame
sheet = doc.sheets[0]
df_dict = {}
for i, row in enumerate(sheet.rows()):
    # row is a list of cells
    # assume the header is on the first row
    if i == 0:
        # columns as lists in a dictionary
        df_dict = {cell.value:[] for cell in row}
        # create index for the column headers
        col_index = {j:cell.value for j, cell in enumerate(row)}
        continue
    for j, cell in enumerate(row):
        # use header instead of column index
        df_dict[col_index[j]].append(cell.value)
# and convert to a DataFrame
df = pd.DataFrame(df_dict)

P.S.

  • ODF spreadsheet (*.ods files) support has been requested on the pandas issue tracker: https://github.com/pydata/pandas/issues/2311, but it is still not implemented.

  • ezodf was used in the unfinished PR9070 to implement ODF support in pandas. That PR is now closed (read the PR for a technical discussion), but it is still available as an experimental feature in this pandas fork.

  • there are also some brute force methods to read directly from the XML code (here)
like image 19
davidovitch Avatar answered Oct 21 '22 23:10

davidovitch


Here is a quick and dirty hack which uses ezodf module:

import pandas as pd
import ezodf

def read_ods(filename, sheet_no=0, header=0):
    tab = ezodf.opendoc(filename=filename).sheets[sheet_no]
    return pd.DataFrame({col[header].value:[x.value for x in col[header+1:]]
                         for col in tab.columns()})

Test:

In [92]: df = read_ods(filename='fn.ods')

In [93]: df
Out[93]:
     a    b    c
0  1.0  2.0  3.0
1  4.0  5.0  6.0
2  7.0  8.0  9.0

NOTES:

  • all other useful parameters like header, skiprows, index_col, parse_cols are NOT implemented in this function - feel free to update this question if you want to implement them
  • ezodf depends on lxml make sure you have it installed
like image 11
MaxU - stop WAR against UA Avatar answered Oct 21 '22 23:10

MaxU - stop WAR against UA