I need an epub to text solution in Python

Question

I need to extract the text from an EPUB file.

from epub_conversion.utils import open_book, convert_epub_to_lines

f = open("demofile.txt", "a")
book = open_book("razvansividra.epub")
lines = convert_epub_to_lines(book)

I use this, but print(lines) prints only one line. The library is also 6 years old. Does anyone know a better way?

denis_lor · Accepted Answer

What about EbookLib? https://github.com/aerkalov/ebooklib

EbookLib is a Python library for managing EPUB2/EPUB3 and Kindle files. It's capable of reading and writing EPUB files programmatically (Kindle support is under development).

The API is designed to be as simple as possible, while at the same time making complex things possible too. It has support for covers, table of contents, spine, guide, metadata, and more.

import ebooklib
from ebooklib import epub

book = epub.read_epub('test.epub')

for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    print(doc)
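
The loop above prints the item objects rather than their text. Here is a minimal sketch of one way to go from those items to plain text, using only the standard library to strip the markup. The names `TextExtractor`, `xhtml_to_text` and `epub_to_text` are made up for this example, and it assumes each item's `get_content()` returns the XHTML of one document. The items may also come back in manifest order rather than reading order, so check that against your book.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text nodes only; each one ends up on its own line.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def xhtml_to_text(content):
    """Return the text of one XHTML document given as bytes or str."""
    if isinstance(content, bytes):
        content = content.decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(content)
    parser.close()
    return "\n".join(parser.parts)


def epub_to_text(path):
    """Hypothetical helper: join the text of every document item in the EPUB."""
    import ebooklib  # third-party package (EbookLib), imported lazily
    from ebooklib import epub

    book = epub.read_epub(path)
    return "\n\n".join(
        xhtml_to_text(item.get_content())
        for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT)
    )
```

With this in place, something like `open("demofile.txt", "w", encoding="utf-8").write(epub_to_text("test.epub"))` would write the text to a file. Because every text node is put on its own line, inline markup such as bold or italics will split a sentence across lines; joining with spaces inside block elements would be a refinement.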

emdou · Answer

Here is a rough script that extracts the text from an .epub in the right order. Improvements could be made.

Quick explanation:

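Takes the input (epub) and output (txt) file paths as first and second arguments
Extracts the epub content into a temporary directory
Parses 'META-INF/container.xml' to locate the package file (usually 'content.opf'), then parses that file for the xhtml content and its order
Extracts the text from each xhtml file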

Dependency: lxml

#!/usr/bin/python3
import shutil, os, sys, zipfile, tempfile
from lxml import etree

if len(sys.argv) != 3:
    print(f"Usage: {sys.argv[0]} <input.epub> <output.txt>")
    sys.exit(1)

inputFilePath=sys.argv[1]
outputFilePath=sys.argv[2]

print(f"Input: {inputFilePath}")
print(f"Output: {outputFilePath}")

with tempfile.TemporaryDirectory() as tmpDir:
    print(f"Extracting input to temp directory '{tmpDir}'.")
    with zipfile.ZipFile(inputFilePath, 'r') as zip_ref:
        zip_ref.extractall(tmpDir)

    with open(outputFilePath, "w", encoding="utf-8") as outFile:
        print(f"Parsing 'container.xml' file.")
        containerFilePath=f"{tmpDir}/META-INF/container.xml"
        tree = etree.parse(containerFilePath)
        for rootFilePath in tree.xpath( "//*[local-name()='container']"
                                        "/*[local-name()='rootfiles']"
                                        "/*[local-name()='rootfile']"
                                        "/@full-path"):
            print(f"Parsing '{rootFilePath}' file.")
            contentFilePath = f"{tmpDir}/{rootFilePath}"
            contentFileDirPath = os.path.dirname(contentFilePath)

            tree = etree.parse(contentFilePath)
            for idref in tree.xpath("//*[local-name()='package']"
                                    "/*[local-name()='spine']"
                                    "/*[local-name()='itemref']"
                                    "/@idref"):
                for href in tree.xpath( f"//*[local-name()='package']"
                                        f"/*[local-name()='manifest']"
                                        f"/*[local-name()='item'][@id='{idref}']"
                                        f"/@href"):
                    outFile.write("\n")
                    xhtmlFilePath = f"{contentFileDirPath}/{href}"
                    subtree = etree.parse(xhtmlFilePath, etree.HTMLParser())
                    for ptag in subtree.xpath("//html/body/*"):
                        for text in ptag.itertext():
                            outFile.write(f"{text}")
                        outFile.write("\n")

print(f"Text written to '{outputFilePath}'.")
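
If installing lxml or unpacking to a temporary directory is not wanted, the container and spine lookup can also be done straight from the zip archive with the standard library. This is only a sketch under a few assumptions: the function name `spine_documents` is made up, the EPUB is assumed to use the standard container and OPF namespaces, and only the first rootfile is read.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

# Standard namespaces for META-INF/container.xml and the OPF package file.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_documents(epub_path):
    """Return archive paths of the content documents in spine (reading) order."""
    with zipfile.ZipFile(epub_path) as zf:
        # container.xml points at the package (.opf) file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        opf_dir = posixpath.dirname(opf_path)

        # The manifest maps item ids to hrefs; the spine lists ids in order.
        package = ET.fromstring(zf.read(opf_path))
        hrefs = {
            item.get("id"): item.get("href")
            for item in package.findall("opf:manifest/opf:item", OPF_NS)
        }
        return [
            # hrefs are relative to the OPF file and may be percent-encoded.
            posixpath.normpath(posixpath.join(opf_dir, unquote(hrefs[ref.get("idref")])))
            for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
            if ref.get("idref") in hrefs
        ]
```

Each returned path can then be read with `zipfile.ZipFile(epub_path).read(path)` and passed to whatever HTML-to-text step you prefer.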
