
How to extract text from DjVu and other ebook formats (possibly in Python) [closed]

I have a collection of ebooks in DjVu, PDF, and CHM formats, and I am looking for a way to search their content for keywords. I have been researching this and found a couple of suggestions for parsing PDF content, but there seems to be no way to convert the content of a DjVu file into text. Does anyone know a way to decode DjVu content into text so that I can search it easily?

Thanks

asked Oct 08 '09 15:10 by leon


3 Answers

Assuming the DjVu files contain OCR-ed text, a fast way to get it out on Linux is to run djvutxt with subprocess.Popen and capture the output.
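A minimal sketch of the Popen approach, assuming the djvutxt tool from DjVuLibre is installed and on the PATH. The function name `djvu_text` and the choice to decode the output as UTF-8 are illustrative assumptions, not part of any library.

```python
import subprocess


def djvu_text(path, djvutxt="djvutxt"):
    """Run djvutxt on a DjVu file and return its text layer as a string.

    Returns None if the djvutxt executable cannot be found.
    """
    try:
        proc = subprocess.Popen(
            [djvutxt, path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
    except FileNotFoundError:
        return None  # djvutxt is not installed or not on the PATH
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(err.decode("utf-8", errors="replace").strip())
    # Assumption: the text layer is emitted as UTF-8.
    return out.decode("utf-8", errors="replace")
```

For a keyword search, the returned string can be tested directly, for example `"keyword" in djvu_text("book.djvu").lower()`, after checking that the result is not None. A file without a text layer would most likely just yield empty or near-empty output.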

The text in a .djvu file is compressed with a DjVu-specific compression algorithm, bzz, for which no simple C interface exists that you could load as a shared object in Python. The existing implementation is in C++ and based on some framework.

Shameless self-promotion: I contributed to Calibre the conversion from OCR-ed .djvu, which uses djvutxt in this way. However, it falls back to my pure Python decoder implementation (sloooow) if djvutxt is not available. So you could use that code if you cannot use djvutxt.
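The fallback described here could look roughly like the sketch below. It is not Calibre's actual code: `extract_text` and the `pure_python_decoder` callable are hypothetical names standing in for whatever slower decoder is available.

```python
import shutil
import subprocess


def extract_text(path, pure_python_decoder, djvutxt="djvutxt"):
    """Prefer the fast external djvutxt tool; otherwise use a slower decoder.

    pure_python_decoder is a hypothetical callable that takes a file path
    and returns the extracted text as a string.
    """
    if shutil.which(djvutxt) is not None:
        result = subprocess.run(
            [djvutxt, path],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=True,  # raise CalledProcessError if djvutxt fails
        )
        # Assumption: the text layer is emitted as UTF-8.
        return result.stdout.decode("utf-8", errors="replace")
    # djvutxt is unavailable: fall back to the slow path.
    return pure_python_decoder(path)
```

Passing the decoder in as a callable keeps the sketch independent of where the pure Python implementation actually lives.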

I have not yet released the Python source separately from Calibre, but you can find it after downloading and extracting Calibre's source:

curl -L http://status.calibre-ebook.com/dist/src | tar xvJ
find . | fgrep djvu

The relevant files are djvu_input.py, djvu.py, and djvubzzdec.py.

answered Nov 19 '22 08:11 by Anthon


python-djvulibre is a set of Python bindings to the djvulibre open source implementation of djvu -- I haven't tried it, but it looks like it should meet your needs.

answered Nov 19 '22 08:11 by Alex Martelli


Certainly the DjVuLibre SDK will allow access to the text layer -- if it exists (not all DjVu files have a text layer; many are purely raster images).

An alternative solution might be to base your index on IIS technology. CamiNova has a free IFilter that you can use for this.

http://dev.caminova.jp/beta/djvu-wic/

answered Nov 19 '22 08:11 by msr