I have a collection of ebooks in djvu, pdf and chm formats, and I am looking for a way to search their content for keywords. I have been researching and found a couple of suggestions for parsing pdf content, but there seems to be no way to convert the content of a djvu file into text. By any chance, does anyone know a way to decode djvu content into text so that I can search it easily?
Thanks
Assuming the djvu files contain OCR-ed text, a fast way on Linux to get that text out is to use Popen (from Python's subprocess module) to run djvutxt and grab the output.
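A minimal sketch of that approach, assuming djvutxt (part of DjVuLibre) is installed and on the PATH and that it prints the text layer to standard output when no output file is given. The function names djvu_to_text and find_keyword are made up for this illustration, and decoding as UTF-8 is an assumption about the output encoding:

```python
import subprocess


def djvu_to_text(path, djvutxt="djvutxt"):
    """Run djvutxt on `path` and return whatever it prints as a string.

    Assumes the djvutxt executable is available; raises
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    result = subprocess.run(
        [djvutxt, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


def find_keyword(text, keyword):
    """Return (line_number, line) pairs for lines containing `keyword`.

    The match is case-insensitive and line numbers start at 1.
    """
    needle = keyword.lower()
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.lower()
    ]
```

With a real file you would call something like find_keyword(djvu_to_text("book.djvu"), "compression"). A file without a text layer would most likely just give you empty output and therefore no matches.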
The text in a .djvu file is compressed with a djvu-specific compression algorithm, bzz, for which no simple C interface exists that you could load as a shared object in Python. It is a C++ implementation based on some framework.
Shameless self-promotion: I contributed to Calibre the conversion from OCR-ed .djvu, which uses djvutxt in this way. However, it falls back to my pure Python decoder implementation (sloooow) if djvutxt is not available. So you could use that code if you cannot use djvutxt.
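The "external tool first, slow pure Python decoder second" pattern can be sketched roughly as follows. This is not Calibre's actual code: extract_text is a made-up name, and the fallback argument stands in for whatever pure Python decoder you have (for example, one built from the Calibre sources), passed in as a plain callable:

```python
import shutil
import subprocess


def extract_text(path, fallback, tool="djvutxt"):
    """Return the text of the DjVu file at `path`.

    Uses the external `tool` when it can be found on the PATH,
    otherwise calls `fallback(path)` (hypothetical pure Python decoder).
    """
    if shutil.which(tool) is not None:
        output = subprocess.run(
            [tool, path],
            stdout=subprocess.PIPE,
            check=True,
        ).stdout
        # Assumption: the tool emits UTF-8 text.
        return output.decode("utf-8", errors="replace")
    # Tool not installed: use the (slower) fallback decoder.
    return fallback(path)
```

Passing the decoder in as a callable keeps the sketch independent of where the pure Python code actually lives.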
I have not yet put out the Python source separately from Calibre. But after downloading and extracting Calibre's source, you can locate the djvu-related files:
curl -L http://status.calibre-ebook.com/dist/src | tar xvJ
find . | fgrep djvu
The relevant files are djvu_input.py, djvu.py and djvubzzdec.py.
python-djvulibre is a set of Python bindings to the djvulibre open source implementation of djvu -- I haven't tried it, but it looks like it should meet your needs.
Certainly the DjVuLibre SDK will allow access to the text layer -- if it exists (not all DjVu files have a text layer; many are purely raster images).
An alternative solution might be to base your index on IIS technology. CamiNova has a free IFilter that you can use for this.
[http://dev.caminova.jp/beta/djvu-wic/][1]