
Getting "Ole::Storage::FormatError: OLE2 signature is invalid" when trying to get content out of a Word doc

I'm using Rails 5. I want to get the text out of a Word document (.doc), so I'm using this code:

  text = nil
  MSWordDoc::Extractor.load(file_location) do |contents|
    text = contents.whole_contents
  end

but I'm getting the error below. I have this gem in my Gemfile:

gem 'msworddoc-extractor'

What else do I need to do to get the content out of a Word doc? It would be great if I could apply the same code to .docx files as I do to .doc files.

/Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/support.rb:201: warning: constant ::Fixnum is deprecated
Ole::Storage::FormatError: OLE2 signature is invalid
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:378:in `validate!'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:370:in `initialize'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:112:in `new'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:112:in `load'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:79:in `initialize'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:85:in `new'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/ruby-ole-1.2.12/lib/ole/storage/base.rb:85:in `open'
    from /Users/davea/.rvm/gems/ruby-2.4.0/gems/msworddoc-extractor-0.2.0/lib/msworddoc/extractor.rb:11:in `load'
    from /Users/davea/Documents/workspace/myproject/app/services/msword_processor_service.rb:12:in `pre_process_data'
    from /Users/davea/Documents/workspace/myproject/app/services/abstract_import_service.rb:88:in `process_race_data'
    from (irb):2
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
    from bin/rails:4:in `require'
    from bin/rails:4:in `<main>'
Dave asked Mar 19 '17 21:03


1 Answer

The gem that you are using has the gem ruby-ole as a dependency. You can see it in the code:

ole = Ole::Storage.open(file)

When you import your Word document it is really being opened by the ruby-ole gem. That gem will raise an exception if it cannot validate that the file is the proper format:

raise FormatError, "OLE2 signature is invalid" unless magic == MAGIC

MAGIC is the 8-byte signature that ruby-ole expects at the very start of the .doc file:

# i have seen it pointed out that the first 4 bytes of hex,
# 0xd0cf11e0, is supposed to spell out docfile. hmmm :)
MAGIC = "\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # expected value of Header#magic

This matches the signature field of the CFBF (Compound File Binary Format) header, the container format used by legacy .doc files:

BYTE _abSig[8];             // [00H,08] {0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1,
                            // 0x1a, 0xe1} for current version
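
To see whether a given file would pass that check before handing it to the gem, you can compare its first 8 bytes against the same signature yourself. This is only a minimal sketch: `ole2_signature?` and `OLE2_SIGNATURE` are names made up for this example, not part of ruby-ole or msworddoc-extractor.

```ruby
# Expected first 8 bytes of an OLE2 / CFBF file (same bytes as ruby-ole's MAGIC).
OLE2_SIGNATURE = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b

# Hypothetical helper: true when the file at `path` starts with the OLE2 signature.
def ole2_signature?(path)
  # Read only the first 8 bytes in binary mode; this is nil for an empty file,
  # which simply compares as not equal.
  header = File.binread(path, 8)
  header == OLE2_SIGNATURE
end
```

Calling `ole2_signature?(file_location)` in the Rails console should return false for any file that triggers the FormatError above.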

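Either your .doc file is not a valid Word document, or it was saved in a newer format that the ruby-ole gem does not support. A .docx file, for example, is a ZIP-based container rather than an OLE2 one, so it will fail this signature check even if it has been renamed to .doc.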

I recommend retrying the operation with several different Word documents to find a compatible type, then re-save your original document in that format to try again.
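
If you have several documents to try, a quick survey of their leading bytes can tell you which ones are OLE2 files at all. The sketch below is an assumption-laden illustration: `detect_container`, `survey`, and `SIGNATURES` are hypothetical names, and the check looks only at the first bytes. A `:zip` result means "some ZIP archive", which is what a .docx file is, but it does not prove the archive is a Word document.

```ruby
# Leading bytes of the two container types of interest.
SIGNATURES = {
  ole2: "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b, # legacy .doc (CFBF / OLE2)
  zip:  "PK\x03\x04".b                        # ZIP archive, the container used by .docx
}.freeze

# Hypothetical helper: returns :ole2, :zip, or :unknown based on leading bytes only.
def detect_container(path)
  header = File.binread(path, 8).to_s # to_s turns nil (empty file) into ""
  match = SIGNATURES.find { |_type, magic| header.start_with?(magic) }
  match ? match.first : :unknown
end

# Hypothetical helper: maps each path to its detected container type.
def survey(paths)
  paths.map { |path| [path, detect_container(path)] }.to_h
end
```

For example, `survey(Dir.glob('samples/*.doc'))` (the `samples` directory is made up) returns a hash of path to type. Only the entries marked `:ole2` are candidates for MSWordDoc::Extractor; files that come back as `:zip` would need a different, .docx-aware library.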

anothermh answered Nov 08 '22 00:11