
OneNote parsing - how to get to the Text Blobs in the document?

I am writing a parser for the OneNote .one file format, which I plan to contribute to the Apache Tika project once it is finished.

Here is the open source project I'm creating, licensed under the Apache License 2.0: https://github.com/nddipiazza/onenote-parser-java

I used the specification document here: https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-one/73d22548-a613-4350-8c23-07d15576be50

As a starting point, I ported over the code from this open source C++ project: https://github.com/dropbox/onenote-parser

I have gotten a long way in parsing the documents, but I've hit a roadblock.

Here is the OneNote file I'm using to parse: https://drive.google.com/file/d/1uROTEnKeBKU08CG_K5zdDTGHa178LgBK/view?usp=sharing

Here is the section from this document

The text areas Section1TextArea1 and Section1TextArea2 do not appear anywhere in my parsed results, so I must be missing some key step in parsing the data.

It is definitely in the OneNote file itself. I can see it in the Hex viewer:

[Image: hex editor view of the content]

Here is the JSON parse output: https://gist.github.com/nddipiazza/02d2252d357b3b02a6b9ab1050474267

I feel like the spec document is missing some very important information needed in order to parse this proprietary format.

What major element(s) am I missing that prevent me from getting the actual text content?

Nicholas DiPiazza asked Nov 23 '19 13:11





1 Answer

I figured it out. The key was understanding that a property value in a OneNote file can hold any of the following:

  • Binary contents
  • ASCII text contents
  • UTF-16LE text contents

All three kinds are sprinkled throughout the file, so the parser has to handle each of them.
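To illustrate the idea, here is a minimal sketch of how a property value's raw bytes might be classified and decoded. It is not the code from the linked project: the class name, the `guessKind` heuristic, and the helper methods are all hypothetical, and the heuristic only recognizes UTF-16LE text whose characters fall in the printable ASCII range (every second byte is zero).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the project's actual implementation.
final class PropertyValueSketch {

    enum Kind { BINARY, ASCII, UTF16LE }

    // True for printable ASCII plus tab, newline, and carriage return.
    static boolean isPrintable(int c) {
        return (c >= 0x20 && c < 0x7F) || c == '\t' || c == '\n' || c == '\r';
    }

    static boolean isPrintableAscii(byte[] data) {
        for (byte b : data) {
            if (!isPrintable(b & 0xFF)) return false;
        }
        return true;
    }

    // Assumption: treat data as UTF-16LE only if each 2-byte unit is a
    // printable ASCII character followed by a zero high byte.
    static boolean looksLikeUtf16Le(byte[] data) {
        if (data.length == 0 || data.length % 2 != 0) return false;
        for (int i = 0; i < data.length; i += 2) {
            if (data[i + 1] != 0 || !isPrintable(data[i] & 0xFF)) return false;
        }
        return true;
    }

    // Heuristic guess at the kind of content; anything unrecognized is binary.
    static Kind guessKind(byte[] data) {
        if (data.length == 0) return Kind.BINARY;
        if (looksLikeUtf16Le(data)) return Kind.UTF16LE;
        if (isPrintableAscii(data)) return Kind.ASCII;
        return Kind.BINARY;
    }

    // Returns the decoded text, or null when the value looks like binary data.
    static String decode(byte[] data) {
        switch (guessKind(data)) {
            case UTF16LE:
                return new String(data, StandardCharsets.UTF_16LE);
            case ASCII:
                return new String(data, StandardCharsets.US_ASCII);
            default:
                return null;
        }
    }
}
```

A real parser would likely need to handle non-Latin UTF-16LE text and null terminators as well; this sketch skips both to keep the three cases visible.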

Also, I just went ahead and parsed the entire root file tree. This results in a lot of duplicate text, but I don't really care.
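As a rough sketch of what walking the whole tree looks like, the snippet below collects text from every node of a tree, duplicates included. The `Node` type here is hypothetical and much simpler than the real OneNote structures. The `distinctText` method is an optional extra for anyone who does care about the duplicates; it is not something the project is stated to do.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a full tree walk; Node is an illustrative type only.
final class TreeWalkSketch {

    static final class Node {
        final String text;                      // decoded text, or null if none
        final List<Node> children = new ArrayList<>();

        Node(String text) { this.text = text; }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Depth-first walk that appends every non-null text value it finds.
    static void collect(Node node, List<String> out) {
        if (node.text != null) out.add(node.text);
        for (Node child : node.children) {
            collect(child, out);
        }
    }

    // All text in document order, duplicates included.
    static List<String> allText(Node root) {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    // Optional: same text with duplicates dropped, first occurrence kept.
    static Set<String> distinctText(Node root) {
        return new LinkedHashSet<>(allText(root));
    }
}
```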

The project is updated with test cases and the fix here: https://github.com/nddipiazza/onenote-parser-java/tree/master/src/main/java/org/apache/tika/onenote

UPDATE:

I just created the Apache Tika PR: https://github.com/apache/tika/pull/300

Nicholas DiPiazza answered Sep 27 '22 20:09
