
Parse .doc & .docx to get all text using golang?

Tags:

ms-word

go

docx

doc

How can I parse Word documents (".doc", ".docx") to get all the text using golang?

Alexander Barac asked Mar 10 '23 20:03



1 Answer

You can get some inspiration from these projects:

https://github.com/nguyenthenguyen/docx
https://github.com/opencontrol/doc-template

Basically, a DOCX file is a ZIP archive with XML files in it. The main body text is inside word/document.xml. (The older .doc is a binary format, so this does not apply to it.)

What both projects do is strip all the XML tags, leaving only the text intact. You should see if that approach suits you too.

Alexey Soshin answered Mar 19 '23 07:03
