Can anyone recommend a way to extract just the plain text from a .doc or .docx file? I've found this, but wondered whether there are any other suggestions.


How to extract just plain text from .doc & .docx files? [closed]

1 Answer

If you want just the plain text (my requirement), then all you need is:

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

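I found this on commandlinefu.

It unzips the .docx file, reads the main document part (word/document.xml), and strips all the XML tags along with any non-printable characters. All formatting is lost. Note that this only works for .docx, which is a ZIP archive of XML parts; a legacy binary .doc file cannot be unzipped this way.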
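Because the one-liner deletes every tag, paragraphs end up run together and XML entities such as &amp;amp; stay encoded. A minimal sketch of one way around that, assuming the usual `w:` namespace prefix in word/document.xml: the function names `docx_xml_to_text` and `docx_to_text` are hypothetical, not part of the original answer, and tabs and manual line breaks are still dropped.

```shell
#!/bin/sh
# docx_xml_to_text: hypothetical helper, not from the original answer.
# Reads WordprocessingML on stdin and writes plain text on stdout.
docx_xml_to_text() {
    # 1. Turn each paragraph end (</w:p>) into a newline.
    # 2. Delete all remaining XML tags.
    # 3. Decode the five predefined XML entities, &amp; last so that
    #    text like "&amp;lt;" decodes to "&lt;" rather than "<".
    sed -e 's|</w:p>|\
|g' \
        -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g' \
        -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g'
}

# docx_to_text: hypothetical wrapper; "$1" is the path to a .docx file.
docx_to_text() {
    unzip -p "$1" word/document.xml | docx_xml_to_text
}
```

With these functions defined, `docx_to_text some.docx > some.txt` would write the text to a file, one paragraph per line. Unlike the original one-liner, this sketch does not strip non-printable characters, so non-ASCII text is passed through unchanged.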

103

answered Sep 17 '22 08:09

rob

Tags:

unix

extract

text-extraction

docx

doc

docextract
