
Search MS Word files in a directory for specific content in Linux

I have a directory structure full of MS Word files and I have to search it for a particular string. Until now I have been using the following commands to search the files in a directory:

find . -exec grep -li 'search_string' {} \;

find . -name '*' -print | xargs grep 'search_string'

But this search doesn't work for MS Word files.

Is it possible to do a string search in MS Word files on Linux?

JoshMachine asked Jul 12 '12 23:07



2 Answers

I'm a translator and know next to nothing about scripting but I was so pissed off about grep not being able to scan inside Word .doc files that I worked out how to make this little shell script to use catdoc and grep to search a directory of .doc files for a given input string.

You need to install the catdoc and docx2txt packages.

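#!/bin/bash
# scandocs: search .doc and .docx files under the current directory for a string.
printf '\nWelcome to scandocs. This will search .doc AND .docx files in this directory for a given string.\n\n'
printf 'Type in the text string you want to find...\n\n'
# -r keeps any backslashes in the typed string literal.
read -r response
# Old binary .doc files: catdoc converts them to plain text.
find . -type f -name "*.doc" |
    while IFS= read -r i; do
        catdoc "$i" |
            grep --color=auto -iH --label="$i" -- "$response"
    done
# .docx files: docx2txt is fed the document on stdin and writes text to stdout.
find . -type f -name "*.docx" |
    while IFS= read -r i; do
        docx2txt < "$i" |
            grep --color=auto -iH --label="$i" -- "$response"
    done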

All improvements and suggestions welcome!
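One possible improvement, offered as a rough sketch rather than a tested replacement: take the search string as an argument instead of prompting for it, and walk the tree once for both extensions. The function name scandocs_arg and the optional directory argument are made up for this example. It assumes catdoc and docx2txt behave as in the script above, that grep is GNU grep (for --label), and that find supports -print0.

```shell
#!/bin/bash
# Hypothetical non-interactive variant: scandocs_arg pattern [directory]
scandocs_arg() {
  local pattern="$1" dir="${2:-.}" file
  if [ -z "$pattern" ]; then
    echo "Usage: scandocs_arg pattern [directory]" >&2
    return 1
  fi
  # -print0 with read -d '' keeps file names with spaces or newlines intact.
  find "$dir" -type f \( -name '*.doc' -o -name '*.docx' \) -print0 |
    while IFS= read -r -d '' file; do
      # Pick the converter by extension, then grep the extracted text.
      case "$file" in
        *.docx) docx2txt < "$file" ;;
        *)      catdoc "$file" ;;
      esac | grep --color=auto -iH --label="$file" -- "$pattern"
    done
}
```

After sourcing the file, something like scandocs_arg "wombat" ~/Documents would print each matching line prefixed with the name of the file it came from.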

Ralph answered Oct 25 '22 20:10


Here's a way to use "unzip" to print the entire contents to standard output, then pipe to "grep -q" to detect whether the desired string is present in the output. It works for docx format files.

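#!/bin/bash
PROG=$(basename "$0")

# Need a search string and at least one file.
if [ $# -lt 2 ]
then
  echo "Usage: $PROG string file.docx [file.docx...]"
  exit 1
fi

findme="$1"
shift

# Quote "$@" so file names containing spaces stay intact.
for file in "$@"
do
  # unzip -p writes every archive member to stdout; grep -q only sets the exit status.
  if unzip -p "$file" | grep -q -- "$findme"
  then
    echo "$file"
  fi
done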

Save the script as "inword" and search for "wombat" in three files with:

$ ./inword wombat file1.docx file2.docx file3.docx
file2.docx

Now you know file2.docx contains "wombat". You can get fancier by adding support for other grep options. Have fun.
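As one way to add that grep option support, here is a minimal sketch, not part of the original script. The function name inword_opts and the convention of putting grep options before a "--" separator are assumptions made for this example. Everything before "--" is handed to grep unchanged.

```shell
#!/bin/bash
# Hypothetical variant: inword_opts [grep options...] -- string file.docx [file.docx...]
inword_opts() {
  local -a grep_opts=()
  # Collect everything up to the "--" separator as grep options.
  while [ $# -gt 0 ] && [ "$1" != "--" ]; do
    grep_opts+=("$1")
    shift
  done
  # What remains must be: "--", the search string, and at least one file.
  if [ $# -lt 3 ]; then
    echo "Usage: inword_opts [grep options] -- string file.docx [file.docx...]" >&2
    return 1
  fi
  shift                       # drop the "--" separator
  local findme="$1" file
  shift
  for file in "$@"; do
    # -q keeps grep silent; only its exit status decides whether to print the name.
    if unzip -p "$file" 2>/dev/null | grep -q "${grep_opts[@]}" -- "$findme"; then
      echo "$file"
    fi
  done
}
```

After sourcing the file, inword_opts -i -- wombat file1.docx file2.docx would do a case-insensitive search, and adding -F would treat the string literally instead of as a regular expression.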

DanB answered Oct 25 '22 20:10