I tried reading a .doc file like this:
with open('file.doc', errors='ignore') as f:
    text = f.read()
It did read the file, but with a lot of junk (a .doc file is a binary format, not plain text), and I can't remove that junk because I don't know where it starts and where it ends.
I also tried installing the textract module, which says it can read any file format, but there were many dependency issues when installing it on Windows.
So I did this with the antiword command-line utility instead; my answer is below.
You can use the antiword command-line utility to do this. I know many of you will have tried it already, but I still wanted to share.
Download antiword from here.
Extract the antiword folder to C:\ and add the path C:\antiword to your PATH environment variable.
Here is a sample of how to use it, handling docx and doc files:
import os, docx2txt

def get_doc_text(filepath, file):
    text = ''  # returned as-is for files that are neither .docx nor .doc
    full_path = os.path.join(filepath, file)
    if file.endswith('.docx'):
        text = docx2txt.process(full_path)
    elif file.endswith('.doc'):
        # antiword prints plain text; redirect it into a temporary file
        # named like the .doc with an extra 'x' (it is not a real .docx)
        docx_file = full_path + 'x'
        if not os.path.exists(docx_file):
            # quote both paths so names containing spaces still work
            os.system('antiword "' + full_path + '" > "' + docx_file + '"')
            with open(docx_file) as f:
                text = f.read()
            os.remove(docx_file)  # the temporary file was only needed for reading
        else:
            # a file with the same name and a .docx extension already exists;
            # it is a different file, so we must not overwrite or delete it
            print('Info: a .docx file with the same name as the .doc already exists, so the .doc cannot be read')
    return text
Now call this function:
filepath = "D:\\input\\"
files = os.listdir(filepath)
for file in files:
    text = get_doc_text(filepath, file)
    print(text)
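As a side note, the temporary file can probably be avoided by capturing what antiword prints directly. This is only a minimal sketch, assuming antiword is on your PATH and that its output can be decoded as UTF-8 (on some setups the encoding may differ). The function name read_doc_with_antiword and the antiword_cmd parameter are my own, made up for illustration:

```python
import subprocess

def read_doc_with_antiword(doc_path, antiword_cmd="antiword"):
    """Return the text that the given command prints for doc_path.

    antiword_cmd is a hypothetical parameter: it defaults to the
    antiword executable found on PATH, but can point to a full path.
    """
    # Passing the arguments as a list avoids shell quoting problems
    # with paths that contain spaces.
    result = subprocess.run(
        [antiword_cmd, doc_path],
        capture_output=True,  # collect stdout and stderr instead of printing
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    # Assumption: the output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

You would call it as text = read_doc_with_antiword("D:\\input\\file.doc"). Because nothing is written to disk, there is no clash with an existing .docx file of the same name.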
This could be a good alternative way to read .doc files in Python on Windows.
Hope it helps, thanks.