
How to convert .docx to .txt in Python

Tags: python, ms-word

I would like to convert a large batch of MS Word files to plain text, but I have no idea how to do it in Python. I found the following code online. My files are in a local directory and are all named like cx-xxx (e.g. c1-000, c1-001, c2-000, c2-001):

from docx import Document  # provided by the python-docx package
import io
import os

def convertDocxToText(path):
    # Convert every .docx file found directly inside `path`
    for d in os.listdir(path):
        fileExtension = d.split(".")[-1]
        if fileExtension == "docx":
            docxFilename = os.path.join(path, d)
            print(docxFilename)
            document = Document(docxFilename)
            # splitext keeps any extra dots in the file name intact
            textFilename = os.path.join(path, os.path.splitext(d)[0] + ".txt")
            with io.open(textFilename, "w", encoding="utf-8") as textFile:
                for para in document.paragraphs:
                    # One paragraph per line; para.text is already a str in Python 3
                    textFile.write(para.text + "\n")

path = "/home/python/resumes/"
convertDocxToText(path)
gabgabhouse asked Jun 07 '26 14:06



2 Answers

Convert docx to txt with pypandoc:

import pypandoc

# Example file:
docxFilename = 'somefile.docx'
output = pypandoc.convert_file(docxFilename, 'plain', outputfile="somefile.txt")
assert output == ""

See the official documentation here:

https://pypi.org/project/pypandoc/
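
Since the question is about a whole folder, the single-file call above can be wrapped in a loop. This is only a sketch under some assumptions: the helper names pandoc_to_txt and convert_folder are made up for illustration, and pypandoc needs a pandoc installation to be available. The converter is passed in as a parameter so the loop can be tried out without pandoc.

```python
from pathlib import Path


def pandoc_to_txt(src, dst):
    """Convert one .docx to plain text with pypandoc (hypothetical helper)."""
    import pypandoc  # imported lazily; requires pandoc to be installed
    pypandoc.convert_file(str(src), "plain", outputfile=str(dst))


def convert_folder(folder, convert_one=pandoc_to_txt):
    """Convert every .docx directly inside `folder`; return the .txt paths written."""
    written = []
    for src in sorted(Path(folder).glob("*.docx")):
        dst = src.with_suffix(".txt")  # c1-000.docx -> c1-000.txt
        convert_one(src, dst)
        written.append(dst)
    return written
```

With the directory from the question, this would be called as convert_folder("/home/python/resumes/").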

Gustav Rasmussen answered Jun 10 '26 03:06



You can also use the library docx2txt in Python. Here's an example:

I use glob to iterate over all DOCX files in the folder. Note: I slice the ".docx" extension off the original name in order to re-use it in the TXT filename.

If there's anything I've forgotten to explain, tag me and I'll edit it in.

import docx2txt
import glob

directory = glob.glob('C:/folder_name/*.docx')

for file_name in directory:
    with open(file_name, 'rb') as infile:
        with open(file_name[:-5]+'.txt', 'w', encoding='utf-8') as outfile:
            doc = docx2txt.process(infile)
            outfile.write(doc)

print("=========")
print("All done!")
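
The slice file_name[:-5] works because ".docx" is exactly five characters long. If you would rather not depend on the extension length, here is a small sketch of the same renaming step using os.path.splitext from the standard library. The helper name txt_name is made up for illustration.

```python
import os.path


def txt_name(file_name):
    """Swap the final extension of `file_name` for .txt (hypothetical helper)."""
    root, _ext = os.path.splitext(file_name)  # splits at the last dot only
    return root + ".txt"
```

Because splitext only splits at the last dot, a name such as report.v2.docx keeps the rest of its name and becomes report.v2.txt.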
MJM answered Jun 10 '26 02:06
