
Multiple .doc to .docx file conversion using Python

Tags:

python

.doc

I want to convert all the .doc files in a particular folder to .docx files.

I tried the following code:

import subprocess
import os
for filename in os.listdir(os.getcwd()):
    if filename.endswith('.doc'):
        print(filename)
        subprocess.call(['soffice', '--headless', '--convert-to', 'docx', filename])

But it gives me an error: OSError: [Errno 2] No such file or directory

sunil pawar asked Jul 19 '16 21:07



3 Answers

Here is a solution that worked for me. The other solutions proposed did not work on my Windows 10 machine using Python 3.

from glob import glob
import re
import os
import win32com.client as win32
from win32com.client import constants

# Create list of paths to .doc files
paths = glob('C:\\path\\to\\doc\\files\\**\\*.doc', recursive=True)

def save_as_docx(path):
    # Opening MS Word
    word = win32.gencache.EnsureDispatch('Word.Application')
    doc = word.Documents.Open(path)
    doc.Activate()

    # Rename path with .docx
    new_file_abs = os.path.abspath(path)
    new_file_abs = re.sub(r'\.\w+$', '.docx', new_file_abs)

    # Save and Close
    word.ActiveDocument.SaveAs(
        new_file_abs, FileFormat=constants.wdFormatXMLDocument
    )
    doc.Close(False)

for path in paths:
    save_as_docx(path)
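
The rename step above swaps the extension with a regular expression. As a minimal alternative sketch, the same swap can be done with pathlib from the standard library. The helper name docx_path_for is hypothetical and not part of the code above; PureWindowsPath is used so that Windows-style paths are parsed the same way on any platform.

```python
from pathlib import PureWindowsPath


def docx_path_for(path):
    """Return `path` with its final extension replaced by .docx."""
    return str(PureWindowsPath(path).with_suffix(".docx"))


print(docx_path_for(r"C:\docs\report.doc"))  # prints C:\docs\report.docx
```

One difference to be aware of: for a file name with no extension at all, with_suffix appends .docx, whereas the regular expression leaves the name unchanged.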
dshefman answered Nov 02 '22 12:11


I prefer the glob module for tasks like this. Put the following in a file named doc2docx.py and make it executable with chmod +x doc2docx.py. Optionally, put the file somewhere on your $PATH to make it available "everywhere".

#!/usr/bin/env python

import glob
import subprocess

for doc in glob.iglob("*.doc"):
    subprocess.call(['soffice', '--headless', '--convert-to', 'docx', doc])

Though ideally you'd leave the expansion to the shell itself, and call doc2docx.py with the files as arguments, like doc2docx.py *.doc:

#!/usr/bin/env python

import subprocess
import sys

if len(sys.argv) < 2:
    sys.stderr.write("SYNOPSIS: %s file1 [file2] ...\n"%sys.argv[0])
    sys.exit(1)

for doc in sys.argv[1:]:
    subprocess.call(['soffice', '--headless', '--convert-to', 'docx', doc])

As requested by @pyd, to output to a target directory myoutputdir use:

#!/usr/bin/env python

import subprocess
import sys

if len(sys.argv) < 2:
    sys.stderr.write("SYNOPSIS: %s file1 [file2] ...\n"%sys.argv[0])
    sys.exit(1)

for doc in sys.argv[1:]:
    subprocess.call(['soffice', '--headless', '--convert-to', 'docx', '--outdir', 'myoutputdir', doc])
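
A note on the error in the question: when subprocess.call is given an argument list, OSError: [Errno 2] No such file or directory usually means the executable itself (here soffice) could not be found on the PATH, not that the .doc file is missing. Below is a minimal sketch, assuming Python 3, that looks the program up with shutil.which before running it. The function names are hypothetical, and the fallback command name libreoffice is an assumption about how some systems install it.

```python
import shutil
import subprocess
import sys


def find_soffice(candidates=("soffice", "libreoffice")):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    return None


def build_command(soffice, doc, outdir="."):
    """Build the same argument list the scripts above pass to subprocess.call."""
    return [soffice, "--headless", "--convert-to", "docx", "--outdir", outdir, doc]


def convert(docs, outdir="."):
    """Convert each file in docs; return 1 if soffice is missing, else 0."""
    soffice = find_soffice()
    if soffice is None:
        sys.stderr.write("soffice not found on PATH\n")
        return 1
    for doc in docs:
        subprocess.call(build_command(soffice, doc, outdir))
    return 0
```

It could be called as convert(sys.argv[1:], "myoutputdir") in place of the loop above. If the lookup fails, either install LibreOffice or pass the full path to the soffice executable instead of the bare name.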
Jan Christoph Terasa answered Nov 02 '22 14:11


If you'd rather not rely on subprocess calls, here is a version that uses the COM client. It is useful if you are targeting Windows users who have Microsoft Word but not LibreOffice installed.

#!/usr/bin/env python

import glob
import os
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.Visible = False  # keep the Word window hidden

for i, doc in enumerate(glob.iglob("*.doc")):
    in_file = os.path.abspath(doc)
    wb = word.Documents.Open(in_file)
    out_file = os.path.abspath("out{}.docx".format(i))
    wb.SaveAs2(out_file, FileFormat=16) # 16 = wdFormatDocumentDefault, which saves as .docx
    wb.Close()

word.Quit()
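
One thing to watch with the loop above: if Open or SaveAs2 raises, word.Quit() is never reached and a hidden Word process can be left running. A hedged sketch of the same loop wrapped in try/finally follows. The function name convert_all is hypothetical, and it keeps each file's original base name instead of out0.docx, out1.docx, which is a design choice of this sketch rather than of the answer.

```python
import os

# 16 is the value used above (wdFormatDocumentDefault in Word's WdSaveFormat enum).
WD_FORMAT_DOCUMENT_DEFAULT = 16


def convert_all(word, doc_paths):
    """Convert each .doc through an existing Word.Application object.

    Returns the list of output paths and always calls word.Quit().
    """
    outputs = []
    try:
        for doc in doc_paths:
            in_file = os.path.abspath(doc)
            out_file = os.path.splitext(in_file)[0] + ".docx"
            document = word.Documents.Open(in_file)
            try:
                document.SaveAs2(out_file, FileFormat=WD_FORMAT_DOCUMENT_DEFAULT)
            finally:
                document.Close()  # close the document even if saving failed
            outputs.append(out_file)
    finally:
        word.Quit()  # shut Word down even if a conversion failed
    return outputs
```

With pywin32 installed it would be called as convert_all(win32com.client.Dispatch("Word.Application"), glob.glob("*.doc")).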
James Parker answered Nov 02 '22 12:11