I want to do some resume information extraction with python. Some of the resumes come in .doc extension, and I have a code from another answer for converting those .doc files to .docx files. The code is as follows:
import win32com.client as win32
from win32com.client import constants
def save_as_docx(path):
word = win32.gencache.EnsureDispatch('Word.Application')
doc = word.Documents.Open(path)
doc.Activate ()
new_file_abs = os.path.abspath(path)
new_file_abs = re.sub(r'\.\w+$', '.docx', new_file_abs)
word.ActiveDocument.SaveAs(
new_file_abs, FileFormat=constants.wdFormatXMLDocument
)
doc.Close(False)
save_as_docx(some_path)
I don't really understand how win32com.client works, so before even using this piece of code in a real situation my question is:
Does this code upload the resume somewhere in order to convert it to .docx? Can there be any privacy or information leak problem that I should care about if I use this?
Thanks in advance.
This code does not upload the resume anywhere. I don't see a privacy concern you should have with it. You can read more about COM at https://en.wikipedia.org/wiki/Component_Object_Model but in short, COM is a system on Windows that is supposed to help different applications communicate. Almost anything a Microsoft app does can also be done with COM but it's often really hard to figure out exactly how. In Python on Windows, win32com is a way to access the native COM interfaces, and the client is local to one process and communicating with another process on your local machine.
To annotate your code:
import win32com.client as win32
from win32com.client import constants
def save_as_docx(path):
# Access the Word application
word = win32.gencache.EnsureDispatch('Word.Application')
# Cause word to open a document
doc = word.Documents.Open(path)
# Make the document the active document, so the rest of the commands apply to it
# https://learn.microsoft.com/en-us/office/vba/api/word.document.activate
doc.Activate ()
# make the path absolute, replace the `doc` with `docx`
new_file_abs = os.path.abspath(path)
new_file_abs = re.sub(r'\.\w+$', '.docx', new_file_abs)
# Save the document in the docx format at a new location
word.ActiveDocument.SaveAs(
new_file_abs, FileFormat=constants.wdFormatXMLDocument
)
# Close the document
doc.Close(False)
save_as_docx(some_path)
You can read that and see that COM is instrumenting the Word application and it allows you to automate something you might have thought can only be done via GUI. But rest assured it's all done locally on your machine, unless for some reason the Word application is uploading documents for you somehow.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With