read an MSWord file into R

Tags:

r

ms-word

Is it possible to read an MSWord 2010 file into R? I have Windows 7 and a Dell PC.

I am using the line:

my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx')

to try to read an MSWord file containing the following text:

A   20  1000    AA
B   30  1001    BB
C   10  1500    CC

I get a warning message that says:

Warning message:
In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") :
  incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'

and my.data appears to be gibberish:

# [1] "PK\003\004\024" "¤l"             "ÈFÃË‹Átí"
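The leading "PK\003\004" in that output is the signature of a ZIP archive: a .docx file is a ZIP container holding XML parts, so readLines() is returning compressed bytes rather than text. A minimal sketch for checking this in base R follows; the helper name is_zip_container is hypothetical, and the demo uses a temporary file that contains only the four signature bytes rather than a real Word document.

```r
# Hypothetical helper (not from the original post): returns TRUE when the
# file starts with the four-byte ZIP signature "PK\003\004".
is_zip_container <- function(path) {
  sig <- readBin(path, what = "raw", n = 4L)
  identical(sig, as.raw(c(0x50, 0x4b, 0x03, 0x04)))
}

# Demo file holding only the signature bytes; it is not a valid .docx.
tmp <- tempfile(fileext = ".docx")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04)), tmp)

is_zip_container(tmp)
```

On a real .docx file, unzip(path, list = TRUE) should list the parts of the archive, with the document body normally stored in word/document.xml.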

I know that with this simple example I could easily convert the MSWord file to a different format. However, my actual data files consist of complex tables that were typed decades ago and later scanned into PDF documents. The age of the original paper documents, and perhaps imperfections in the paper, typing, and/or scanning process, have left some letters and numbers unclear. So far, converting the PDF files to MSWord has been the most successful at translating the tables correctly; converting the MSWord files to Excel, rich text, etc. has not worked well. Even after conversion to MSWord, the resulting files are very complex and contain numerous errors. I thought that reading the MSWord files into R might be the most efficient way to edit and correct them.

I am aware of the 'tm' package, which I gather can read MSWord files into R, but I am a little concerned about using it because it seems to require installing third-party software.

Thank you for any suggestions.

Mark Miller asked Jun 20 '12 00:06


1 Answer

In case it helps anyone else: there is now a package, readtext, dedicated specifically to reading text data, including Word files (the newer .docx format as well). See https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html.
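As a rough sketch of how that might look, assuming readtext is installed from CRAN: readtext() returns a data frame with doc_id and text columns, and the wrapper below splits the text into non-empty lines for editing. The function name read_docx_lines is hypothetical, and the assumption that paragraphs come back separated by newlines should be checked against your own files. Table layout and tab characters may not survive extraction.

```r
# Hypothetical wrapper (not part of readtext): read a Word file and return
# its text as a character vector of non-empty lines.
read_docx_lines <- function(path) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  if (!requireNamespace("readtext", quietly = TRUE)) {
    stop("Install the package first: install.packages('readtext')")
  }
  doc <- readtext::readtext(path)  # data frame with columns doc_id and text
  # Assumption: paragraphs are separated by newlines in the returned text.
  lines <- strsplit(doc$text[1], "\n", fixed = TRUE)[[1]]
  lines[nzchar(trimws(lines))]
}

# Usage with the path from the question (adjust to your own file):
# my.data <- read_docx_lines("c:/users/mark w miller/simple R programs/test_for_r.docx")
```

Returning plain lines keeps the result easy to inspect and correct by hand before any attempt to parse it into columns.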

Amit Kohli answered Sep 28 '22 03:09