 

Read Word document (*.doc) content, including tables

I have a Word 2003 document (*.doc) and I am using PowerShell to parse its content. The document contains a few lines of text at the top, a dozen tables with differing numbers of columns, and then some more text.

I would like to process the document roughly as follows:

  1. Read the document (create the necessary objects etc.)
  2. Get each line of text
  3. If the line is not part of a table, process it as text and Write-Output it
  4. Otherwise, get the table number (by order) and parse the output based on that table's columns

Below is the PowerShell script that I have begun to write:

$objWord = New-Object -Com Word.Application
$objWord.Visible = $false
$objDocument = $objWord.Documents.Open($filename)
$paras = $objDocument.Paragraphs
foreach ($para in $paras) 
{ 
    Write-Output $para.Range.Text
}

I am not sure if Paragraphs is what I want. Is there anything more suitable for my purpose? All I am getting now is the entire content of the document. How do I control what I get? For example, I want to get a line, determine whether it is part of a table, and take an action based on which table (by number) it belongs to.

Anoop asked Oct 27 '12 23:10



1 Answer

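You can enumerate the tables in a Word document via the document's Tables collection. The Rows and Columns properties of each table allow you to determine the number of rows/columns in that table (via their Count property). Individual cells can be accessed via the table's Cell method, which takes a row index and a column index and returns a Cell object.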

Example that will print the value of the cell in the last row and last column of each table in the document:

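# Start Word through COM and open the document ($filename should be a full path).
$wd = New-Object -ComObject Word.Application
$wd.Visible = $true
$doc = $wd.Documents.Open($filename)
# Print the text of the bottom-right cell of every table.
$doc.Tables | ForEach-Object {
  $_.Cell($_.Rows.Count, $_.Columns.Count).Range.Text
}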
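
To tie this back to the steps in the question, here is a rough, untested sketch of walking the paragraphs in order and deciding for each one whether it sits inside a table. It assumes that Word is installed, that $filename holds the full path to the document, and that the tables are not nested. The variable names ($wdWithInTable, $tableIndex) and the output format are only illustrative; 12 is the numeric value of Word's wdWithInTable constant.

```powershell
# Sketch only: assumes Word is installed and $filename is the full path to the .doc file.
$wdWithInTable = 12   # numeric value of WdInformation.wdWithInTable

$wd = New-Object -ComObject Word.Application
$wd.Visible = $false
$doc = $wd.Documents.Open($filename)
try {
  foreach ($para in $doc.Paragraphs) {
    $range = $para.Range
    # Strip the trailing paragraph mark (CR) and end-of-cell mark (BEL).
    $text = $range.Text.TrimEnd([char]13, [char]7)

    if ($range.Information($wdWithInTable)) {
      # Count the tables between the start of the document and the end of this
      # paragraph; that count is the 1-based position of the current table.
      # Assumption: top-level tables only (nested tables are not handled).
      $tableIndex = $doc.Range(0, $range.End).Tables.Count
      Write-Output "table ${tableIndex}: $text"
    } else {
      Write-Output "text: $text"
    }
  }
} finally {
  # Close the document and shut Word down so no hidden instance is left running.
  $doc.Close()
  $wd.Quit()
}
```

If you need the columns of a table rather than its individual paragraphs, it is probably simpler to switch to $doc.Tables.Item($tableIndex) once you know the index and read the cells with the Cell method as shown above.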
Ansgar Wiechers answered Sep 28 '22 10:09
