
What is the best way to build a database from a MS Word document?

Please advise me on how to approach this problem:

I have a sequential list of metadata in an MS Word document. The basic idea is to write a Python algorithm that iterates over this information so that, when a query is made against a database, just the name of the PROCESS is retrieved.

Example metadata:

Process: Process Walker (1965)
Exact reference: Walker Process Equipment, Inc. v. Food Machinery Corp.

Link: http://caselaw.lp.findlaw.com/scripts/getcase.pl?court=US&vol=382&invol=

Type of procedure: Certiorari to the United States Court of Appeals for the Seventh Circuit. Parties: Walker Process Equipment, Inc.

Sector: Systems is ...

Start Date: Argued October 12-13, 1965
Summary: Food Machinery Company has initiated a process to stop or slow the entry of competitors through the use of a patent obtained by fraud. The case concerned a patent on "knee action swing diffusers" used in aeration equipment for sewage treatment systems, and the question was whether "the maintenance and enforcement of a patent obtained by fraud before the patent office" may be a basis for antitrust punishment.
Report of the evolution process: petitioner, in answer to respond...

Importance: a) First case which established an analysis for the diagnosis of dispute…

There are about 200 pages containing the information above.

My idea is to implement a Python algorithm that splits this sequence of information into records and stores them in a web database (I'm looking for an open source application for this) so that they can be consulted freely.

Jayron Soares asked Feb 23 '11 22:02




1 Answer

Check out antiword for converting the document to plain text, then use grep and sed to convert the output to a format you can pipe into your script.
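If you would rather do the second step in Python than in grep and sed, here is a minimal sketch. It assumes the document has already been converted to plain text (for example with `antiword document.doc > cases.txt`; antiword reads the older binary .doc format), that every record starts with a `Process:` line, and that the field labels are the ones in the example from the question. The function names, the table layout, and the second sample record are hypothetical and only there for illustration; SQLite stands in for whatever database you end up choosing.

```python
import re
import sqlite3

# Field labels copied from the example metadata in the question (assumed complete).
FIELD_RE = re.compile(
    r"^(Process|Exact reference|Link|Type of procedure|Parties|Sector|"
    r"Start Date|Summary|Report of the evolution process|Importance):\s*(.*)$"
)


def parse_records(text):
    """Return one dict per record; a 'Process:' line starts a new record."""
    records, current, last_field = [], None, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = FIELD_RE.match(line)
        if match:
            field, value = match.groups()
            if field == "Process":
                current = {}
                records.append(current)
            if current is not None:
                current[field] = value
                last_field = field
        elif current is not None and last_field is not None:
            # No known label: treat the line as a wrapped continuation.
            current[last_field] += " " + line
    return records


def store(records, conn):
    """Insert a few of the parsed fields into a hypothetical 'cases' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cases (process TEXT, reference TEXT, summary TEXT)"
    )
    conn.executemany(
        "INSERT INTO cases VALUES (?, ?, ?)",
        [(r.get("Process"), r.get("Exact reference"), r.get("Summary")) for r in records],
    )
    conn.commit()


# The second record is made up to show how records are separated.
sample = """\
Process: Process Walker (1965)
Exact reference: Walker Process Equipment, Inc. v. Food Machinery Corp.
Summary: Food Machinery Company has initiated a process to stop
or slow the entry of competitors.

Process: Second Example (1970)
Summary: Hypothetical second record.
"""

records = parse_records(sample)
conn = sqlite3.connect(":memory:")
store(records, conn)

# Retrieve just the process names, in insertion order.
names = [row[0] for row in conn.execute("SELECT process FROM cases ORDER BY rowid")]
print(names)
```

A label is only recognized at the start of a line, so a field that appears mid-line (like `Parties:` in the example) stays part of the preceding field's value. With 200 pages it is worth printing a handful of parsed records first to check that the labels and line wrapping in your real text match these assumptions.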

Aneurysm9 answered Sep 25 '22 16:09