So I have about 4,000 Word docs that I'm attempting to extract the text from and insert into a db table. This works swimmingly until the processor encounters a document that has the *.doc file extension but is actually an RTF. Now I know POI doesn't support RTFs, which is fine, but I do need a way to determine whether a *.doc file is actually an RTF so that I can ignore the file and continue processing.
I've tried several techniques to overcome this, including ColdFusion's MimeTypeUtils. However, it seems to base its guess at the MIME type on the file extension, and it still classifies the RTF as application/msword. Is there any other way to determine whether a *.doc is an RTF? Any help would be hugely appreciated.
The first five bytes in any RTF file should be:
{\rtf
If they aren't, it's not an RTF file.
The external links section in the Wikipedia article links to the specifications for the various versions of RTF.
Doc files (at least those since Word '97) use something called "Windows Compound Binary Format", documented in a PDF here. According to that, these Doc files start with the following sequence:
0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1
Or in older beta files:
0x0e, 0x11, 0xfc, 0x0d, 0xd0, 0xcf, 0x11, 0xe0
According to the Wikipedia article on Word, there were at least 5 different formats prior to '97.
Looking for {\rtf should be your best bet.
Good luck, hope this helps.
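Since POI is a Java library and ColdFusion runs on the JVM, the signature check described above can also be sketched in plain Java, comparing raw bytes rather than decoded characters. This is only a rough sketch, not part of POI or ColdFusion: the class name `DocSniffer`, the method `sniff`, and the returned labels are made up for illustration, and only the three signatures quoted above are checked.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: classifies a file by its leading bytes only.
final class DocSniffer {
    // "{\rtf" as raw bytes
    private static final byte[] RTF = {0x7B, 0x5C, 0x72, 0x74, 0x66};
    // Compound binary signature used by Word 97 and later
    private static final byte[] WORD_NEW = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // Signature quoted above for older beta files
    private static final byte[] WORD_BETA = {
        0x0E, 0x11, (byte) 0xFC, 0x0D,
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0};

    // True if the first `length` valid bytes of header begin with magic.
    private static boolean startsWith(byte[] header, int length, byte[] magic) {
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (header[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads at most 8 bytes, so large documents are never loaded into memory.
    static String sniff(Path file) throws IOException {
        byte[] header = new byte[8];
        int length = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (length < header.length
                    && (n = in.read(header, length, header.length - length)) > 0) {
                length += n;
            }
        }
        if (startsWith(header, length, RTF)) {
            return "Rtf";
        }
        if (startsWith(header, length, WORD_NEW)) {
            return "WordNew";
        }
        if (startsWith(header, length, WORD_BETA)) {
            return "WordBeta";
        }
        return "Unknown";
    }
}
```

A batch job could call `DocSniffer.sniff(path)` before handing each file to POI and skip anything reported as `Rtf`. Working on bytes avoids any dependence on the character encoding used to read the file, which matters for signatures containing bytes above 0x7F.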
With CF8 and compatible:
<cffunction name="IsRtfFile" returntype="Boolean" output="false">
    <cfargument name="FileName" type="String" />
    <cfreturn Left(FileRead(Arguments.FileName),5) EQ '{\rtf' />
</cffunction>
For earlier versions:
<cffunction name="IsRtfFile" returntype="Boolean" output="false">
    <cfargument name="FileName" type="String" />
    <cfset var FileData = 0 />
    <cffile variable="FileData" action="read" file="#Arguments.FileName#" />
    <cfreturn Left(FileData,5) EQ '{\rtf' />
</cffunction>
Update: A better CF8/compatible answer. To avoid loading the whole file into memory, you can do the following to load just the first few characters:
<cffunction name="IsRtfFile" returntype="Boolean" output="false">
    <cfargument name="FileName" type="String" />
    <cfset var FileData = 0 />
    <cfloop index="FileData" file="#Arguments.FileName#" characters="5">
        <cfbreak/>
    </cfloop>
    <cfreturn FileData EQ '{\rtf' />
</cffunction>
Based on the comments:
Here's a very quick sketch of how you might write a generic "what format is this" type of function. Not perfect, but it gives you the idea...
<cffunction name="determineFileFormat" returntype="String" output="false"
    hint="Determines format of file based on header of the file's data."
    >
    <cfargument name="FileName" type="String"/>
    <cfset var FileData = 0 />
    <cfset var CurFormat = 0 />
    <cfset var MaxBytes = 8 />
    <!--- Known signatures, as comma-separated lists of hex byte values. --->
    <cfset var Formats =
        { WordNew  : 'D0,CF,11,E0,A1,B1,1A,E1'
        , WordBeta : '0E,11,FC,0D,D0,CF,11,E0'
        , Rtf      : '7B,5C,72,74,66' <!--- {\rtf --->
        , Jpeg     : 'FF,D8'
        }/>
    <!--- Read only the start of the file, then stop. Note that this reads
          characters, not raw bytes, so signatures containing bytes above 7F
          may not match under some file encodings. --->
    <cfloop index="FileData" file="#Arguments.FileName#" characters="#MaxBytes#">
        <cfbreak/>
    </cfloop>
    <!--- Compare the start of the data against each signature in turn. --->
    <cfloop item="CurFormat" collection="#Formats#">
        <cfif Left( FileData , ListLen(Formats[CurFormat]) ) EQ convertToText(Formats[CurFormat]) >
            <cfreturn CurFormat />
        </cfif>
    </cfloop>
    <cfreturn "Unknown"/>
</cffunction>
<cffunction name="convertToText" returntype="String" output="false">
    <cfargument name="HexList" type="String" />
    <cfset var Result = "" />
    <cfset var CurItem = 0 />
    <cfloop index="CurItem" list="#Arguments.HexList#">
        <cfset Result &= Chr(InputBaseN(CurItem,16)) />
    </cfloop>
    <cfreturn Result />
</cffunction>
Of course, it's worth pointing out that none of this will work on 'headerless' formats, including many common text-based ones (CFM, CSS, JS, etc.).