Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using full-text search with PDF files in SQL Server 2005

I've got a strange problem with indexing PDF files in SQL Server 2005, and hope someone can help. My database has a table called MediaFile with the following fields - MediaFileId int identity pk, FileContent image, and FileExtension varchar(5). I've got my web application storing file contents in this table with no problems, and am able to use full-text searching on doc, xls, etc with no problems - the only file extension not working is PDF. When performing full-text searches on this table for words which I know exist inside of PDF files saved in the table, these files are not returned in the search results.

The OS is Windows Server 2003 SP2, and I've installed Adobe iFilter 6.0. Following the instructions on this blog entry, I executed the following commands:

exec sp_fulltext_service 'load_os_resources', 1;
exec sp_fulltext_service 'verify_signature', 0;

After this, I restarted the SQL Server, and verified that the iFilter for the PDF extensions is installed correctly by executing the following command:

select document_type, path from sys.fulltext_document_types where document_type = '.pdf' 

This returns the following information, which looks correct:

document_type: .pdf
path: C:\Program Files\Adobe\PDF IFilter 6.0\PDFFILT.dll

Then I (re)created the index on the MediaFile table, selecting FileContent as the column to index and the FileExtension as its type. The wizard creates the index and completes successfully. To test, I'm performing a search like this:

SELECT MediaFileId, FileExtension FROM MediaFile WHERE CONTAINS(*, '"house"');

This returns DOC files which contain this term, but not any PDF files, although I know that there are definitely PDF files in the table which contain the word house.

Incidentally, I got this working once for a few minutes, where the search above returned the correct PDF files, but then it just stopped working again for no apparent reason.

Any ideas as to what could be stopping SQL Server 2005 from indexing PDF's, even though Adobe iFilter is installed and appears to be loaded?

like image 225
Mun Avatar asked Nov 07 '08 16:11

Mun


1 Answers

Thanks Ivan. Managed to eventually get this working by starting everything from scratch. It seems like the order in which things are done makes a big difference, and the advice given on the linked blog to to turn off the 'load_os_resources' setting after loading the iFilter probably isn't the best option, as this will cause the iFilter to not be loaded when the SQL Server is restarted.

If I recall correctly, the sequence of steps that eventually worked for me was as follows:

  1. Ensure that the table does not have an index already (and if so, delete it)
  2. Install Adobe iFilter
  3. Execute the command exec sp_fulltext_service 'load_os_resources', 1;
  4. Execute the command exec sp_fulltext_service 'verify_signature', 0;
  5. Restart SQL Server
  6. Verify PDF iFilter is installed
  7. Create full-text index on table
  8. Do full re-index

Although this did the trick, I'm quite sure I performed these steps a few times before it eventually started working properly.

like image 172
Mun Avatar answered Oct 18 '22 18:10

Mun