Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to ignore html tags in Sql Server 2008 Full Text Search

I'm working on a knowledge base project using SQL Server 2008 Full Text Search Engine. Project included in articles and files where each article has multiple files.In those articles whole content is pure html.

Right now,I successfully created fulltext catalog and index on SQL Server 2008 and my database is version 10 compatible.

Here are my questions:

1)Is it possible to ignore html tags,more clearly texts containing in "<...>", while searching in these articles,because if i wish to search for div,table etc. there should be no result returned?

2)Articles will be updated anytime,so full text index must be updated when a new record is inserted.Is it enough to set only "TRACK CHANGES AUTOMATIC" while creating full text catalog?

3)We may use FILESTREAM feature hereafter,does SQL Server 2008 have a good performance on files using full text index? What specific document types does SQL Server 2008 good on indexing?

Regards

like image 771
Myra Avatar asked Jan 12 '10 09:01

Myra


People also ask

Does SQL Server support full text search?

Full-Text Search in SQL Server and Azure SQL Database lets users and applications run full-text queries against character-based data in SQL Server tables.

What is free text search in SQL Server?

When FREETEXT is used, the full-text query engine internally performs the following actions on the freetext_string, assigns each term a weight, and then finds the matches: Separates the string into individual words based on word boundaries (word-breaking). Generates inflectional forms of the words (stemming).


1 Answers

there is a filter for .htm and .html files.

to see if you have the filter installed run this sql:

SELECT * FROM sys.fulltext_document_types

you should see:

.htm E0CA5340-4534-11CF-B952-00AA0051FE20 C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\nlhtml.dll 12.0.6828.0 Microsoft Corporation

.html E0CA5340-4534-11CF-B952-00AA0051FE20 C:\Program Files\Microsoft SQL Server\MSSQL10.MSSQLSERVER\MSSQL\Binn\nlhtml.dll 12.0.6828.0 Microsoft Corporation

so, if you can convert your articles column to varbinary(max), then you can add a full text index on it and specify a doc type of '.html'

once the index has populated, you can verify the keywords using this sql:

SELECT display_term, column_id, document_count
FROM sys.dm_fts_index_keywords
(DB_ID('your_db'), OBJECT_ID('your_table')) 
like image 93
DC. Avatar answered Sep 21 '22 17:09

DC.