Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ignore HTML characters when searching HTML stored content PHP/MySQL?

Tags:

php

search

mysql

We are storing blocks of HTML content in MySQL (it is formatted text created in CKeditor. Ckeidor adds inline CSS styles to format the text). We are storing the database in a column called "page_content"

We want a search feature that allows users to search the text (only). However we don't want the search to return HTML characters which is what is happening right now. For example, if we search "font" we dont want the search results to return pages with the HTML styles <font>

Is there some way to ignore HTML/CSS characters when searching HTML stored content from MySQL?

like image 488
user635800 Avatar asked Jul 06 '11 01:07

user635800


2 Answers

Have you considered setting up a separate table for these searches? MySQL's full text search only works with MyISAM tables so you probably don't want to mix that the full text search with important data (unless, of course, you have some strange distaste for foreign keys and referential integrity).

The approach I've used in the past is basically this:

  • Set up a separate table with a simple (id, search_text) structure.
    • The id matches the id of the thing you're searching.
    • The search_text is everything (body text, title, author name, ...) that you want to search mashed into one blob of text.
  • Add full text indexing on the search table.
  • Update your database update process to construct the appropriate search_text string as plain text; this is where you would strip out the HTML and possibly apply some other mappings (such as expanding things like "A+" to something that the full text search will find).
  • When searching, you apply the same mappings as you apply to the searchable data and then go look in your search table for matches.

This solves your HTML problem, lets you search more than the HTML content easily, and allows you to tune the search results by weighting different components of the search text through repetition (e.g. if you want tags to be more important than the body text, just add the tags two or three times when building the search_text).

You'll have to process your text to remove or ignore the HTML. This approach lets you do it just once rather than doing it on every search.

like image 77
mu is too short Avatar answered Nov 09 '22 15:11

mu is too short


I am guessing you want to do the searching on the database? (in that case, stripping HTML tags would mean that you would have to store the content twice).

Try looking into using MYSQLs full text search functionality, in natural language mode.

http://dev.mysql.com/doc/refman/5.0/en/fulltext-natural-language.html

like image 2
JustDanyul Avatar answered Nov 09 '22 15:11

JustDanyul