I have two MySQL tables. One is a list of bad words; the other holds the domains to check against that list. I want to return only the rows whose domain does not contain ANY word from the bad words table. A few sample tables:
bad words list
+----+-------+
| id | words |
+----+-------+
| 1  | porn  |
| 2  | sex   |
+----+-------+
table of domains to compare
+----+------------+
| id | domain     |
+----+------------+
| 56 | google.com |
| 57 | sex.com    |
+----+------------+
I want to return results such as
+----+------------+
| id | domain     |
+----+------------+
| 56 | google.com |
+----+------------+
Note that these tables share no common key, so I'm not even sure this is the best method. I was using a comparison function in PHP, but it was far too slow when searching hundreds of thousands of rows.
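It is possible to get this from MySQL directly, like this (adjust the table and column names to your schema; the query assumes tables named domains and words, with the bad word stored in a word column):
    SELECT
        d.*
    FROM
        domains d
    LEFT JOIN
        words w ON (d.domain LIKE CONCAT('%', w.word, '%'))
    GROUP BY
        d.domain
    HAVING
        COUNT(w.id) < 1
However, it is not optimal and will get slower and slower with more records in both tables: a LIKE pattern with a leading wildcard cannot use an index, so every domain is compared against every word.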
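The same anti-join can also be written with NOT EXISTS, which avoids the GROUP BY. The sketch below checks that form against the sample rows from the question. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so concatenation is written with || instead of CONCAT(); the table and column names (domains, words, word) are the ones assumed by the query above, and clean_domains is a hypothetical helper name. It still compares every domain with every word, so it does not remove the scaling problem.

```python
import sqlite3


def clean_domains(conn):
    """Return (id, domain) rows whose domain contains no bad word."""
    query = """
        SELECT d.id, d.domain
        FROM domains d
        WHERE NOT EXISTS (
            SELECT 1
            FROM words w
            WHERE d.domain LIKE '%' || w.word || '%'
        )
        ORDER BY d.id
    """
    return conn.execute(query).fetchall()


# Sample data from the question, loaded into an in-memory SQLite database
# (an assumption made so the example runs without a MySQL server).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT)")
conn.execute("CREATE TABLE domains (id INTEGER PRIMARY KEY, domain TEXT)")
conn.executemany("INSERT INTO words VALUES (?, ?)", [(1, "porn"), (2, "sex")])
conn.executemany(
    "INSERT INTO domains VALUES (?, ?)",
    [(56, "google.com"), (57, "sex.com")],
)

print(clean_domains(conn))  # [(56, 'google.com')]
```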
Data like this is typically better pre-calculated at insertion time than at fetch time. You could add a column to the domains table, something like "bad_words boolean default null".
null would mean "don't know", which in some contexts could be interpreted as "unsafe to show". false means "no bad words" and true means "contains bad words".
Every time the list of bad words is updated, the column is reset to null for all rows and some background work starts processing them again, probably in a language other than SQL.
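A minimal sketch of that workflow, assuming the bad_words column described above and the same hypothetical table names (domains, words, word). It again uses Python's built-in sqlite3 module in place of MySQL so that it runs on its own; the function names and the batch size are illustrative, not part of any existing code.

```python
import sqlite3


def contains_bad_word(domain, bad_words):
    """Case-insensitive substring check against every bad word."""
    lowered = domain.lower()
    return any(word.lower() in lowered for word in bad_words)


def reset_flags(conn):
    """Call after the bad-words list changes: mark every domain as unknown."""
    conn.execute("UPDATE domains SET bad_words = NULL")
    conn.commit()


def process_unknown(conn, batch_size=1000):
    """Background step: fill in the flag for rows that are still NULL."""
    # Skip NULL or empty words, which would otherwise match every domain.
    bad_words = [row[0] for row in conn.execute("SELECT word FROM words") if row[0]]
    while True:
        rows = conn.execute(
            "SELECT id, domain FROM domains WHERE bad_words IS NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            break
        updates = [
            (int(contains_bad_word(domain, bad_words)), row_id)
            for row_id, domain in rows
        ]
        conn.executemany("UPDATE domains SET bad_words = ? WHERE id = ?", updates)
        conn.commit()


# Sample data from the question; new rows start with bad_words = NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT)")
conn.execute(
    "CREATE TABLE domains ("
    "id INTEGER PRIMARY KEY, domain TEXT, bad_words BOOLEAN DEFAULT NULL)"
)
conn.executemany("INSERT INTO words VALUES (?, ?)", [(1, "porn"), (2, "sex")])
conn.executemany(
    "INSERT INTO domains (id, domain) VALUES (?, ?)",
    [(56, "google.com"), (57, "sex.com")],
)

process_unknown(conn)
safe = conn.execute(
    "SELECT id, domain FROM domains WHERE bad_words = 0 ORDER BY id"
).fetchall()
print(safe)  # [(56, 'google.com')]
```

Once the flags are filled in, fetching the clean domains is a plain filter on bad_words, which can be indexed, instead of a wildcard comparison against every word.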