
How do I go about creating an efficient content filter for certain posts?

I've tagged this post as WordPress, but I'm not entirely sure it's WordPress-specific, so I'm posting it on StackOverflow rather than WPSE. The solution doesn't have to be WordPress-specific, simply PHP.

The Scenario
I run a fishkeeping website with a number of tropical fish Species Profiles and Glossary entries.

Our website is oriented around our profiles. They are, as you may term it, the bread and butter of the website.

What I'm hoping to achieve is that, in every species profile which mentions another species or a glossary entry, I can replace those words with a link - such as you'll see here. Ideally, I would also like this to occur in news, articles and blog posts too.

We have nearly 1400 species profiles and 1700 glossary entries. Our species profiles are often lengthy and at last count our species profiles alone numbered more than 1.7 million words of information.

What I'm Currently Attempting
Currently, I have a filter.php with a function that - I believe - does what I need it to do. The code is quite lengthy, and can be found in full here.

In addition, in my WordPress theme's functions.php, I have the following:

# ==============================================================================================
# [Filter]
#
# Every hour, using WP_Cron, `my_updated_posts` is checked. If there are new Post IDs in there,
# it will run a filter on all of the post's content. The filter will search for Glossary terms
# and scientific species names. If found, it will replace those names with links including a 
# pop-up.

    include "filter.php";

# ==============================================================================================
# When saving a post (new or edited), check to make sure it isn't a revision then add its ID
# to `my_updated_posts`.

    add_action( 'save_post', 'my_set_content_filter' );
    function my_set_content_filter( $post_id ) {
        if ( !wp_is_post_revision( $post_id ) ) {

            $post_type = get_post_type( $post_id );

            if ( $post_type == "species" || ( $post_type == "post" && in_category( "articles", $post_id ) ) || ( $post_type == "post" && in_category( "blogs", $post_id ) ) ) {
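                //get the previous value (an empty array if the option does not exist yet)
                $ids = get_option( 'my_updated_posts', array() );
                if ( !is_array( $ids ) )
                    $ids = array();

                //add new value if necessary
                if( !in_array( $post_id, $ids ) ) {
                    $ids[] = $post_id;
                    update_option( 'my_updated_posts', $ids );
                }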
            }
        }
    }

# ==============================================================================================
# Add the filter to WP_Cron.

    add_action( 'my_filter_posts_content', 'my_filter_content' );
    if( !wp_next_scheduled( 'my_filter_posts_content' ) ) {
        wp_schedule_event( time(), 'hourly', 'my_filter_posts_content' );
    }

# ==============================================================================================
# Run the filter.

    function my_filter_content() {
        //check to see if posts need to be parsed
        if ( !get_option( 'my_updated_posts' ) )
            return false;

        //parse posts
        $ids = get_option( 'my_updated_posts' );

        update_option( 'error_check', $ids );

        foreach( $ids as $v ) {
            if ( get_post_status( $v ) == 'publish' )
                run_filter( $v );

            update_option( 'error_check', "filter has run at least once" );
        }

        //make sure no values have been added while loop was running
        $id_recheck = get_option( 'my_updated_posts' );
        my_close_out_filter( $ids, $id_recheck );

        //once all options, including any added during the running of what could be a long cronjob are done, remove the value and close out
        delete_option( 'my_updated_posts' );
        update_option( 'error_check', 'working m8' );
        return true;
    }

# ==============================================================================================
# A "difference" function to make sure no new posts have been added to `my_updated_posts` whilst
# the potentially time-consuming filter was running.

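    function my_close_out_filter( $beginning_array, $end_array ) {
        //stop if the option has gone missing, otherwise this would recurse forever
        if ( !is_array( $end_array ) )
            return;

        //IDs that are queued now but were not queued when the previous pass started
        $diff = array_diff( $end_array, $beginning_array );
        if ( empty( $diff ) )
            return;

        foreach( $diff as $v ) {
            run_filter( $v );
        }
        my_close_out_filter( $end_array, get_option( 'my_updated_posts' ) );
    }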

The way this works, as (hopefully) described by the code's comments, is that each hour WordPress runs a scheduled event through WP_Cron, and that event runs the filter found above. WP_Cron is a pseudo-cron: it only fires when somebody visits the site, but that doesn't really matter here as the exact timing isn't important.

The rationale behind running it on an hourly basis was that if we tried to run it when each post was saved, it would be to the detriment of the author. Once we get guest authors involved, that is obviously not an acceptable way of going about it.

The Problem...
For months now I've been having problems getting this filter running reliably. I don't believe that the problem lies with the filter itself, but with one of the functions that enables the filter - i.e. the cron job, or the function that chooses which posts are filtered, or the function which prepares the wordlists etc. for the filter.

Unfortunately, diagnosing the problem is quite difficult (that I can see), thanks to it running in the background and only on an hourly basis. I've been trying to use WordPress' update_option function (which basically writes a simple database value) to error-check, but I haven't had much luck - and to be honest, I'm quite confused as to where the problem lies.

We ended up putting the website live without this filter working correctly. Sometimes it seems to work, sometimes it doesn't. As a result, we now have quite a few species profiles which aren't correctly filtered.

What I'd Like...
I'm basically seeking advice on the best way to go about running this filter.

Is a Cron Job the answer? I can set up a .php file which runs every day, that wouldn't be a problem. How would it determine which posts need to be filtered? What impact would it have on the server at the time it ran?

Alternatively, is a WordPress admin page the answer? If I knew how to do it, something along the lines of a page - utilising AJAX - which allowed me to select the posts to run the filter on would be perfect. There's a plugin called AJAX Regenerate Thumbnails which works like this, maybe that would be the most effective?

Considerations

  • The size of the database/information being affected/read/written
  • Which posts are filtered
  • The impact the filter has on the server, especially considering I don't seem to be able to increase the WordPress memory limit past 32 MB.
  • Is the actual filter itself efficient, effective and reliable?

This is quite a complex question and I've inevitably (as I was distracted roughly 18 times by colleagues in the process) left out some details. Please feel free to probe me for further information.

Thanks in advance,

turbonerd asked Jun 15 '12 15:06


1 Answer

Do it when the profile is created.

Try reversing the whole process. Rather than searching the content for every term in your lists, look up the content's own words in your term lists.

  1. Break the post's content into words (split on spaces) when it is entered.
  2. Eliminate duplicates, words shorter than the shortest word in the database, words longer than the longest, and words in a 'common words' list that you keep.
  3. Check the remaining words against each table. If some of your tables include phrases with spaces, do a LIKE '%text%' search; otherwise do a straight match (much faster), or even build a hash table if it really is that big a problem. (I would do this as a PHP array and cache the result somehow; no sense reinventing the wheel.)
  4. Create your links with the now dramatically smaller lists.
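
As a rough illustration of steps 1 to 3, here is a minimal sketch in plain PHP. The function names (tokenize, index_terms, matching_terms) and the term => URL array shape are made up for this example and are not taken from the asker's filter.php. It assumes ASCII terms and PHP 7.1 or later, and it indexes each term by its first word so that phrases such as "clown loach" can still be found with a hash lookup instead of a LIKE search.

```php
<?php
// Hypothetical helper: split text into lower-case alphanumeric words.
// Tags are replaced by spaces so that neighbouring words do not get glued together.
function tokenize(string $text): array
{
    $plain = strtolower(preg_replace('/<[^>]*>/', ' ', $text));
    return preg_split('/[^a-z0-9]+/', $plain, -1, PREG_SPLIT_NO_EMPTY);
}

// Build the lookup table once (and cache it):
// first word of each term => [normalised phrase => [original term, URL]].
function index_terms(array $terms): array
{
    $index = [];
    foreach ($terms as $term => $url) {
        $words = tokenize((string) $term);
        if ($words === []) {
            continue;
        }
        $index[$words[0]][implode(' ', $words)] = [(string) $term, $url];
    }
    return $index;
}

// Return only the terms (term => URL) that really occur in $content.
function matching_terms(string $content, array $index, array $stopWords): array
{
    $lengths = array_map('strlen', array_map('strval', array_keys($index)));
    if ($lengths === []) {
        return [];
    }
    $min = min($lengths);
    $max = max($lengths);

    $words    = tokenize($content);
    $haystack = ' ' . implode(' ', $words) . ' ';
    $stop     = array_flip($stopWords);

    $found = [];
    foreach (array_unique($words) as $word) {
        $len = strlen($word);
        // Step 2: drop words that are too short, too long, or too common.
        if ($len < $min || $len > $max || isset($stop[$word])) {
            continue;
        }
        // Step 3: straight hash lookup on the first word of each term.
        if (!isset($index[$word])) {
            continue;
        }
        foreach ($index[$word] as $phrase => [$term, $url]) {
            // Confirm the whole phrase is present, on word boundaries.
            if (strpos($haystack, ' ' . $phrase . ' ') !== false) {
                $found[$term] = $url;
            }
        }
    }
    return $found;
}
```

The index only has to be rebuilt when a species or glossary entry changes, so it is a natural candidate for caching; each post is then tokenised once and checked with array lookups.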

You should be able to keep this under 1 second easily, even if the lists you are checking against grow to 100,000 words. I've done exactly this, without caching the word lists, for a Bayesian filter before.

Even if the lookup is greedy and gathers terms that don't actually match (for example, "clown" will catch "clown loach"), the resulting list should be only a few to a few dozen terms with links. A find-and-replace of that many terms over a chunk of text takes no time at all.
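
The find-and-replace over the reduced list could look something like the sketch below. link_terms is a hypothetical name, the term => URL array is an assumed shape, and PHP 7.4 or later is assumed for the arrow functions. Terms are tried longest first so that "clown" does not pre-empt "clown loach", and the PCRE verbs (*SKIP)(*FAIL) are used to leave existing links and tag attributes untouched.

```php
<?php
// Hypothetical helper: wrap each known term in a link.
// $terms maps display term => URL, e.g. ['clown loach' => '/species/clown-loach'].
function link_terms(string $html, array $terms): string
{
    if ($terms === []) {
        return $html;
    }

    // Longest terms first, so the regex alternation prefers the longer phrase.
    $names = array_map('strval', array_keys($terms));
    usort($names, fn($a, $b) => strlen($b) <=> strlen($a));
    $alternation = implode('|', array_map(fn($t) => preg_quote($t, '/'), $names));

    // Case-insensitive lookup from matched text back to its URL.
    $lookup = array_change_key_case($terms, CASE_LOWER);

    // Existing <a>...</a> elements and any other tag are matched and then
    // discarded via (*SKIP)(*FAIL); only plain-text occurrences are linked.
    $pattern = '/<a\b[^>]*>.*?<\/a>(*SKIP)(*FAIL)'
             . '|<[^>]+>(*SKIP)(*FAIL)'
             . '|\b(' . $alternation . ')\b/is';

    return preg_replace_callback(
        $pattern,
        function (array $m) use ($lookup): string {
            $url = $lookup[strtolower($m[1])];
            return '<a href="' . htmlspecialchars($url) . '">' . $m[1] . '</a>';
        },
        $html
    );
}
```

Because only the handful of terms that survived the lookup go into the pattern, the regular expression stays small no matter how large the full species and glossary lists are.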

The above doesn't really address your concern over the older profiles. You don't say exactly how many posts need reprocessing, just that there is a lot of text and that it is spread over somewhere between 1400 and 3100 items (species profiles and glossary entries put together). You could process this older content in order of popularity if you have that information, or by date entered, newest first. Regardless, the best way to do this is to write a script that suspends PHP's time limit and batch-runs a load/process/save on all the posts. If each one takes about 1 second (probably much less, but as a worst case) you are talking about 3100 seconds, which is a little less than an hour.
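
A minimal sketch of such a batch run follows. batch_filter and its callbacks are hypothetical names; in WordPress the load and save callbacks would presumably wrap something like get_post() and wp_update_post(), but they are passed in here so the loop itself stays independent of WordPress.

```php
<?php
// Hypothetical helper: run load -> process -> save over a list of post IDs.
// $load($id) returns the content, $process($content) returns the filtered
// content, and $save($id, $content) writes it back.
function batch_filter(array $ids, callable $load, callable $process, callable $save, int $batchSize = 50): array
{
    // Suspend PHP's execution time limit where the host allows it.
    if (function_exists('set_time_limit')) {
        set_time_limit(0);
    }

    $stats = ['changed' => 0, 'unchanged' => 0, 'failed' => []];

    foreach (array_chunk($ids, max(1, $batchSize)) as $batch) {
        foreach ($batch as $id) {
            try {
                $before = $load($id);
                $after  = $process($before);
                if ($after !== $before) {
                    // Only write when the filter actually changed something.
                    $save($id, $after);
                    $stats['changed']++;
                } else {
                    $stats['unchanged']++;
                }
            } catch (Throwable $e) {
                // One bad post should not abort the whole run.
                $stats['failed'][$id] = $e->getMessage();
            }
        }
        // Free what we can between batches; the memory limit is tight.
        gc_collect_cycles();
    }

    return $stats;
}
```

Returning the counts and the per-post failures gives you something concrete to log after each run, which should make problems easier to trace than a single status value.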

DampeS8N answered Sep 28 '22 04:09