
Tweets exclusion

Let's pretend that I have a site where the users create topics and write threads on Fruit.

To keep the users informed of all Fruit conversations on the entire web, I collect tweets related to the specific topic and create threads based on the contents of the tweet.

It's really important that the tweets are relevant to the topic, obviously. Let's say that a user creates a topic called Apples and Oranges. I pull all tweets that contain the keywords Apples and/or Oranges.

The problem that I'm having is that some Twitter users write a tweet that includes the keywords Apples, Oranges, and Pears, for example, and it gets collected and posted as a thread to the Apples and Oranges discussion topic. This makes the users angry!

So what I need is a way to filter out any tweet that includes fruit words other than Apples and/or Oranges.

For example, if a Twitter user writes "I love Apples, Oranges, Pears, and Grapes", then that tweet should not be included.

You can only make the Twitter search query so sophisticated, so the exclusion logic will have to be performed in Ruby after the tweets are collected.

Programmatically, how would you go about solving this?

keruilin asked Feb 26 '23 00:02


2 Answers

Determine the words that are related to the topic name (pears, grapes, etc.). You can then exclude tweets that use these related words.

One way to do this is using Google Sets.

NOTE: I can't fully condone my own solution, because this service has no official API (as awesome as that would be!). If you do use this strategy, I would suggest storing the Google Set result.

require 'google_set'

twitter_search_terms = ['apples', 'oranges']
# Mocked twitter search method
tweets = search_twitter(twitter_search_terms)
# returns ["Both apples and oranges are great!", "I love Apples, Oranges, Pears, and Grapes."]

related_words = GoogleSet.for(*twitter_search_terms)
# returns ["apples", "oranges", "bananas", "peaches", "pears", "grapes", "strawberries", "plums", ...]
# Normalize case first, then drop the search terms themselves
related_words = related_words.map(&:downcase) - twitter_search_terms

good_tweets = []
bad_tweets = []
tweets.each do |tweet|
  tweet_words = tweet.downcase.split
  # Remove any non-word characters and drop tokens that end up empty
  tweet_words = tweet_words.map { |word| word.gsub(/\W+/, '') }.reject(&:empty?)

  # A tweet is good only if it shares no words with the related list
  if (tweet_words & related_words).empty?
    good_tweets << tweet
  else
    bad_tweets << tweet
  end
end

p good_tweets
# returns ["Both apples and oranges are great!"]

p bad_tweets
# returns ["I love Apples, Oranges, Pears, and Grapes."]
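
Following up on the NOTE about storing the result: here is a minimal sketch of one way to do it, using only the standard library. The RelatedWordsCache class, its file layout, and the block that stands in for the real lookup are all hypothetical names for illustration, not part of any gem.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical cache: keeps related-word lists in a JSON file, keyed by the
# search terms, so the lookup service is only queried once per topic.
class RelatedWordsCache
  def initialize(path)
    @path = path
    @data = File.exist?(path) ? JSON.parse(File.read(path)) : {}
  end

  # Returns the stored list for these terms. On a miss, calls the block to
  # fetch the list, saves it to disk, and returns it.
  def fetch(terms)
    key = terms.map(&:downcase).sort.join(',')
    return @data[key] if @data.key?(key)

    @data[key] = yield(terms).map(&:downcase).uniq
    File.write(@path, JSON.generate(@data))
    @data[key]
  end
end

# Usage sketch: the block stands in for the real lookup (an assumption),
# for example GoogleSet.for(*terms). It only runs on the first call.
Dir.mktmpdir do |dir|
  cache = RelatedWordsCache.new(File.join(dir, 'related_words.json'))
  2.times do
    p cache.fetch(%w[apples oranges]) { |terms| terms + %w[Pears Grapes] }
  end
end
```

Keying on the sorted, downcased terms means "Apples, Oranges" and "oranges, apples" share one cache entry.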

Walking Wiki answered Feb 27 '23 13:02


class Fruit < ActiveRecord::Base
  has_many :tweets
end

class Tweet < ActiveRecord::Base
  belongs_to :fruit

  # Validation rejects any tweet that does not mention exactly one known fruit
  validate :mentions_single_topic

  # If validation passes, then the fruit is parsed from the text
  before_create :parse_fruit_from_text

  def mentions_single_topic
    errors.add(:base, 'Mentions too many fruit') unless single_topic?
  end

  def single_topic?
    Fruit.where(:name => words).count == 1
  end

  def parse_fruit_from_text
    self.fruit_id = Fruit.where(:name => words).select(:id).first.id
  end

  def words
    @words ||= text.split(' ')
  end
end

# Now you can just do...
Tweet.create(json)

You'll need to account for case differences with Fruit#name. I would suggest saving all names as lowercase and then downcasing any queries. You could also write custom SQL queries using LOWER.
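
To make the case handling concrete, here is a rough sketch in plain Ruby, with no database involved. The helper names normalized_words and mentioned_fruit are made up for this example, and known_names is assumed to stand in for the lowercase values stored in the fruit name column. It also strips punctuation, since a plain split would leave "Apples," with its comma attached. The pattern is ASCII-only, which is another simplifying assumption.

```ruby
require 'set'

# Hypothetical helper: lowercases the text and extracts bare words, so that
# "Apples," in a tweet lines up with a stored name like "apples".
def normalized_words(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Returns the known fruit names that appear in the text, in order of first
# mention. known_names stands in for the stored lowercase names (an assumption).
def mentioned_fruit(text, known_names)
  names = known_names.map(&:downcase).to_set
  normalized_words(text).uniq.select { |word| names.include?(word) }
end

known = %w[apples oranges pears grapes]
p mentioned_fruit('I love Apples, Oranges, Pears, and Grapes.', known)
# prints ["apples", "oranges", "pears", "grapes"]
p mentioned_fruit('APPLES are great!', known)
# prints ["apples"]
```

In the model above, the same normalization could go into the words method so that the name lookup only ever sees lowercase, punctuation-free tokens.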

Jordan answered Feb 27 '23 12:02