Let's pretend that I have a site where the users create topics and write threads on Fruit.
To keep the users informed of all Fruit conversations on the entire web, I collect tweets related to the specific topic and create threads based on the contents of the tweet.
It's really important that the tweets are relevant to the topic, obviously. Let's say that a user creates a topic called Apples and Oranges. I pull all tweets that contain the keywords Apples and/or Oranges.
The problem that I'm having is that some Twitter users write tweets that include, for example, the keywords Apples, Oranges, and Pears, and those tweets get collected and posted as threads to the Apples and Oranges discussion topic. This makes the users angry!
So what I need is a way to filter out any tweet that includes fruit words other than Apples and/or Oranges.
For example, if a Twitter user writes "I love Apples, Oranges, Pears, and Grapes" then that tweet should not be included.
The Twitter search query can only be made so sophisticated, so the exclusion logic will have to be performed in Ruby after the tweets are collected.
Programmatically, how would you go about solving this?
Determine the words that are related to the topic name (pears, grapes, etc.). You can then exclude tweets that use these related words.
One way to do this is using Google Sets.
NOTE: I am in the unfortunate position of not fully condoning my own solution, because this service has no official API (as awesome as that would be!). If you are going to use this strategy, though, I would suggest storing the Google Sets result.
require 'google_set'
twitter_search_terms = ['apples', 'oranges']
# Mocked twitter search method
tweets = search_twitter(twitter_search_terms)
# returns ["Both apples and oranges are great!", "I love Apples, Oranges, Pears, and Grapes."]
related_words = GoogleSet.for(*twitter_search_terms)
# returns ["apples", "oranges", "bananas", "peaches", "pears", "grapes", "strawberries", "plums", ...]
# Downcase first (map, not each, so the result is kept), then drop the search terms themselves
related_words = related_words.map(&:downcase) - twitter_search_terms
good_tweets = []
bad_tweets = []
tweets.each do |tweet|
  tweet_words = tweet.downcase.split
  # Remove any non-word characters, then drop words that end up empty
  tweet_words = tweet_words.map { |word| word.gsub(/\W+/, '') }.reject(&:empty?)
  # A tweet is good only if it contains none of the related words
  if (tweet_words & related_words).empty?
    good_tweets << tweet
  else
    bad_tweets << tweet
  end
end
p good_tweets
# returns ["Both apples and oranges are great!"]
p bad_tweets
# returns ["I love Apples, Oranges, Pears, and Grapes."]
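Since `google_set` depends on an unofficial service, here is a minimal self-contained sketch of the same filtering idea with the related words supplied by hand. `STORED_RELATED_WORDS`, `excluded_words`, `tweet_words`, and `partition_tweets` are hypothetical names, and the hard-coded list is only a stand-in for a stored Google Sets result.

```ruby
require 'set'

# Hypothetical stand-in for a stored related-words result (assumption:
# in practice this would come from wherever you saved the lookup).
STORED_RELATED_WORDS = %w[apples oranges bananas peaches pears grapes strawberries plums].freeze

# Words that make a tweet off-topic: related words minus the topic's own terms.
def excluded_words(search_terms, related_words = STORED_RELATED_WORDS)
  terms = search_terms.map(&:downcase)
  related_words.map(&:downcase).reject { |word| terms.include?(word) }.to_set
end

# Lowercase the tweet and pull out word tokens, dropping punctuation.
def tweet_words(tweet)
  tweet.downcase.scan(/\w+/)
end

# Split tweets into [on_topic, off_topic].
def partition_tweets(tweets, search_terms)
  excluded = excluded_words(search_terms)
  tweets.partition { |tweet| tweet_words(tweet).none? { |word| excluded.include?(word) } }
end

tweets = [
  'Both apples and oranges are great!',
  'I love Apples, Oranges, Pears, and Grapes.'
]
good_tweets, bad_tweets = partition_tweets(tweets, %w[Apples Oranges])
p good_tweets
# prints ["Both apples and oranges are great!"]
p bad_tweets
# prints ["I love Apples, Oranges, Pears, and Grapes."]
```

Note that this matches exact tokens only, so a tweet mentioning "pear" in the singular would slip through unless the stored list also contains that form.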
class Fruit < ActiveRecord::Base
  has_many :tweets
end

class Tweet < ActiveRecord::Base
  belongs_to :fruit

  # validation catches any tweets that mention more than one fruit
  def validate
    errors.add(:base, 'Mentions too many fruit') unless single_topic?
  end

  def single_topic?
    Fruit.count(:conditions => { :name => words }).eql?(1)
  end

  # if validation passes then the fruit is parsed
  before_create :parse_fruit_from_text

  def parse_fruit_from_text
    self.fruit_id = Fruit.first(:conditions => { :name => words }, :select => 'id').id
  end

  def words
    @words ||= text.split(' ')
  end
end
# Now you can just do...
Tweet.create(json)
You'll need to account for case differences in Fruit#name. I would suggest just saving all names as lowercase and then downcasing any queries. You could also write custom SQL queries using LOWER.
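As a rough sketch of the lowercase approach in plain Ruby, without ActiveRecord: `FRUIT_NAMES`, `normalize`, and `matching_fruits` are hypothetical names, and the set is just a stand-in for a fruits table whose names are stored in lowercase.

```ruby
require 'set'

# Hypothetical stand-in for the fruits table, with names stored in lowercase.
FRUIT_NAMES = Set.new(%w[apple orange pear grape]).freeze

# Downcase and strip punctuation so "Apple," matches the stored "apple".
def normalize(word)
  word.downcase.gsub(/[^a-z0-9]/, '')
end

# Distinct stored fruit names mentioned in the text.
def matching_fruits(text)
  text.split.map { |word| normalize(word) }.select { |word| FRUIT_NAMES.include?(word) }.uniq
end

# True when the text mentions exactly one known fruit.
def single_topic?(text)
  matching_fruits(text).size == 1
end

p single_topic?('I ate an APPLE today.')
# prints true
p single_topic?('Apple or pear, who cares?')
# prints false
```

Stripping punctuation here matters as much as the downcasing, because a plain split on spaces leaves tokens like "pear," that would never equal a stored name.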