I want to try to determine the characteristics of a user's personality based on the words they type into a search box. Here's an example:
Search term: "computers"
Personality/descriptors detected: analytical, logical, systematic, methodical
I understand that this task is extremely non-trivial. I have used WordNet before, but I'm not sure if it includes adjective clouds for each noun node. Part-of-speech tagging is a beast of its own, so I'm not sure that building my own corpus and searching for adjective term-frequencies that coexist with keywords is the best idea, but I'll explain it below.
I am currently working with a Wikipedia dump, computing term frequencies for each article after removing stop words (and, or, of, to, a, etc.). My thought was to search the corpus for co-occurrences of adjectives and nouns, using WordNet to identify parts of speech (e.g. the adjective "logical" often co-occurs with the noun "computer"), and, based on the relative frequency of each stemmed adjective, judge whether it is semantically related to the noun. The potential applications are immense.
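To make the co-occurrence idea concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the tiny `ADJECTIVES` set stands in for a real part-of-speech lookup, the three-sentence corpus stands in for the Wikipedia dump, and the function names and window size are hypothetical.

```python
from collections import Counter, defaultdict

# Assumption: a hand-made adjective set standing in for a real POS lookup.
ADJECTIVES = {"logical", "analytical", "athletic", "disciplined"}
STOP_WORDS = {"and", "or", "of", "to", "a", "an", "the", "is", "are"}

def adjective_cooccurrence(sentences, window=3):
    """Count adjectives seen within `window` tokens of each non-adjective term."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [t for t in sentence.lower().split() if t not in STOP_WORDS]
        for i, tok in enumerate(tokens):
            if tok in ADJECTIVES:
                continue  # only non-adjective terms act as keys
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for neighbour in tokens[lo:i] + tokens[i + 1:hi]:
                if neighbour in ADJECTIVES:
                    counts[tok][neighbour] += 1
    return counts

def relative_frequencies(counts, noun):
    """Turn raw adjective counts for one noun into relative frequencies."""
    total = sum(counts[noun].values())
    return {adj: n / total for adj, n in counts[noun].items()} if total else {}

# Hypothetical toy corpus.
corpus = [
    "a logical computer is an analytical machine",
    "the computer is logical",
    "a disciplined athletic runner",
]
counts = adjective_cooccurrence(corpus)
print(relative_frequencies(counts, "computer"))
```

On a real corpus the raw frequencies would probably need normalising against each adjective's overall frequency, otherwise very common adjectives dominate every noun.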
Another idea is to stem the noun, search for adjectives that begin with that stem, then search for synonyms of that adjective. Example:
Search term: "computers"
Stem: "comput-"
Adjectives with stem: computational
Synonyms: ???
The problem is that nouns don't always have adjective forms, and some noun stems will match horribly wrong adjectives. *BAD* example:
Search term: "running" (technically a gerund, but still a noun)
Stem: "run-"
Adjectives with stem: runny
Synonyms: NOT THE WORDS I WANT. Would like to find words like "athletic", "motivated", "disciplined"
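Both examples can be reproduced with a short sketch. Everything here is an assumption for illustration: `crude_stem` is a toy suffix stripper rather than a real stemmer such as Porter's, and `ADJECTIVES` is a hand-picked list rather than a dictionary lookup.

```python
# Assumption: a tiny hand-picked adjective list for illustration.
ADJECTIVES = ["computational", "runny", "athletic", "logical"]

def crude_stem(word):
    """Toy suffix stripper; a stand-in for a real stemming algorithm."""
    word = word.lower()
    for suffix in ("ers", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling, e.g. "runn" -> "run".
    if len(word) > 2 and word[-1] == word[-2]:
        word = word[:-1]
    return word

def adjectives_with_stem(term):
    """Return every known adjective that begins with the term's stem."""
    stem = crude_stem(term)
    return [adj for adj in ADJECTIVES if adj.startswith(stem)]

print(adjectives_with_stem("computers"))  # the plausible match
print(adjectives_with_stem("running"))    # the misleading match
```

Prefix matching has no notion of meaning, so "running" lands on "runny" and never reaches "athletic".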
Is this something that has been done before? Do you have suggestions regarding how I might approach this? It's almost as if I'm seeking to generate adjective clouds for the "important" words in a document.
EDIT: I realize that there is no "correct" answer to this problem. I will award the bounty to whoever presents a method with the best theoretical potential.
Assuming you have some hefty computational resources to throw at this, I would suggest using something simple like Hyperspace Analogue to Language (HAL) to build up a Term X Term matrix for your dump of Wikipedia. Then, your algorithm could be something like:
This approach basically trades memory and computational efficiency for simplicity of code and data structures. Yet it should do pretty well for what I think you want. The first step will give you the adjectives most commonly associated with the query term. The vector similarity in the HAL space (step 3) will give you words that are paradigmatically related, that is, roughly substitutable for one another: if you start with an adjective of a certain sort, you should get more adjectives "like it" in terms of its relationship with the query term. That should be a fairly good proxy for the "cloud" you are looking for.
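For illustration only, here is a rough sketch of how a HAL-style matrix could be built and compared in plain Python. It is not the exact algorithm described above. It assumes the usual HAL weighting (a context word at distance d inside a window of size w contributes w - d + 1) and a word vector formed from both the "before" and "after" contexts. The toy corpus and all function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def build_hal(tokens, window=2):
    """hal[target][context] = distance-weighted count of context preceding target."""
    hal = defaultdict(lambda: defaultdict(float))
    for i, target in enumerate(tokens):
        for dist in range(1, window + 1):
            j = i - dist
            if j < 0:
                break
            hal[target][tokens[j]] += window - dist + 1  # closer words weigh more
    return hal

def word_vector(hal, word):
    """Combine the word's row (preceding contexts) and column (following contexts)."""
    vec = {("before", ctx): w for ctx, w in hal.get(word, {}).items()}
    for other, row in hal.items():
        if word in row:
            vec[("after", other)] = row[word]
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy corpus; a real run would stream the Wikipedia dump instead.
tokens = ("logical computer analytical computer logical machine "
          "analytical machine athletic runner disciplined runner").split()
hal = build_hal(tokens)
sim_machine = cosine(word_vector(hal, "computer"), word_vector(hal, "machine"))
sim_runner = cosine(word_vector(hal, "computer"), word_vector(hal, "runner"))
print(sim_machine, sim_runner)
```

In this toy corpus "computer" and "machine" share contexts while "computer" and "runner" share none, so the first similarity is higher. At Wikipedia scale the nested dictionaries would likely have to be replaced by a sparse matrix library to keep memory manageable.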