Let's say I have a list of domain names that I would like to analyze. Unless the domain name is hyphenated, I don't see a particularly easy way to "extract" the keywords used in the domain. Yet I see it done on sites such as DomainTools.com, Estibot.com, etc. For example:
ilikecheese.com becomes "i like cheese"
sanfranciscohotels.com becomes "san francisco hotels"
...
Any suggestions for accomplishing this efficiently and effectively?
Edit: I'd like to write this in PHP.
Ok, I ran the script I wrote for this SO question, with a couple of minor changes -- using log probabilities to avoid underflow, and modifying it to read multiple files as the corpus.
For my corpus I downloaded a bunch of files from Project Gutenberg -- no real method to this, I just grabbed all English-language files from etext00, etext01, and etext02.
Below are the results, I saved the top three for each combination.
expertsexchange: 97 possibilities
- experts exchange  -23.71
- expert sex change -31.46
- experts ex change -33.86

penisland: 11 possibilities
- pen island  -20.54
- penis land  -22.64
- pen is land -25.06

choosespain: 28 possibilities
- choose spain  -21.17
- chooses pain  -23.06
- choose spa in -29.41

kidsexpress: 15 possibilities
- kids express  -23.56
- kid sex press -32.65
- kids ex press -34.98

childrenswear: 34 possibilities
- children swear  -19.85
- childrens wear  -25.26
- child ren swear -32.70

dicksonweb: 8 possibilities
- dickson web  -27.09
- dick son web -30.51
- dicks on web -33.63
You might want to check out this SO question.