How to classify URLs? What are URL features? How to select and extract features from a URL?

I have just started working on a classification problem. It's a two-class problem: my trained machine learning model will have to predict whether to allow a URL or block it.

My questions are very specific:

  1. How do I classify URLs? Should I use normal text analysis methods?
  2. What are the features of a URL?
  3. How do I select and extract features from a URL?
Nasir asked Oct 20 '14 00:10


1 Answer

I assume you do not have access to the content behind the URL, so you can only extract features from the URL string itself. Otherwise, it makes more sense to use the page content.

Here are some features I would try. See this paper for more ideas:

  1. All URL components. For example, this page has the following URL:

    http://stackoverflow.com/questions/26456904/how-to-classify-urls-what-are-urls-features-how-to-select-and-extract-features

Tokens that occur in different parts of a URL should carry different weight in the classification. In this case, tokenizing the last part of the path yields strong features for this page (e.g., classify, urls, select, extract, features).

 * stackoverflow
 * com
 * questions
 * 26456904
 * how to classify urls what are urls features how to select and extract features
  2. The length of the URL.
  3. n-grams (2-grams shown as examples below):
    • stackoverflow-com
    • com-questions
    • questions-26456904
    • 26456904-how
    • how-to
    • ....
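
As a rough illustration of the three feature types above, here is a minimal Python sketch using only the standard library. The function name `url_features`, the returned dictionary keys, and the choice to split on any non-alphanumeric character are my own assumptions for the example, not something prescribed by the paper. It ignores the query string and fragment, and a port or user info in the host would simply end up as extra tokens.

```python
import re
from urllib.parse import urlparse


def url_features(url):
    """Return component tokens, length, and token 2-grams for a URL string.

    Hypothetical helper for illustration; the key names are arbitrary.
    """
    parts = urlparse(url)
    # Split host and path on any run of non-alphanumeric characters
    # (dots, slashes, hyphens, underscores, ...), dropping empty pieces.
    host_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", parts.netloc.lower()) if t]
    path_tokens = [t for t in re.split(r"[^A-Za-z0-9]+", parts.path.lower()) if t]
    tokens = host_tokens + path_tokens
    # 2-grams: each pair of adjacent tokens, joined with "-" as in the list above.
    bigrams = [f"{a}-{b}" for a, b in zip(tokens, tokens[1:])]
    return {
        "host_tokens": host_tokens,
        "path_tokens": path_tokens,
        "length": len(url),
        "bigrams": bigrams,
    }


if __name__ == "__main__":
    example = ("http://stackoverflow.com/questions/26456904/"
               "how-to-classify-urls-what-are-urls-features-"
               "how-to-select-and-extract-features")
    features = url_features(example)
    print(features["host_tokens"])
    print(features["bigrams"][:5])
```

Keeping host tokens and path tokens separate lets a classifier treat the same word differently depending on where it appears in the URL. The resulting tokens and 2-grams could then be fed to any bag-of-words style text classifier.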
greeness answered Oct 24 '22 19:10