I am new to AI. I am working on an application that performs text classification via machine learning. The application needs to classify different parts of an HTML document. For example, most webpages have a head, menu, sidebar, footer, main content, etc. I want to use a text classifier to classify these parts of an HTML document, and to identify the different types of forms on the page.
I am looking for more technical suggestions, relating to code & implementation.
I can assign labels to HTML tag attributes, such as class or id:
<div class="menu-1">
<div id="entry">
<div id="content">
<div id="footer">
<div id="comment-12">
<div id="comment-title">
For example, for the first item:
TrainClassifier(label: "Menu", value: "menu-1", attribute: "class", position-in-string: "21%", tag: "div");
Inputs:
Output
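A minimal sketch of how the inputs listed in the `TrainClassifier` call (tag, attribute, value, position in the string) could be collected from markup. It is written in Python using only the standard library's `html.parser`. The class name `AttributeSampler` and the dictionary keys are hypothetical, chosen for illustration. Position is stored as a fraction of the document length rather than a percentage string.

```python
from html.parser import HTMLParser


class AttributeSampler(HTMLParser):
    """Collects one record per class/id attribute found in the markup."""

    def __init__(self, html):
        super().__init__()
        self.samples = []
        self._length = max(len(html), 1)
        # Character offset at which each line starts, so the (line, column)
        # pair reported by getpos() can be turned into a single index.
        self._line_starts = [0]
        for line in html.split("\n")[:-1]:
            self._line_starts.append(self._line_starts[-1] + len(line) + 1)
        self.feed(html)
        self.close()

    def handle_starttag(self, tag, attrs):
        line, column = self.getpos()  # line is 1-based, column is 0-based
        index = self._line_starts[line - 1] + column
        for name, value in attrs:
            if name in ("class", "id") and value:
                self.samples.append({
                    "tag": tag,
                    "attribute": name,
                    "value": value,
                    "position": index / self._length,
                })


page = '<div class="menu-1"></div>\n<div id="footer"></div>'
for sample in AttributeSampler(page).samples:
    print(sample)
```

Each record can then be paired with a label chosen by the user (for instance "Menu") to form one training example.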
Which neural network library can take the above inputs and classify them into labels (e.g. Menu)?
Not all users can write regex or XPath, so they need an easier approach. It is therefore important to make the software intelligent: the user highlights the part of the HTML document he/she needs using a WebBrowser control, and trains the software until it can work on its own.
But I don't know how to make the software train using AI.
The AI I am looking for should be able to accept various inputs and classify on the basis of them. As I have already said, I am new to AI and don't know much about it.
It would be helpful if answers addressed the question I have asked: which library I should use, and how to implement it. Please don't post answers suggesting XPath, regex, or other methods; it often happens that you get every suggestion but the one you need.
The linear Support Vector Machine is widely regarded as one of the best text classification algorithms. We achieve a higher accuracy score of 79%, which is a 5% improvement over Naive Bayes.
Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories.
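To make that definition concrete, here is a small sketch of a multinomial Naive Bayes classifier with add-one smoothing, written from scratch in Python so it needs no external library. It treats each attribute value as a tiny document. The names `tokenize` and `NaiveBayes` are hypothetical, and the training values `nav-menu` and `site-footer` are made up for illustration; the other values come from the question.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(value):
    """Split a value such as 'comment-title' into lowercase word tokens."""
    return [t for t in re.split(r"[^a-z]+", value.lower()) if t]


class NaiveBayes:
    def __init__(self):
        self.label_counts = Counter()              # examples seen per label
        self.token_counts = defaultdict(Counter)   # token frequency per label
        self.vocabulary = set()

    def train(self, label, value):
        self.label_counts[label] += 1
        for token in tokenize(value):
            self.token_counts[label][token] += 1
            self.vocabulary.add(token)

    def classify(self, value):
        tokens = tokenize(value)
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)        # log prior
            # Add-one smoothing, with one extra slot for unseen tokens.
            denom = sum(self.token_counts[label].values()) + len(self.vocabulary) + 1
            for token in tokens:
                score += math.log((self.token_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = NaiveBayes()
for label, value in [
    ("Menu", "menu-1"), ("Menu", "nav-menu"),
    ("Footer", "footer"), ("Footer", "site-footer"),
    ("Comment", "comment-12"), ("Comment", "comment-title"),
    ("Content", "content"), ("Content", "entry"),
]:
    classifier.train(label, value)

print(classifier.classify("main-menu"))
print(classifier.classify("footer-widgets"))
```

With so few examples the result is only a toy, but the same train/classify split applies to any of the algorithms mentioned here.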
XGBoost is a gradient-boosted decision tree library. It is a supervised method: given data that has already been labeled, it learns to predict labels for new data. It works on any data that can be represented as numeric features, so it can be used for text classification too.
A CNN (convolutional neural network) applies learned convolution filters followed by a nonlinear activation function, which lets it learn features from its input. In natural language processing, text classification is the task of assigning predefined classes to free-text documents.
I suggest you look into simpler algorithms first, which are easy to understand. I can give pointers to some.
SVM (e.g. a C# port of LibSVM).
If you want to get a taste of how these work, download the WEKA toolkit:
http://sourceforge.net/projects/weka/
The commonly followed steps are usually the following:
Good Luck!
This is a very broad topic. There are a few neural network libraries out there for C#, just search for them on Stack Overflow.
You will need to perform supervised training before you can do any type of classification. In order for the ANN to understand what you are throwing at it, you will need to figure out how you will parse the HTML to get the results you are looking for.
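As a rough illustration of supervised training on parsed HTML, the sketch below trains a multiclass perceptron, about the simplest neural-network-style classifier, on labeled attribute samples. It is Python with no dependencies. The helper `features`, the class `Perceptron`, and every position value except the 21% from the question are assumptions made for the example.

```python
import re


def features(tag, attribute, value, position):
    """Turn one parsed sample into a set of binary feature names."""
    bucket = min(int(position * 4), 3)  # quarter of the document, 0..3
    feats = {f"tag={tag}", f"attr={attribute}", f"pos={bucket}"}
    feats.update(f"tok={t}" for t in re.split(r"[^a-z]+", value.lower()) if t)
    return feats


class Perceptron:
    def __init__(self):
        self.labels = []
        self.weights = {}  # label -> {feature name -> weight}

    def score(self, label, feats):
        w = self.weights[label]
        return sum(w.get(f, 0.0) for f in feats)

    def predict(self, feats):
        # max() keeps the first label on ties, so results are deterministic.
        return max(self.labels, key=lambda label: self.score(label, feats))

    def train(self, samples, epochs=10):
        for label, _ in samples:
            if label not in self.weights:
                self.labels.append(label)
                self.weights[label] = {}
        for _ in range(epochs):
            mistakes = 0
            for label, feats in samples:
                guess = self.predict(feats)
                if guess != label:
                    mistakes += 1
                    # Reward the correct label, penalise the wrong guess.
                    for f in feats:
                        self.weights[label][f] = self.weights[label].get(f, 0.0) + 1.0
                        self.weights[guess][f] = self.weights[guess].get(f, 0.0) - 1.0
            if mistakes == 0:
                break


training = [
    ("Menu", features("div", "class", "menu-1", 0.21)),
    ("Content", features("div", "id", "content", 0.40)),
    ("Footer", features("div", "id", "footer", 0.90)),
    ("Comment", features("div", "id", "comment-12", 0.70)),
]
model = Perceptron()
model.train(training)
print(model.predict(features("ul", "class", "main-menu", 0.15)))
```

In a real application the training list would come from the parts of the page the user highlights, and a library implementation would replace this hand-rolled class.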
As an example, most websites use CSS to lay out content in the browser. Other sites may use tables. You will need to train for both.
Your problem is not an easy one.