Artificial Intelligence, Text Classifier [closed]

Tags: c#, artificial-intelligence, winforms, neural-network, bayesian

I am new to AI. I am working on an application that does text classification via machine learning. The application needs to classify the different parts of an HTML document. For example, most web pages have a head, menu, sidebar, footer, main content, and so on. I want to use a text classifier to label these parts of an HTML document and to identify the different types of forms on the page.

1. It would be very helpful if anyone could provide detailed guidance on this subject.
2. Examples of similar applications would also be very helpful.

I am looking for technical suggestions relating to code and implementation.

I can assign labels to HTML tag attributes such as class or id:

<div class="menu-1">
<div id="entry">
<div id="content">
<div id="footer">
<div id="comment-12">
<div id="comment-title">

For example, for the first item:

TrainClassifier(label: "Menu", value: "menu-1", attribute: "class", position-in-string: "21%", tag: "div");

Inputs:

1. "menu-1" (attribute value)
2. "class" (attribute name)
3. "21%" (tag position in string)
4. "div" (tag name)

Output:

1. "Menu" (classified as label)

What neural network library can take the above inputs and classify them into labels (e.g. Menu)?
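
Whatever library is chosen, the inputs listed above first have to be turned into features a classifier can count. The following is only a sketch of one possible encoding, written in Python for brevity (the logic ports directly to C#). The function name `extract_features`, the feature strings, and the choice to bucket the position into quarters are all assumptions made for illustration, not part of the question.

```python
import re

def extract_features(value, attribute, position_pct, tag):
    """Turn one labelled element into a list of string features (hypothetical encoding)."""
    # Split "menu-1" into word tokens and drop digits, so that
    # "comment-12" and "comment-13" produce the same token.
    tokens = [t for t in re.split(r"[^a-z]+", value.lower()) if t]
    features = [f"token={t}" for t in tokens]
    features.append(f"attr={attribute}")
    features.append(f"tag={tag}")
    # Bucket the position (0-100) into quarters of the document: 0, 1, 2 or 3.
    bucket = min(int(position_pct) // 25, 3)
    features.append(f"pos_quarter={bucket}")
    return features

print(extract_features("menu-1", "class", 21, "div"))
# ['token=menu', 'attr=class', 'tag=div', 'pos_quarter=0']
```

Categorical string features like these suit count-based classifiers; a neural network would additionally need them mapped to numbers, for example one input per distinct feature string.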

Not all users can write regex or XPath; they need an easier approach, so it is important to make the software intelligent. The user highlights the part of the HTML document he or she needs, using a WebBrowser control, and trains the software until it can work on its own.

However, I don't know how to train the software using AI. The AI I am looking for should be able to accept various inputs and classify on the basis of them. As I have already said, I am new to AI and don't know much about it.

It would be helpful if answers addressed the question I have asked: which library I should use and how to implement it. Please don't post answers suggesting XPath, regex, or other methods; it often happens that you get every suggestion but the one you need.

asked Aug 19 '11 11:08

Milan Solanki

2 Answers

I suggest you look into simpler algorithms first, which are easy to understand. I can give pointers to some:

1. Naive Bayes (you will find many implementations, but you can write it yourself; the algorithm is simple to implement yet quite powerful).
2. Maximum Entropy (e.g. SharpMaxEnt, which is open source).
3. SVM (e.g. a C# port of LibSVM).
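
To give a feel for how little code a hand-written Naive Bayes needs, here is a minimal sketch in Python (chosen for brevity; it translates line for line to C#). It is a multinomial Naive Bayes with add-one smoothing. The class name `NaiveBayes`, its methods, and the `token=...` feature strings are hypothetical and only illustrate the idea; this is not the API of any particular library.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over string features, with add-one smoothing."""

    def __init__(self):
        self.label_counts = Counter()               # label -> number of training examples
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocabulary = set()                     # every feature seen in training

    def train(self, label, features):
        self.label_counts[label] += 1
        for f in features:
            self.feature_counts[label][f] += 1
            self.vocabulary.add(f)

    def classify(self, features):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            # Score = log P(label) + sum of log P(feature | label).
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocabulary)
            for f in features:
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("Menu", ["token=menu", "attr=class", "tag=div"])
nb.train("Footer", ["token=footer", "attr=id", "tag=div"])
nb.train("Comment", ["token=comment", "token=title", "attr=id", "tag=div"])
print(nb.classify(["token=menu", "token=top", "tag=ul"]))  # Menu
```

Working in log space avoids numeric underflow when many features are multiplied together, and the add-one smoothing keeps an unseen feature from zeroing out a label entirely.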

If you want to get a taste of how these work, download the WEKA toolkit:

```
http://sourceforge.net/projects/weka/
```
The usual steps are:
1. Identify as many attributes/features as you can (and a set of labels).
2. Collect data as a set of records { Label, Attribute1, Attribute2, Attribute3, ... }.
3. Select a minimal set of important attributes using feature selection algorithms (also available in the WEKA toolkit).
4. Train the classifier using a standard algorithm.
5. Test the system, and iterate until you reach the desired accuracy, recall, or other metrics.
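
For the testing step, accuracy and per-label recall can be computed by comparing the classifier's predictions on held-out examples with the true labels. The sketch below is in Python for brevity; the function name `accuracy_and_recall` and the sample labels are made up for illustration.

```python
from collections import Counter

def accuracy_and_recall(gold, predicted):
    """Return overall accuracy and a per-label recall dictionary."""
    totals = Counter(gold)   # label -> examples whose true label is that label
    correct = Counter()      # label -> examples of that label predicted correctly
    for g, p in zip(gold, predicted):
        if g == p:
            correct[g] += 1
    accuracy = sum(correct.values()) / len(gold)
    recall = {label: correct[label] / n for label, n in totals.items()}
    return accuracy, recall

# Hypothetical held-out data: true labels versus classifier output.
gold      = ["Menu", "Menu", "Footer", "Content", "Content"]
predicted = ["Menu", "Footer", "Footer", "Content", "Menu"]

acc, rec = accuracy_and_recall(gold, predicted)
print(acc)  # 0.6
print(rec)  # {'Menu': 0.5, 'Footer': 1.0, 'Content': 0.5}
```

The examples used here should be kept out of training; otherwise the numbers overstate how well the classifier will do on unseen pages.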
Good Luck!

answered Sep 28 '22 08:09

binit

This is a very broad topic. There are a few neural network libraries for C#; searching Stack Overflow will turn them up.

You will need to perform supervised training before you can do any type of classification. In order for the ANN to understand what you are throwing at it, you will need to figure out how you will parse the HTML to get the results you are looking for.
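
As a rough illustration of the parsing step, the sketch below walks an HTML string and records, for every `class` or `id` attribute, the tag name, attribute name, attribute value, and the tag's position as a percentage of the document length. It uses Python's standard `html.parser` module only because it is compact; a C# application would use an HTML parser for .NET instead. The names `ElementCollector` and `collect_elements` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class ElementCollector(HTMLParser):
    """Collect (tag, attribute, value, position %) for each class/id attribute."""

    def __init__(self, html):
        super().__init__()
        self.length = len(html)
        # Character offset at which each line starts, to convert the
        # parser's (line, column) position into an absolute offset.
        self.line_starts = [0]
        for i, ch in enumerate(html):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.rows = []

    def handle_starttag(self, tag, attrs):
        line, col = self.getpos()  # line is 1-based, column is 0-based
        offset = self.line_starts[line - 1] + col
        position = round(100 * offset / self.length)
        for name, value in attrs:
            if name in ("class", "id") and value:
                self.rows.append((tag, name, value, position))

def collect_elements(html):
    parser = ElementCollector(html)
    parser.feed(html)
    parser.close()
    return parser.rows

sample = (
    '<div class="menu-1"></div>\n'
    '<div id="content"></div>\n'
    '<div id="footer"></div>\n'
)
rows = collect_elements(sample)
for row in rows:
    print(row)
```

Each resulting tuple carries the same four inputs the question lists, so it can be paired with a label and used as one supervised training example.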

As an example, most websites use CSS to lay out content in the browser, while others use tables. You will need to train for both.

Your problem is not an easy one.

answered Sep 28 '22 06:09

Joshua Dale
