Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Parse Html Document Get All input fields with ID and Value

I have several thousand (ASP.net - messy html) html generated invoices that I'm trying to parse and save into a database.

Basically like:

 foreach(var htmlDoc in HtmlFolder)
 {
   foreach(var inputBox in htmlDoc)
   { 
      //Make Collection of ID and Values Insert to DB
   }
 }  

From all the other questions I've read the best tool for this type of problem is the HtmlAgilityPack, however for the life of me I can't get the documentation .chm file to work. Any ideas on how I could accomplish this with or without the Agility Pack ?

Thanks in advance

like image 982
bumble_bee_tuna Avatar asked Mar 21 '23 06:03

bumble_bee_tuna


2 Answers

An newer alternative to HtmlAgilityPack is CsQuery. See this later question on its relative performance merits, but its use of CSS selectors can't be beat:

var doc = CQ.CreateDocumentFromFile(htmldoc); //load, parse the file
var fields = doc["input"]; //get input fields with CSS
var pairs = fields.Select(node => new Tuple<string, string>(node.Id, node.Value()))
       //get values
like image 108
Arithmomaniac Avatar answered Apr 01 '23 03:04

Arithmomaniac


To get the CHM to work, you probably need to view the properties in Windows Explorer and uncheck the "Unblock Content" checkbox.

The HTML Agility Pack is quite easy when you know your way around Linq-to-XML or XPath.

Basics you'll need to know:

//import the HtmlAgilityPack
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();

// Load your data
// -----------------------------
// Load doc from file:
doc.Load(pathToFile);

// OR

// Load doc from string:
doc.LoadHtml(contentsOfFile);
// -----------------------------

// Find what you're after
// -----------------------------
// Finding things using Linq
var nodes = doc.DocumentNode.DescendantsAndSelf("input")
    .Where(node => !string.IsNullOrWhitespace(node.Id)
        && node.Attributes["value"] != null
        && !string.IsNullOrWhitespace(node.Attributes["value"].Value));

// OR

// Finding things using XPath
var nodes = doc.DocumentNode
    .SelectNodes("//input[not(@id='') and not(@value='')]");
// -----------------------------


// looping through the nodes:
// the XPath interfaces can return null when no nodes are found
if (nodes != null) 
{ 
    foreach (var node in nodes)
    {
        var id = node.Id;
        var value = node.Attributes["value"].Value;
    }
}

The easiest way to add the HtmlAgility Pack is using NuGet:

PM> Install-Package HtmlAgilityPack

like image 23
jessehouwing Avatar answered Apr 01 '23 04:04

jessehouwing