Please help. I am parsing HTML using MSHTML. My code for getting all attributes of a particular tag looks like this:
void GetAttributes(MSHTML::IHTMLElementPtr pColumnInnerElement)
{
    IHTMLDOMNode *pElemDN = NULL;
    LONG lACLength;
    MSHTML::IHTMLAttributeCollection *pAttrColl;
    IDispatch* pACDisp;
    VARIANT vACIndex;
    IDispatch* pItemDisp;
    IHTMLDOMAttribute* pItem;
    BSTR bstrName;
    VARIANT vValue;
    VARIANT_BOOL vbSpecified;
    pColumnInnerElement->QueryInterface(IID_IHTMLDOMNode, (void**)&pElemDN);
    if (pElemDN != NULL)
    {
        pElemDN->get_attributes(&pACDisp);
        pACDisp->QueryInterface(IID_IHTMLAttributeCollection, (void**)&pAttrColl);
        pAttrColl->get_length(&lACLength);
        vACIndex.vt = VT_I4;
        for (int i = 0; i < lACLength; i++)
        {
            vACIndex.lVal = i;
            pItemDisp = pAttrColl->item(&vACIndex);
            if (pItemDisp != NULL)
            {
                pItemDisp->QueryInterface(IID_IHTMLDOMAttribute, (void**)&pItem);
                pItem->get_specified(&vbSpecified);
                pItem->get_nodeName(&bstrName);
                pItem->get_nodeValue(&vValue);
                // Only attributes reported as "specified" are printed.
                if (vbSpecified)
                    cout<<_com_util::ConvertBSTRToString(bstrName)<<" :"<<_com_util::ConvertBSTRToString(vValue.bstrVal)<<endl;
                pItem->Release();
                // Release inside the null check so a NULL item is never dereferenced.
                pItemDisp->Release();
            }
        }
        pElemDN->Release();
        pACDisp->Release();
        pAttrColl->Release();
    }
}
The problem is that for the tag <input id="Switch l_id2" class="pointer" name="Switch" onclick='SetControl("Switch l",1)' type="button" value="OK"> it prints all attributes except the value attribute. The get_specified function returns false for the value attribute.
My output is:
id :Switch l_id2
class :pointer
onclick :SetControl("Switch l",1)
type :button
name :Switch
Any idea why? Also, which other attributes may have this problem?
Note: I tried the following check instead of the vbSpecified test. It shows the correct result for the value attribute.
if (strcmp(_com_util::ConvertBSTRToString(bstrName), "value") == 0)
{
    cout<<_com_util::ConvertBSTRToString(bstrName)<<" :"<<_com_util::ConvertBSTRToString(vValue.bstrVal)<<endl;
}
If you are working in managed (C++/CLI) Visual C++, you can consider the HTML Agility Pack, which is available via NuGet.
If sticking to MSHTML is not necessary, you can parse the HTML documents as XML documents instead. This only works when the markup is well-formed XML (XHTML); for example, a void element such as <input ...> has to be written as <input ... />. When that holds, you can read all the tags and attributes with a lot of flexibility. There are plenty of XML parsers available for C++.
This library looks compact, simple, and efficient (available for multiple platforms): https://github.com/leethomason/tinyxml2
Another one is: http://pugixml.org/
This link may help if you want to get rid of the MSHTML dependency: http://www.codeproject.com/Articles/30342/Remove-Microsoft-mshtml-dependency