
Getting all attributes of an HTML tag in C++ with MSHTML (IHTMLDOMAttribute)

I am parsing HTML with MSHTML. My code for getting all the attributes of a particular tag looks like this:

void GetAttributes(MSHTML::IHTMLElementPtr pColumnInnerElement)
{
    IHTMLDOMNode *pElemDN = NULL;
    LONG lACLength;
    MSHTML::IHTMLAttributeCollection *pAttrColl;
    IDispatch* pACDisp;
    VARIANT vACIndex;
    IDispatch* pItemDisp;
    IHTMLDOMAttribute* pItem;
    BSTR bstrName;
    VARIANT vValue;
    VARIANT_BOOL vbSpecified;
    pColumnInnerElement->QueryInterface(IID_IHTMLDOMNode, (void**)&pElemDN);
    if (pElemDN != NULL)
    {
        pElemDN->get_attributes(&pACDisp);
        pACDisp->QueryInterface(IID_IHTMLAttributeCollection, (void**)&pAttrColl);
        pAttrColl->get_length(&lACLength);
        vACIndex.vt = VT_I4;
        for (int i = 0; i < lACLength; i++)
        {

            vACIndex.lVal = i;
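            // item() is the #import wrapper and returns a smart pointer, so keep it
            // in one; it releases the item when it goes out of scope.
            IDispatchPtr spItemDisp = pAttrColl->item(&vACIndex);
            if (spItemDisp != NULL &&
                SUCCEEDED(spItemDisp->QueryInterface(IID_IHTMLDOMAttribute, (void**)&pItem)))
            {
                bstrName = NULL;
                VariantInit(&vValue);
                pItem->get_specified(&vbSpecified);
                pItem->get_nodeName(&bstrName);
                pItem->get_nodeValue(&vValue);

                // nodeValue is a VARIANT: read bstrVal only when it really holds a string.
                // _bstr_t frees its ANSI copy, unlike ConvertBSTRToString (caller must delete[]).
                if (vbSpecified && vValue.vt == VT_BSTR)
                    cout << (const char*)_bstr_t(bstrName) << " :" << (const char*)_bstr_t(vValue.bstrVal) << endl;

                SysFreeString(bstrName);   // release the name returned by get_nodeName
                VariantClear(&vValue);     // release the value returned by get_nodeValue
                pItem->Release();
            }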

        }
        pElemDN->Release();
        pACDisp->Release();
        pAttrColl->Release();
    }
}

The problem is that for the tag <input id="Switch l_id2" class="pointer" name="Switch" onclick='SetControl("Switch l",1)' type="button" value="OK"> it prints every attribute except value: get_specified returns false for the value attribute.

My output is

id :Switch l_id2
class :pointer
onclick :SetControl("Switch l",1)
type :button
name :Switch

Any idea why? Which other attributes might have the same problem?

Note

When I bypass the get_specified check for the value attribute, as below, the correct value is shown:

        if (strcmp(_com_util::ConvertBSTRToString(bstrName), "value") == 0)
        {
            cout<<_com_util::ConvertBSTRToString(bstrName)<<" :"<<_com_util::ConvertBSTRToString(vValue.bstrVal)<<endl;
        }

Asked by 999k, Dec 27 '22 02:12


1 Answer

If you are working in managed (C++/CLI) Visual C++, you could consider the HTML Agility Pack, which is available via NuGet.

If you do not have to stick with MSHTML, you could parse the HTML documents as XML instead, provided the markup is well-formed XML (XHTML); an unclosed tag such as the <input> in the question would have to be closed first. That gives you access to every tag and attribute with a lot of flexibility, and there are plenty of XML parsers available for C++.

This library looks compact, simple and efficient, and is available for multiple platforms: https://github.com/leethomason/tinyxml2

Another one is: http://pugixml.org/

If you want to get rid of the MSHTML dependency altogether, this article may help: http://www.codeproject.com/Articles/30342/Remove-Microsoft-mshtml-dependency
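
To illustrate what reading attributes straight from the markup involves, here is a minimal sketch that uses only the standard library. It is not tinyxml2, pugixml or any other real parser: scanStartTagAttributes is a hypothetical helper that scans a single start tag and nothing more, so entities, comments and nested content are out of scope.

```cpp
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects name/value pairs from ONE start tag such as
// <input id="a" value="OK">. Handles double-quoted, single-quoted, unquoted
// and valueless attributes; it is not a general HTML or XML parser.
inline std::vector<std::pair<std::string, std::string>>
scanStartTagAttributes(const std::string& tag) {
    std::vector<std::pair<std::string, std::string>> attrs;
    std::size_t i = 0;
    const std::size_t n = tag.size();
    auto isSpace = [](char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; };
    auto skipSpace = [&] { while (i < n && isSpace(tag[i])) ++i; };

    // Skip "<" and the tag name.
    if (i < n && tag[i] == '<') ++i;
    while (i < n && !isSpace(tag[i]) && tag[i] != '>' && tag[i] != '/') ++i;

    while (true) {
        skipSpace();
        if (i >= n || tag[i] == '>' || tag[i] == '/') break;

        // The attribute name runs up to '=', whitespace, '/' or '>'.
        const std::size_t nameStart = i;
        while (i < n && !isSpace(tag[i]) && tag[i] != '=' && tag[i] != '>' && tag[i] != '/') ++i;
        std::string name = tag.substr(nameStart, i - nameStart);

        std::string value;   // stays empty for valueless attributes
        skipSpace();
        if (i < n && tag[i] == '=') {
            ++i;
            skipSpace();
            if (i < n && (tag[i] == '"' || tag[i] == '\'')) {
                // Quoted value: read up to the matching quote character.
                const char quote = tag[i++];
                const std::size_t valueStart = i;
                while (i < n && tag[i] != quote) ++i;
                value = tag.substr(valueStart, i - valueStart);
                if (i < n) ++i;   // consume the closing quote
            } else {
                // Unquoted value: read up to whitespace or '>'.
                const std::size_t valueStart = i;
                while (i < n && !isSpace(tag[i]) && tag[i] != '>') ++i;
                value = tag.substr(valueStart, i - valueStart);
            }
        }
        attrs.emplace_back(std::move(name), std::move(value));
    }
    return attrs;
}
```

Because the scan works on the text of the tag, every attribute written in the markup is returned, including value; there is no "specified" flag to filter on. For real documents one of the libraries above is the safer choice.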

Answered by cpz, Jan 13 '23 16:01