
Standard class that parses Clipboard functionality GetData(DataFormats.Html) output

Tags: c#, clipboard

Sorry for the heading...

So I want to extract text from the Clipboard. This text is copied from a web page (in the browser). In my case it's a table with some data.

So I have extracted the data (it comes as a string) with the following code:

IDataObject iData = Clipboard.GetDataObject();

if (iData.GetDataPresent(DataFormats.Html))
{
    string s = (string)iData.GetData(DataFormats.Html);
}

And what I get from that (what s contains) is the following:

Version:0.9
StartHTML:0000000397
EndHTML:0000004086
StartFragment:0000000433
EndFragment:0000004050
SourceURL:Bla Bla Bla
<html>
<body>
<!--StartFragment--><table class="listing tickets">Bla Bla Bla</table><!--EndFragment-->
</body>
</html>

So, again. Is there any standard class that parses this data or should I simply create one myself?

Markus asked Jan 30 '13 12:01


2 Answers

OK, so the answer seems to be no, which surprised me somewhat.

Anyway, I made my own helper class, which may help you too. This is only one of many possible solutions. For my application it works well to return null if nothing is found; you may want an exception instead. Also keep in mind that this is a side project, so the code has not been extensively tested, and I make NO guarantees that it works.

using System;
using System.Globalization;
using System.Text.RegularExpressions;

public class ClipboardHtmlOutput
{
    public Double Version { get; private set; }
    public String Source { get; private set; }
    public String Input { get; private set; }
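    // StartHTML/EndHTML and StartFragment/EndFragment are UTF-8 byte offsets, not
    // character indices, so slice the UTF-8 bytes rather than calling Substring.
    public String Html { get { return Slice(startHTML, endHTML); } }
    public String Fragment { get { return Slice(startFragment, endFragment); } }

    private String Slice(int start, int end)
    {
        byte[] bytes = System.Text.Encoding.UTF8.GetBytes(Input);
        start = Math.Max(0, Math.Min(start, bytes.Length));
        end = Math.Max(start, Math.Min(end, bytes.Length));
        return System.Text.Encoding.UTF8.GetString(bytes, start, end - start);
    }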

    private int startHTML;
    private int endHTML;
    private int startFragment;
    private int endFragment;

    public static ClipboardHtmlOutput ParseString(string s)
    {
        ClipboardHtmlOutput html = new ClipboardHtmlOutput();

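        // The scheme alternation is grouped as (?:f|ht)tps? so that it matches ftp, ftps, http and https.
        // Note: this pattern still requires a SourceURL line with one of those schemes.
        string pattern = @"Version:(?<version>[0-9]+(?:\.[0-9]*)?).+StartHTML:(?<startH>\d*).+EndHTML:(?<endH>\d*).+StartFragment:(?<startF>\d+).+EndFragment:(?<endF>\d*).+SourceURL:(?<source>(?:f|ht)tps?://[-a-zA-Z0-9@:%_\+.~#?&/=]+)";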
        Match match = Regex.Match(s, pattern, RegexOptions.Singleline);

        if (match.Success)
        {
            try
            {
                html.Input = s;
                html.Version = Double.Parse(match.Groups["version"].Value, CultureInfo.InvariantCulture);
                html.Source = match.Groups["source"].Value;
                html.startHTML = int.Parse(match.Groups["startH"].Value);
                html.endHTML = int.Parse(match.Groups["endH"].Value);
                html.startFragment = int.Parse(match.Groups["startF"].Value);
                html.endFragment = int.Parse(match.Groups["endF"].Value);
            }
            catch (Exception) // e.g. an empty or out-of-range number in the header
            {
                return null;
            }
            return html;
        }
        return null;
    }
}

Usage could be something like this:

IDataObject iData = Clipboard.GetDataObject();

if (iData.GetDataPresent(DataFormats.Html))
{
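    ClipboardHtmlOutput cho = ClipboardHtmlOutput.ParseString((string)iData.GetData(DataFormats.Html));
    if (cho != null) // ParseString returns null when the header is not recognized
    {
        // Note: LoadXml only succeeds when the fragment is well-formed XML.
        XmlDocument xml = new XmlDocument();
        xml.LoadXml(cho.Fragment);
    }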
}

Markus answered Nov 28 '22 23:11


The following method is Microsoft's approach. It is part of the HtmlParser class in the 'XAML to HTML Conversion Demo' sample, which you can download here: https://code.msdn.microsoft.com/windowsdesktop/XAML-to-HTML-Conversion-ed25a674/view/SourceCode

Additional information about the 'HTML Clipboard Format' is available here: https://msdn.microsoft.com/en-us/library/aa767917(v=vs.85).aspx

/// <summary>
/// Extracts Html string from clipboard data by parsing header information in htmlDataString
/// </summary>
/// <param name="htmlDataString">
/// String representing Html clipboard data. This includes Html header
/// </param>
/// <returns>
/// String containing only the Html data part of htmlDataString, without header
/// </returns>
internal static string ExtractHtmlFromClipboardData(string htmlDataString)
{
    int startHtmlIndex = htmlDataString.IndexOf("StartHTML:");
    if (startHtmlIndex < 0)
    {
        return "ERROR: Unrecognized html header";
    }
    // TODO: We assume that indices are represented by exactly 10 digits ("0123456789".Length),
    // which could be a wrong assumption. We need to implement more flexible parsing here
    startHtmlIndex = Int32.Parse(htmlDataString.Substring(startHtmlIndex + "StartHTML:".Length, "0123456789".Length));
    if (startHtmlIndex < 0 || startHtmlIndex > htmlDataString.Length)
    {
        return "ERROR: Unrecognized html header";
    }

    int endHtmlIndex = htmlDataString.IndexOf("EndHTML:");
    if (endHtmlIndex < 0)
    {
        return "ERROR: Unrecognized html header";
    }
    // TODO: We assume that indices are represented by exactly 10 digits ("0123456789".Length),
    // which could be a wrong assumption. We need to implement more flexible parsing here
    endHtmlIndex = Int32.Parse(htmlDataString.Substring(endHtmlIndex + "EndHTML:".Length, "0123456789".Length));
    if (endHtmlIndex > htmlDataString.Length)
    {
        endHtmlIndex = htmlDataString.Length;
    }

    return htmlDataString.Substring(startHtmlIndex, endHtmlIndex - startHtmlIndex);
}
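
The TODO comments in the sample note that the fixed ten-digit assumption may not hold. Below is a minimal sketch of more flexible parsing, assuming the header keeps the `Key:digits` form shown in the question. `ClipboardHeaderSketch` and `TryReadOffset` are hypothetical names and are not part of the Microsoft sample.

```csharp
using System;
using System.Globalization;

public static class ClipboardHeaderSketch
{
    // Hypothetical helper: reads the integer that follows "<key>:" in the header,
    // accepting any number of digits (and an optional leading minus sign)
    // instead of exactly ten. Returns false when the key is missing or the
    // value is not a valid 32-bit integer.
    public static bool TryReadOffset(string data, string key, out int value)
    {
        value = 0;
        int keyIndex = data.IndexOf(key + ":", StringComparison.Ordinal);
        if (keyIndex < 0)
        {
            return false;
        }

        int start = keyIndex + key.Length + 1;
        int end = start;
        if (end < data.Length && data[end] == '-')
        {
            end++;
        }
        // Consume digits until the first non-digit character (typically the line break).
        while (end < data.Length && data[end] >= '0' && data[end] <= '9')
        {
            end++;
        }

        return int.TryParse(data.Substring(start, end - start),
            NumberStyles.AllowLeadingSign, CultureInfo.InvariantCulture, out value);
    }
}
```

With a helper like this, the two `Int32.Parse(... Substring(..., 10))` calls could be replaced by `TryReadOffset(htmlDataString, "StartHTML", out startHtmlIndex)` and its `EndHTML` counterpart, returning the error string whenever the call yields false.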

25.02.2015 Addition

My implementation follows. I had to take UTF-8 into account (see the end of the method).

/// <summary>
/// Extracts selected Html fragment string from clipboard data by parsing header information 
/// in htmlDataString
/// </summary>
/// <param name="htmlDataString">
/// String representing Html clipboard data. This includes Html header
/// </param>
/// <returns>
/// String containing only the Html selection part of htmlDataString, without header
/// </returns>
internal static string ExtractHtmlFragmentFromClipboardData(string htmlDataString)
{
    // HTML Clipboard Format
    // (https://msdn.microsoft.com/en-us/library/aa767917(v=vs.85).aspx)

    // The fragment contains valid HTML representing the area the user has selected. This 
    // includes the information required for basic pasting of an HTML fragment, as follows:
    //  - Selected text. 
    //  - Opening tags and attributes of any element that has an end tag within the selected text. 
    //  - End tags that match the included opening tags. 

    // The fragment should be preceded and followed by the HTML comments <!--StartFragment--> and 
    // <!--EndFragment--> (no space allowed between the !-- and the text) to indicate where the 
    // fragment starts and ends. So the start and end of the fragment are indicated by these 
    // comments as well as by the StartFragment and EndFragment byte counts. Though redundant, 
    // this makes it easier to find the start of the fragment (from the byte count) and mark the 
    // position of the fragment directly in the HTML tree.

    // Byte count from the beginning of the clipboard to the start of the fragment.
    int startFragmentIndex = htmlDataString.IndexOf("StartFragment:");
    if (startFragmentIndex < 0)
    {
        return "ERROR: Unrecognized html header";
    }
    // TODO: We assume that indices are represented by exactly 10 digits ("0123456789".Length),
    // which could be a wrong assumption. We need to implement more flexible parsing here
    startFragmentIndex = Int32.Parse(htmlDataString.Substring(startFragmentIndex + "StartFragment:".Length, 10));

    // CF_HTML is entirely text format and uses the transformation format UTF-8, so the
    // header offsets are byte counts and are validated against the UTF-8 byte length.
    byte[] bytes = Encoding.UTF8.GetBytes(htmlDataString);
    if (startFragmentIndex < 0 || startFragmentIndex > bytes.Length)
    {
        return "ERROR: Unrecognized html header";
    }

    // Byte count from the beginning of the clipboard to the end of the fragment.
    int endFragmentIndex = htmlDataString.IndexOf("EndFragment:");
    if (endFragmentIndex < 0)
    {
        return "ERROR: Unrecognized html header";
    }
    // TODO: Same fixed-width assumption as above
    endFragmentIndex = Int32.Parse(htmlDataString.Substring(endFragmentIndex + "EndFragment:".Length, 10));
    if (endFragmentIndex > bytes.Length)
    {
        endFragmentIndex = bytes.Length;
    }
    if (endFragmentIndex < startFragmentIndex)
    {
        return "ERROR: Unrecognized html header";
    }

    return Encoding.UTF8.GetString(bytes, startFragmentIndex, endFragmentIndex - startFragmentIndex);
}
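
To illustrate why the byte conversion at the end of the method matters, here is a small sketch with hypothetical helper names (`Utf8OffsetSketch`, `SliceByBytes`, `ByteOffsetOf`). It assumes the header offsets are UTF-8 byte counts, as the format description quoted in the comments says. As soon as non-ASCII characters precede the fragment, byte offsets and string indices diverge.

```csharp
using System;
using System.Text;

public static class Utf8OffsetSketch
{
    // Slices text by UTF-8 byte offsets, which is how the header expresses positions.
    public static string SliceByBytes(string text, int startByte, int endByte)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes, startByte, endByte - startByte);
    }

    // Computes the UTF-8 byte offset that corresponds to a character index.
    // Assumes charIndex does not split a surrogate pair.
    public static int ByteOffsetOf(string text, int charIndex)
    {
        return Encoding.UTF8.GetByteCount(text.Substring(0, charIndex));
    }
}
```

For the string `"\u00e4\u00f6<b>x</b>"`, the `<b>` tag starts at character index 2 but at byte offset 4, because each of the two leading characters takes two bytes in UTF-8. `SliceByBytes(text, 4, 12)` therefore yields `<b>x</b>`, whereas `text.Substring(4)` would start in the middle of the tag.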

zznobody answered Nov 28 '22 21:11