Sorry for the heading...
I want to extract text from the clipboard. The text is copied from a web page in the browser; in my case it is a table with some data.
I extract the data (it arrives as a string) with the following code:
IDataObject iData = Clipboard.GetDataObject();
if (iData.GetDataPresent(DataFormats.Html))
{
    string s = (string)iData.GetData(DataFormats.Html);
}
And what I get from that (what s contains) is the following:
Version:0.9
StartHTML:0000000397
EndHTML:0000004086
StartFragment:0000000433
EndFragment:0000004050
SourceURL:Bla Bla Bla
<html>
<body>
<!--StartFragment--><table class="listing tickets">Bla Bla Bla</table><!--EndFragment-->
</body>
</html>
So, again: is there a standard class that parses this data, or should I simply write one myself?
OK, so the answer seems to be no, which surprised me somewhat.
Anyway, I wrote my own helper class, which may help you too. It is only one of many possible solutions. For my application it works well to return null if nothing is found; you may prefer an exception instead. Also keep in mind that this is a side project, so the code has not been tested extensively and I make NO guarantees that it works.
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public class ClipboardHtmlOutput
{
    public Double Version { get; private set; }
    public String Source { get; private set; }
    public String Input { get; private set; }
    // Note: the header offsets are UTF-8 byte counts, so Substring (which counts
    // characters) is only exact when the text before the end offset is pure ASCII.
    public String Html { get { return Input.Substring(startHTML, Math.Min(endHTML - startHTML, Input.Length - startHTML)); } }
    public String Fragment { get { return Input.Substring(startFragment, (endFragment - startFragment)); } }
    private int startHTML;
    private int endHTML;
    private int startFragment;
    private int endFragment;

    // Parses the clipboard header; returns null if the header is not recognized.
    public static ClipboardHtmlOutput ParseString(string s)
    {
        ClipboardHtmlOutput html = new ClipboardHtmlOutput();
        // The scheme alternation is grouped as (?:f|ht)tps? so that it matches
        // ftp, ftps, http and https instead of a lone "f".
        string pattern = @"Version:(?<version>[0-9]+(?:\.[0-9]*)?).+StartHTML:(?<startH>\d*).+EndHTML:(?<endH>\d*).+StartFragment:(?<startF>\d+).+EndFragment:(?<endF>\d*).+SourceURL:(?<source>(?:f|ht)tps?://[-a-zA-Z0-9@:%_\+.~#?&//=]+)";
        Match match = Regex.Match(s, pattern, RegexOptions.Singleline);
        if (match.Success)
        {
            try
            {
                html.Input = s;
                html.Version = Double.Parse(match.Groups["version"].Value, CultureInfo.InvariantCulture);
                html.Source = match.Groups["source"].Value;
                html.startHTML = int.Parse(match.Groups["startH"].Value);
                html.endHTML = int.Parse(match.Groups["endH"].Value);
                html.startFragment = int.Parse(match.Groups["startF"].Value);
                html.endFragment = int.Parse(match.Groups["endF"].Value);
            }
            catch (Exception)
            {
                // An offset was empty or out of range: treat the header as invalid.
                return null;
            }
            return html;
        }
        return null;
    }
}
Usage could be something like this:
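IDataObject iData = Clipboard.GetDataObject();
if (iData.GetDataPresent(DataFormats.Html))
{
    ClipboardHtmlOutput cho = ClipboardHtmlOutput.ParseString((string)iData.GetData(DataFormats.Html));
    if (cho != null) // ParseString returns null when the header cannot be parsed
    {
        XmlDocument xml = new XmlDocument();
        xml.LoadXml(cho.Fragment); // throws XmlException if the fragment is not well-formed XML
    }
}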
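One caveat with the usage above: the fragment is HTML, and browsers do not promise well-formed XML (an unclosed <br> or an &nbsp; entity is enough to make LoadXml throw). A minimal sketch of a defensive wrapper follows; the class and method names (FragmentXml, TryLoad) are made up for illustration and are not part of the helper class above.

```csharp
using System.Xml;

public static class FragmentXml
{
    // Hypothetical helper: returns a loaded XmlDocument, or null when the
    // fragment is empty or is not well-formed XML.
    public static XmlDocument TryLoad(string fragment)
    {
        if (string.IsNullOrEmpty(fragment))
        {
            return null;
        }
        try
        {
            XmlDocument xml = new XmlDocument();
            xml.LoadXml(fragment);
            return xml;
        }
        catch (XmlException)
        {
            // Typical causes: unclosed tags, undeclared HTML entities,
            // or more than one root element in the fragment.
            return null;
        }
    }
}
```

If the fragments you copy are often not well-formed, an HTML parser is probably a better fit than XmlDocument.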
The following method is Microsoft's approach. It is part of the HtmlParser class in the sample 'XAML to HTML Conversion Demo', which you can download here: https://code.msdn.microsoft.com/windowsdesktop/XAML-to-HTML-Conversion-ed25a674/view/SourceCode.
Additional information about the 'HTML Clipboard Format' can be found here: https://msdn.microsoft.com/en-us/library/aa767917(v=vs.85).aspx
/// <summary>
/// Extracts Html string from clipboard data by parsing header information in htmlDataString
/// </summary>
/// <param name="htmlDataString">
/// String representing Html clipboard data. This includes Html header
/// </param>
/// <returns>
/// String containing only the Html data part of htmlDataString, without header
/// </returns>
internal static string ExtractHtmlFromClipboardData(string htmlDataString)
{
    int startHtmlIndex = htmlDataString.IndexOf("StartHTML:");
    if (startHtmlIndex < 0)
    {
        return "ERROR: Unrecognized html header";
    }
    // TODO: We assume that indices are represented by exactly 10 digits ("0123456789".Length),
    // which could be a wrong assumption. We need to implement more flexible parsing here
    startHtmlIndex = Int32.Parse(htmlDataString.Substring(startHtmlIndex + "StartHTML:".Length, "0123456789".Length));
    if (startHtmlIndex < 0 || startHtmlIndex > htmlDataString.Length)
    {
        return "ERROR: Unrecognized html header";
    }
    int endHtmlIndex = htmlDataString.IndexOf("EndHTML:");
    if (endHtmlIndex < 0)
    {
        return "ERROR: Unrecognized html header";
    }
    // TODO: We assume that indices are represented by exactly 10 digits ("0123456789".Length),
    // which could be a wrong assumption. We need to implement more flexible parsing here
    endHtmlIndex = Int32.Parse(htmlDataString.Substring(endHtmlIndex + "EndHTML:".Length, "0123456789".Length));
    if (endHtmlIndex > htmlDataString.Length)
    {
        endHtmlIndex = htmlDataString.Length;
    }
    return htmlDataString.Substring(startHtmlIndex, endHtmlIndex - startHtmlIndex);
}
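The TODO comments in the sample point out that the offsets are assumed to be exactly 10 characters wide. As a rough sketch of the "more flexible parsing" they ask for, a header value can be read with a regular expression that accepts any number of digits. The names ClipboardHeader and ReadOffset are invented for this example, and accepting a leading minus sign is an assumption on my part (some producers reportedly write -1 for an unused offset).

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ClipboardHeader
{
    // Hypothetical helper: reads "<key>:<number>" from the start of a header
    // line without assuming a fixed digit count. Returns null when the key is
    // missing or the number does not fit into an int.
    public static int? ReadOffset(string htmlDataString, string key)
    {
        // Multiline makes ^ match at the start of every line, so only keys
        // that begin a line are considered; the first such line wins.
        Match m = Regex.Match(
            htmlDataString,
            "^" + Regex.Escape(key) + @":(-?\d+)",
            RegexOptions.Multiline);
        if (!m.Success)
        {
            return null;
        }
        int value;
        bool ok = int.TryParse(
            m.Groups[1].Value,
            NumberStyles.AllowLeadingSign,
            CultureInfo.InvariantCulture,
            out value);
        return ok ? (int?)value : null;
    }
}
```

For example, ReadOffset(data, "StartHTML") on the header shown in the question would give 397, regardless of how many leading zeros the browser wrote.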
25.02.2015 Addition
My implementation follows. I had to take care of UTF-8 (see the end of the method).
/// <summary>
/// Extracts selected Html fragment string from clipboard data by parsing header information
/// in htmlDataString
/// </summary>
/// <param name="htmlDataString">
/// String representing Html clipboard data. This includes Html header
/// </param>
/// <returns>
/// String containing only the Html selection part of htmlDataString, without header
/// </returns>
internal static string ExtractHtmlFragmentFromClipboardData(string htmlDataString)
{
    // HTML Clipboard Format
    // (https://msdn.microsoft.com/en-us/library/aa767917(v=vs.85).aspx)
    // The fragment contains valid HTML representing the area the user has selected. This
    // includes the information required for basic pasting of an HTML fragment, as follows:
    // - Selected text.
    // - Opening tags and attributes of any element that has an end tag within the selected text.
    // - End tags that match the included opening tags.
    // The fragment should be preceded and followed by the HTML comments <!--StartFragment--> and
    // <!--EndFragment--> (no space allowed between the !-- and the text) to indicate where the
    // fragment starts and ends. So the start and end of the fragment are indicated by these
    // comments as well as by the StartFragment and EndFragment byte counts. Though redundant,
    // this makes it easier to find the start of the fragment (from the byte count) and mark the
    // position of the fragment directly in the HTML tree.
    // Byte count from the beginning of the clipboard to the start of the fragment.
    int startFragmentIndex = htmlDataString.IndexOf("StartFragment:");
    if (startFragmentIndex < 0)
    {
        return "ERROR: Unrecognized html header";
    }
    // TODO: We assume that indices are represented by exactly 10 digits ("0123456789".Length),
    // which could be a wrong assumption. We need to implement more flexible parsing here
    startFragmentIndex = Int32.Parse(htmlDataString.Substring(startFragmentIndex + "StartFragment:".Length, 10));
    // Byte count from the beginning of the clipboard to the end of the fragment.
    int endFragmentIndex = htmlDataString.IndexOf("EndFragment:");
    if (endFragmentIndex < 0)
    {
        return "ERROR: Unrecognized html header";
    }
    // TODO: We assume that indices are represented by exactly 10 digits ("0123456789".Length),
    // which could be a wrong assumption. We need to implement more flexible parsing here
    endFragmentIndex = Int32.Parse(htmlDataString.Substring(endFragmentIndex + "EndFragment:".Length, 10));
    // CF_HTML is entirely text format and uses the transformation format UTF-8, so the
    // offsets count bytes. Validate them against the byte length, not the character count.
    byte[] bytes = Encoding.UTF8.GetBytes(htmlDataString);
    if (startFragmentIndex < 0 || startFragmentIndex > bytes.Length)
    {
        return "ERROR: Unrecognized html header";
    }
    if (endFragmentIndex > bytes.Length)
    {
        endFragmentIndex = bytes.Length;
    }
    return Encoding.UTF8.GetString(bytes, startFragmentIndex, endFragmentIndex - startFragmentIndex);
}
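To illustrate why the UTF-8 step matters: the header offsets are byte counts, while String.Substring counts characters, so the two drift apart as soon as a non-ASCII character appears before the fragment. The small sketch below isolates the byte-based slice; OffsetDemo and SliceBytes are illustrative names only.

```csharp
using System;
using System.Text;

public static class OffsetDemo
{
    // Slices text by UTF-8 byte offsets (start inclusive, end exclusive),
    // which is how the CF_HTML header counts positions.
    public static string SliceBytes(string text, int start, int end)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes, start, end - start);
    }
}
```

Take the string "\u00e9<b>x</b>" (an e with acute accent followed by a bold x). The accented letter is one character but two UTF-8 bytes, so the markup starts at byte offset 2 and ends at byte offset 10. SliceBytes(text, 2, 10) returns "<b>x</b>", whereas text.Substring(2) starts one character too late and returns "b>x</b>".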