I'm wondering if there's a function in Ruby like is_xml?(string) to identify if a given string is XML formatted.
Nokogiri's parse uses a simple regex test looking for <html> in an attempt to determine if the data to be parsed is HTML or XML:
string =~ /^\s*<[^Hh>]*html/ # Probably html
Something similar, looking for the XML declaration, would be a starting point:
string = '<?xml version="1.0"?><foo><bar></bar></foo>'
string.strip[/\A<\?xml/]
=> "<?xml"
If that returns anything other than nil, the string contains the XML declaration. It's important to test for this because an empty string will fool the next steps:
Nokogiri::XML('').errors.empty?
=> true
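The declaration test above could be wrapped in a small predicate. This is only a sketch: the method name xml_declaration? is made up for illustration, and it treats nil like an empty string. It also follows the answer's approach of ignoring leading whitespace, and it only says the string starts like an XML document, not that the rest is well-formed.

```ruby
# Hypothetical helper (name is an assumption, not part of Ruby or Nokogiri).
# Returns true when the string, ignoring leading whitespace, begins with
# an XML declaration such as <?xml version="1.0"?>.
def xml_declaration?(string)
  # to_s turns nil into "", lstrip drops leading whitespace,
  # \A anchors the match at the very start of the string.
  # \s after "xml" avoids matching things like <?xml-stylesheet ...?>.
  string.to_s.lstrip.match?(/\A<\?xml\s/)
end

puts xml_declaration?('<?xml version="1.0"?><foo><bar></bar></foo>')
puts xml_declaration?('<foo></foo>')
puts xml_declaration?('')
```

Documents are allowed to omit the declaration, so a false result here does not mean the string is not XML.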
Nokogiri also has the errors method, which will return an array of errors after attempting to parse a document that is malformed. Testing that for any size would help:
Nokogiri::XML('<foo>').errors
=> [#<Nokogiri::XML::SyntaxError: Premature end of data in tag foo line 1>]
Nokogiri::XML('<foo>').errors.empty?
=> false
Nokogiri::XML(string).errors.empty?
=> true
That returns true if the document is well-formed (syntactically valid).
I just tested Nokogiri to see if it could tell the difference between a regular string vs. true XML:
[2] (pry) main: 0> doc = Nokogiri::XML('foo').errors
[
[0] #<Nokogiri::XML::SyntaxError: Start tag expected, '<' not found>
]
So, you can loop through your files and sort them into XML and non-XML easily:
require 'nokogiri'
[
'',
'foo',
'<xml></xml>'
].group_by{ |s| (s.strip > '') && Nokogiri::XML(s).errors.empty? }
=> {false=>["", "foo"], true=>["<xml></xml>"]}
Assign the result of group_by to a variable, and you'll have a hash you can check for non-XML (false) or XML (true).
There is no such function in Ruby's String class or Active Support's String extensions, but you can use Nokogiri to detect errors in XML:
require 'nokogiri'

begin
  # badly_formed holds the string you want to check
  bad_doc = Nokogiri::XML(badly_formed) { |config| config.strict }
rescue Nokogiri::XML::SyntaxError => e
  puts "caught exception: #{e}"
end