Check if a string is XML formatted [duplicate]

Tags: string, xml, ruby

I'm wondering if there's a function in Ruby like is_xml?(string) to identify if a given string is XML formatted.

mCY asked Dec 27 '12 09:12



2 Answers

Nokogiri's parse uses a simple regex test looking for <html> in an attempt to determine if the data to be parsed is HTML or XML:

string =~ /^\s*<[^Hh>]*html/i # Probably html

Something similar, looking for the XML declaration, would be a starting point:

string = '<?xml version="1.0"?><foo><bar></bar></foo>'
string.strip[/\A<\?xml/]
=> "<?xml"

If that returns anything other than nil, the string starts with an XML declaration. It's important to test for this because an empty string will fool the next step; it parses without any errors:

Nokogiri::XML('').errors.empty?
=> true
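
As a rough sketch, the two sniffing checks above could be wrapped in small predicates. The method names looks_like_html? and xml_declaration? are made up for illustration; neither is part of Ruby or Nokogiri, and the HTML regex is only in the spirit of the one quoted above, not necessarily what a current Nokogiri release uses. Keep in mind that the XML declaration is optional, so a missing declaration doesn't prove the string isn't XML.

```ruby
# Hypothetical helpers, not part of Ruby or Nokogiri.

# True when the text appears to open with an <html> tag or an HTML doctype.
def looks_like_html?(string)
  string.match?(/^\s*<[^Hh>]*html/i)
end

# True when the text, ignoring surrounding whitespace, starts with "<?xml".
def xml_declaration?(string)
  string.strip.match?(/\A<\?xml/)
end

xml  = '<?xml version="1.0"?><foo><bar></bar></foo>'
html = '<!DOCTYPE html><html><body></body></html>'

puts xml_declaration?(xml)      # true
puts xml_declaration?('')       # false
puts xml_declaration?('<foo/>') # false, even though <foo/> is well-formed XML
puts looks_like_html?(html)     # true
puts looks_like_html?(xml)      # false
```

Both checks only look at how the string begins, so they're best treated as a cheap pre-filter before actually parsing.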

Nokogiri also has the errors method, which returns an array of the errors found while parsing a malformed document. Checking whether that array is empty would help:

Nokogiri::XML('<foo>').errors
=> [#<Nokogiri::XML::SyntaxError: Premature end of data in tag foo line 1>]
Nokogiri::XML('<foo>').errors.empty?
=> false

Nokogiri::XML(string).errors.empty?
=> true

would be true if the document is well-formed.


I just tested Nokogiri to see if it could tell the difference between a regular string and true XML:

[2] (pry) main: 0> doc = Nokogiri::XML('foo').errors
[
    [0] #<Nokogiri::XML::SyntaxError: Start tag expected, '<' not found>
]

So, you can loop through your files and sort them into XML and non-XML easily:

require 'nokogiri'

[
  '',
  'foo',
  '<xml></xml>'
].group_by{ |s| !s.strip.empty? && Nokogiri::XML(s).errors.empty? }
=> {false=>["", "foo"], true=>["<xml></xml>"]}

Assign the result of group_by to a variable, and you'll have a hash you can check for non-XML (false) or XML (true).

the Tin Man answered Oct 23 '22 22:10


There is no such function in Ruby's String class or Active Support's String extensions, but you can use Nokogiri to detect errors in XML:

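require 'nokogiri'

begin
  # badly_formed is the string under test. With config.strict, Nokogiri
  # raises on malformed XML instead of trying to recover from it.
  bad_doc = Nokogiri::XML(badly_formed) { |config| config.strict }
rescue Nokogiri::XML::SyntaxError => e
  puts "caught exception: #{e}"
end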

nurettin answered Oct 23 '22 21:10