I am loading a lot of XML documents, and some of them fail with errors like "hexadecimal value 0x12, is an invalid character". The offending character differs from document to document. How can I remove these characters?
I did a little research on this.
The ASCII table has 128 characters. Here is a small test that substitutes every ASCII character, one at a time, into a document and tries to load the result as XML.
    // Requires: using System; using System.IO; using System.Xml;
    public static void RegexTry()
    {
        string xmlfile;
        using (StreamReader stream = new StreamReader(@"test.xml"))
        {
            xmlfile = stream.ReadToEnd();
        }

        for (int i = 0; i < 128; i++)
        {
            char t = (char)i;
            // 'П' is the placeholder in test.xml that is replaced by each test character.
            string text = xmlfile.Replace('П', t);
            XmlDocument xml = new XmlDocument();
            try
            {
                xml.LoadXml(text);
            }
            catch (Exception ex)
            {
                Console.WriteLine("Char(" + i.ToString() + "): " + t + " => error! " + ex.Message);
                continue;
            }
            Console.WriteLine("Char(" + i.ToString() + "): " + t + " => fine!");
        }
        Console.ReadKey();
    }
As a result it returns:
Char(0): => error! '.', hexadecimal value 0x00, is an invalid character. Line 5, position 7.
Char(1): => error! '', hexadecimal value 0x01, is an invalid character. Line 5, position 7.
Char(2): => error! '', hexadecimal value 0x02, is an invalid character. Line 5, position 7.
Char(3): => error! '', hexadecimal value 0x03, is an invalid character. Line 5, position 7.
Char(4): => error! '', hexadecimal value 0x04, is an invalid character. Line 5, position 7.
Char(5): => error! '', hexadecimal value 0x05, is an invalid character. Line 5, position 7.
Char(6): => error! '', hexadecimal value 0x06, is an invalid character. Line 5, position 7.
Char(7): => error! '', hexadecimal value 0x07, is an invalid character. Line 5, position 7.
Char(8): => error! '', hexadecimal value 0x08, is an invalid character. Line 5, position 7.
Char(9): => fine!
Char(10): => fine!
Char(11): => error! '', hexadecimal value 0x0B, is an invalid character. Line 5, position 7.
Char(12): => error! '', hexadecimal value 0x0C, is an invalid character. Line 5, position 7.
Char(13): => fine!
Char(14): => error! '', hexadecimal value 0x0E, is an invalid character. Line 5, position 7.
Char(15): => error! '', hexadecimal value 0x0F, is an invalid character. Line 5, position 7.
Char(16): => error! '', hexadecimal value 0x10, is an invalid character. Line 5, position 7.
Char(17): => error! '', hexadecimal value 0x11, is an invalid character. Line 5, position 7.
Char(18): => error! '', hexadecimal value 0x12, is an invalid character. Line 5, position 7.
Char(19): => error! '', hexadecimal value 0x13, is an invalid character. Line 5, position 7.
Char(20): => error! '', hexadecimal value 0x14, is an invalid character. Line 5, position 7.
Char(21): => error! '', hexadecimal value 0x15, is an invalid character. Line 5, position 7.
Char(22): => error! '', hexadecimal value 0x16, is an invalid character. Line 5, position 7.
Char(23): => error! '', hexadecimal value 0x17, is an invalid character. Line 5, position 7.
Char(24): => error! '', hexadecimal value 0x18, is an invalid character. Line 5, position 7.
Char(25): => error! '', hexadecimal value 0x19, is an invalid character. Line 5, position 7.
Char(26): => error! '', hexadecimal value 0x1A, is an invalid character. Line 5, position 7.
Char(27): => error! '', hexadecimal value 0x1B, is an invalid character. Line 5, position 7.
Char(28): => error! '', hexadecimal value 0x1C, is an invalid character. Line 5, position 7.
Char(29): => error! '', hexadecimal value 0x1D, is an invalid character. Line 5, position 7.
Char(30): => error! '', hexadecimal value 0x1E, is an invalid character. Line 5, position 7.
Char(31): => error! '', hexadecimal value 0x1F, is an invalid character. Line 5, position 7.
Char(32): => fine!
Char(33): ! => fine!
Char(34): " => fine!
Char(35): # => fine!
Char(36): $ => fine!
Char(37): % => fine!
Char(38): & => error! An error occurred while parsing EntityName. Line 5, position 8.
Char(39): ' => fine!
Char(40): ( => fine!
Char(41): ) => fine!
Char(42): * => fine!
Char(43): + => fine!
Char(44): , => fine!
Char(45): - => fine!
Char(46): . => fine!
Char(47): / => fine!
Char(48): 0 => fine!
Char(49): 1 => fine!
Char(50): 2 => fine!
Char(51): 3 => fine!
Char(52): 4 => fine!
Char(53): 5 => fine!
Char(54): 6 => fine!
Char(55): 7 => fine!
Char(56): 8 => fine!
Char(57): 9 => fine!
Char(58): : => fine!
Char(59): ; => fine!
Char(60): < => error! The '<' character, hexadecimal value 0x3C, cannot be included in a name. Line 5, position 13.
Char(61): = => fine!
Char(62): > => fine!
Char(63): ? => fine!
Char(64): @ => fine!
Char(65): A => fine!
Char(66): B => fine!
Char(67): C => fine!
Char(68): D => fine!
Char(69): E => fine!
Char(70): F => fine!
Char(71): G => fine!
Char(72): H => fine!
Char(73): I => fine!
Char(74): J => fine!
Char(75): K => fine!
Char(76): L => fine!
Char(77): M => fine!
Char(78): N => fine!
Char(79): O => fine!
Char(80): P => fine!
Char(81): Q => fine!
Char(82): R => fine!
Char(83): S => fine!
Char(84): T => fine!
Char(85): U => fine!
Char(86): V => fine!
Char(87): W => fine!
Char(88): X => fine!
Char(89): Y => fine!
Char(90): Z => fine!
Char(91): [ => fine!
Char(92): \ => fine!
Char(93): ] => fine!
Char(94): ^ => fine!
Char(95): _ => fine!
Char(96): ` => fine!
Char(97): a => fine!
Char(98): b => fine!
Char(99): c => fine!
Char(100): d => fine!
Char(101): e => fine!
Char(102): f => fine!
Char(103): g => fine!
Char(104): h => fine!
Char(105): i => fine!
Char(106): j => fine!
Char(107): k => fine!
Char(108): l => fine!
Char(109): m => fine!
Char(110): n => fine!
Char(111): o => fine!
Char(112): p => fine!
Char(113): q => fine!
Char(114): r => fine!
Char(115): s => fine!
Char(116): t => fine!
Char(117): u => fine!
Char(118): v => fine!
Char(119): w => fine!
Char(120): x => fine!
Char(121): y => fine!
Char(122): z => fine!
Char(123): { => fine!
Char(124): | => fine!
Char(125): } => fine!
Char(126): ~ => fine!
Char(127): => fine!
You can see that a lot of characters, mostly control characters, cannot appear in an XML document. To strip them out we can use Regex.Replace:
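    // Requires: using System.Text.RegularExpressions;
    static string ReplaceHexadecimalSymbols(string txt)
    {
        // Control characters that XML 1.0 forbids. Tab (0x09), LF (0x0A) and CR (0x0D) are allowed.
        // '&' (0x26) is deliberately left out of the class: it is a legal character, and
        // stripping it would also break entity references such as &amp;.
        string r = @"[\x00-\x08\x0B\x0C\x0E-\x1F]";
        return Regex.Replace(txt, r, "", RegexOptions.Compiled);
    }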
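The ampersand failure in the test output (Char(38)) is a different kind of problem from the control characters: & is a legal XML character that has to be escaped, not deleted. Below is a minimal sketch that escapes only the ampersands that do not already start something shaped like an entity or character reference. The class and method names are made up for illustration, and the pattern is a heuristic: it does not know about CDATA sections or comments, where a bare & is legitimate.

```csharp
using System.Text.RegularExpressions;

public static class AmpersandFix
{
    // Matches '&' unless it is followed by a name or numeric reference ending in ';'
    // (for example &amp; or &#x41; or &#65;). Hypothetical helper, not a library API.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:[A-Za-z_:][A-Za-z0-9_:.\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)",
        RegexOptions.CultureInvariant);

    // Replaces every bare ampersand with the &amp; entity and leaves
    // existing references untouched.
    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }
}
```

For example, `AmpersandFix.EscapeBareAmpersands("<a>Tom & Jerry &amp; co</a>")` gives `<a>Tom &amp; Jerry &amp; co</a>`.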
PS. Sorry if everybody knew that.
The XML specification defines the valid characters like this:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
As you can see, #x12 is not a valid character in an XML document.
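To make the rule concrete, here is a rough sketch that checks a string against that Char production and reports where the first offending character sits, which is handy when you want to reject a document with a useful message. The class and method names are invented for this example. It works on UTF-16 strings, so a valid surrogate pair is combined into one code point and a lone surrogate counts as invalid.

```csharp
using System;

public static class XmlCharCheck
{
    // True if the code point matches the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static bool IsValidXmlCodePoint(int cp)
    {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first UTF-16 code unit that starts an invalid
    // character, or -1 if every character is allowed.
    public static int IndexOfInvalidXmlChar(string text)
    {
        for (int i = 0; i < text.Length; i++)
        {
            int cp;
            bool isPair = char.IsHighSurrogate(text[i])
                && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]);
            if (isPair)
            {
                // Combine the pair into a single supplementary code point.
                cp = char.ConvertToUtf32(text[i], text[i + 1]);
            }
            else
            {
                // A lone surrogate falls in 0xD800-0xDFFF and is rejected below.
                cp = text[i];
            }
            if (!IsValidXmlCodePoint(cp))
            {
                return i;
            }
            if (isPair)
            {
                i++; // skip the low surrogate that was just consumed
            }
        }
        return -1;
    }
}
```

With this, `XmlCharCheck.IndexOfInvalidXmlChar("ab\u0012c")` returns 2, and a string with no forbidden characters returns -1.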
You ask how to remove them, but I don't think that is the question you should be asking. They should simply not be present. You should reject any such document as malformed. Simply removing invalid characters suppresses the real problem.
If you are creating the documents in question, then you need to fix the code that generates them so that it produces valid XML.
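If the generating side happens to be .NET code (an assumption, the question does not say), one way to get valid output is to build the XML with System.Xml.Linq instead of concatenating strings, so that the serializer takes care of escaping. A small sketch, with a made-up element name and helper:

```csharp
using System.Xml.Linq;

public static class XmlBuild
{
    // Hypothetical helper: wraps arbitrary text in a <note> element.
    // The element name and method name are illustrative only.
    public static string BuildNote(string body)
    {
        XElement element = new XElement("note", body);
        // ToString() serializes the element and escapes markup characters
        // such as '&' and '<' in the text content.
        return element.ToString();
    }
}
```

`XmlBuild.BuildNote("Tom & Jerry <3")` produces `<note>Tom &amp; Jerry &lt;3</note>`. As far as I know, with default settings the serializer also refuses to write a forbidden control character such as 0x12 instead of emitting it, so the bad data shows up at the point where it is generated.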