I am trying to work on my own JSON parser. I have an input string that I want to tokenize:
input = "{ \"foo\": \"bar\", \"num\": 3}"
How do I remove the escape character \ so that it is not a part of my tokens?
Currently, my solution using delete works:
tokens = input.delete('\\"').split("")
=> ["{", " ", "f", "o", "o", ":", " ", "b", "a", "r", ",", " ", "n", "u", "m", ":", " ", "3", "}"]
However, when I try to use gsub, it fails to find any \".
tokens = input.gsub('\\"', '').split("")
=> ["{", " ", "\"", "f", "o", "o", "\"", ":", " ", "\"", "b", "a", "r", "\"", ",", " ", "\"", "n", "u", "m", "\"", ":", " ", "3", "}"]
I have two questions:
1. Why does gsub not work in this case?
2. How do I remove the backslash (escape) character? I currently have to remove the backslash character with the quotes to make this work.
When you write:
input = "{ \"foo\": \"bar\", \"num\": 3}"
The actual string stored in input is:
{ "foo": "bar", "num": 3}
The escape \" here is interpreted by the Ruby parser so that it can distinguish the boundaries of the string (the leftmost and rightmost ") from an ordinary " character inside the string (the escaped ones).
String#delete takes a character set as its first parameter, not a pattern. Every character selected by that set is removed, wherever it occurs. So by writing
input.delete('\\"')
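you get a string with every individual " removed from input, rather than a string with every \" sequence removed. (Inside the character set, the backslash only escapes the character that follows it, and input contains no \ to begin with.) This is wrong for your case and may cause unexpected behavior later.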
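To see the set-versus-sequence difference in isolation, here is a minimal sketch on a made-up string (sample is hypothetical, not from the question). It avoids backslashes and quotes on purpose so that only the matching behavior is on display.

```ruby
# Hypothetical sample string, chosen only to illustrate the difference.
sample = "banana bandana"

# delete treats "an" as the set {a, n}: every a and every n is removed.
deleted = sample.delete("an")

# gsub with a String pattern removes only the exact sequence "an".
substituted = sample.gsub("an", "")

puts deleted      # prints: b bd
puts substituted  # prints: ba bda
```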
String#gsub, however, substitutes a pattern (either a regular expression or a plain string).
input.gsub('\\"', '')
means: find every \" (two characters in sequence) and replace it with the empty string. Since there is no \ in input, nothing gets replaced. What you actually need is:
input.gsub('"', '')
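As a quick check of the above, the following sketch runs both gsub calls on the string from the question. The variable names has_backslash, unchanged and stripped are only for illustration.

```ruby
input = "{ \"foo\": \"bar\", \"num\": 3}"

# The string holds quotes but no backslash characters.
has_backslash = input.include?("\\")

# Searching for the two-character sequence \" therefore changes nothing.
unchanged = input.gsub('\\"', '')

# Searching for a lone quote removes every quote.
stripped = input.gsub('"', '')

puts has_backslash       # prints: false
puts unchanged == input  # prints: true
puts stripped            # prints: { foo: bar, num: 3}
```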
You do not have backslashes in your string. You have quotes in your string, which need to be escaped when placed in a double-quoted string. Look:
input = "{ \"foo\": \"bar\", \"num\": 3}"
puts input
# => { "foo": "bar", "num": 3}
You are removing phantoms.
input.delete('\\"')
deletes every character in the set its argument describes. There are no backslashes in the string to delete, so in effect it deletes all the quotes. Without quotes, the default display method (inspect) no longer needs to escape anything.
input.gsub('\\"', '')
tries to delete the sequence \", which does not exist, so gsub ends up doing nothing.
Make sure you know the difference between a string's representation (puts input.inspect) and its content (puts input), and treat the backslashes as artifacts of the representation.
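A small sketch of that distinction, using the string from the question. The names content and representation are only illustrative.

```ruby
input = "{ \"foo\": \"bar\", \"num\": 3}"

content = input                 # the characters actually stored
representation = input.inspect  # Ruby source-style rendering, as irb echoes it

puts content         # prints: { "foo": "bar", "num": 3}
puts representation  # prints: "{ \"foo\": \"bar\", \"num\": 3}"

# Backslashes appear only in the representation, never in the content.
puts content.include?("\\")         # prints: false
puts representation.include?("\\")  # prints: true

# The representation is longer: 2 surrounding quotes plus 6 backslashes.
puts content.length         # prints: 25
puts representation.length  # prints: 33
```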
That said, I have to echo emaillenin: writing a correct JSON parser is not simple, and you can't do it with regular expressions (or at least, not with regular regular expressions; it might be possible with Oniguruma). It needs a proper parser like treetop or rex/racc, since it has a lot of corner cases that are easy to miss (chief among them being, ironically, escaped characters).
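For the tokenizing step alone, splitting into single characters is probably not what you want either. Below is a rough sketch using StringScanner from Ruby's standard library. The tokenize helper is hypothetical and far from a complete or validating JSON lexer (no \u escape checks, no exponents in numbers), but it shows why escaped characters need care: the string rule has to treat a backslash plus the following character as one unit, otherwise an escaped quote would end the token early. Nesting still has to be handled by a real parser on top of these tokens.

```ruby
require "strscan"

# Hypothetical helper: splits JSON-like text into coarse tokens.
# A sketch only, not a complete or validating JSON tokenizer.
def tokenize(text)
  scanner = StringScanner.new(text)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next # whitespace between tokens is dropped
    elsif (punct = scanner.scan(/[{}\[\]:,]/))
      tokens << punct
    elsif (str = scanner.scan(/"(?:[^"\\]|\\.)*"/))
      tokens << str # string token, quotes and escapes kept as written
    elsif (num = scanner.scan(/-?\d+(?:\.\d+)?/))
      tokens << num
    elsif (word = scanner.scan(/true|false|null/))
      tokens << word
    else
      raise ArgumentError, "unexpected character at offset #{scanner.pos}"
    end
  end
  tokens
end

input = "{ \"foo\": \"bar\", \"num\": 3}"
tokens = tokenize(input)
p tokens
# prints: ["{", "\"foo\"", ":", "\"bar\"", ",", "\"num\"", ":", "3", "}"]
```

Note that the quotes stay part of each string token here, which lets a later parsing stage tell the string "3" apart from the number 3.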
input.gsub(/[\"]/,"")
will also work.