I'm trying to write two functions, escape(text, delimiter) and unescape(text, delimiter), with the following properties:
1. The result of escape does not contain delimiter.
2. unescape is the reverse of escape, i.e. unescape(escape(text, delimiter), delimiter) == text for all values of text and delimiter.
It is OK to restrict the allowed values of delimiter.
Background: I want to create a delimiter-separated string of values. To be able to extract the same list out of the string again, I must ensure that the individual, separated strings do not contain the separator.
What I've tried: I came up with a simple solution (pseudo-code):
escape(text, delimiter): return text.Replace("\", "\\").Replace(delimiter, "\d")
unescape(text, delimiter): return text.Replace("\d", delimiter).Replace("\\", "\")
but discovered that property 2 failed on the test string "\d<delimiter>". Currently, I have the following working solution:
escape(text, delimiter): return text.Replace("\", "\b").Replace(delimiter, "\d")
unescape(text, delimiter): return text.Replace("\d", delimiter).Replace("\b", "\")
which seems to work, as long as delimiter is not \, b or d (which is fine, I don't want to use those as delimiters anyway). However, since I have not formally proven its correctness, I'm afraid that I have missed some case where one of the properties is violated. Since this is such a common problem, I assume that there is already a "well-known proven-correct" algorithm for this, hence my question (see title).
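Short of a formal proof, both properties can at least be checked exhaustively on short strings. The following is only a sketch in Python (the question itself is pseudo-code, so the function names, the test alphabet and the length bound are all assumptions made for illustration). It enumerates every string up to a given length over a small alphabet that contains the troublesome characters and collects the inputs that violate either property.

```python
from itertools import product

def escape_v1(text, delimiter):
    # First attempt: "\" -> "\\", delimiter -> "\d"
    return text.replace("\\", "\\\\").replace(delimiter, "\\d")

def unescape_v1(text, delimiter):
    # Two sequential passes over the whole string.
    return text.replace("\\d", delimiter).replace("\\\\", "\\")

def escape_v2(text, delimiter):
    # Second attempt: "\" -> "\b", delimiter -> "\d"
    return text.replace("\\", "\\b").replace(delimiter, "\\d")

def unescape_v2(text, delimiter):
    return text.replace("\\d", delimiter).replace("\\b", "\\")

def counterexamples(escape, unescape, delimiter, alphabet, max_len):
    """Return all strings over `alphabet` (length <= max_len) that break
    property 1 (no delimiter in the result) or property 2 (round trip)."""
    bad = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            text = "".join(chars)
            escaped = escape(text, delimiter)
            if delimiter in escaped or unescape(escaped, delimiter) != text:
                bad.append(text)
    return bad

# Hypothetical test setup: ";" as delimiter, alphabet includes "\", "b", "d".
alphabet = "\\bd;x"
v1_bad = counterexamples(escape_v1, unescape_v1, ";", alphabet, 4)
v2_bad = counterexamples(escape_v2, unescape_v2, ";", alphabet, 4)
```

With this setup the first attempt already fails on the two-character string \d, while the second attempt produces no counterexamples. That is evidence rather than a proof, and it only covers a single-character delimiter other than \, b and d.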
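Your first algorithm is correct.
The error is in the implementation of unescape(): you need to replace both \d by delimiter and \\ by \ in the same pass. You can't use several calls to Replace() like this, because the first call scans the whole string and can match a \d whose backslash is really the second half of an escaped backslash (\\ followed by a literal d).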
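A single pass can be written as a small scanner that consumes the character after each backslash together with it. Below is a minimal sketch in Python, used here only for illustration; the parameter names esc and code, the helper _check and the choice to raise on malformed input are assumptions, not part of the question.

```python
def _check(delimiter, esc, code):
    # The scheme is only unambiguous for three distinct single characters.
    if len({delimiter, esc, code}) != 3 or any(len(c) != 1 for c in (delimiter, esc, code)):
        raise ValueError("delimiter, esc and code must be three distinct characters")

def escape(text, delimiter, esc="\\", code="d"):
    """Replace esc by esc+esc and delimiter by esc+code, one character at a time."""
    _check(delimiter, esc, code)
    out = []
    for ch in text:
        if ch == esc:
            out.append(esc + esc)
        elif ch == delimiter:
            out.append(esc + code)
        else:
            out.append(ch)
    return "".join(out)

def unescape(text, delimiter, esc="\\", code="d"):
    """Undo escape() in a single left-to-right pass."""
    _check(delimiter, esc, code)
    out = []
    chars = iter(text)
    for ch in chars:
        if ch != esc:
            out.append(ch)
            continue
        nxt = next(chars, None)  # consume the character after the escape
        if nxt == esc:
            out.append(esc)
        elif nxt == code:
            out.append(delimiter)
        else:
            raise ValueError("malformed escape sequence")
    return "".join(out)
```

Because every backslash is consumed together with the character that follows it, the literal d in the escaped form of \d can no longer be mistaken for the end of a \d sequence, so the test string from the question round-trips.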
Here's some sample C# code for safe quoting of delimiter-separated strings:
// Requires: using System; using System.Collections.Generic; using System.Linq; using System.Text;
// Escapes a single value: quoteChar -> quoteChar quoteChar, separator -> quoteChar otherChar.
static string QuoteSeparator(string str,
    char separator, char quoteChar, char otherChar) // "~" -> "~~"     ";" -> "~s"
{
    var sb = new StringBuilder(str.Length);
    foreach (char c in str)
    {
        if (c == quoteChar)
        {
            sb.Append(quoteChar);
            sb.Append(quoteChar);
        }
        else if (c == separator)
        {
            sb.Append(quoteChar);
            sb.Append(otherChar);
        }
        else
        {
            sb.Append(c);
        }
    }
    return sb.ToString(); // no separator in the result -> Join/Split is safe
}
static string UnquoteSeparator(string str,
    char separator, char quoteChar, char otherChar) // "~~" -> "~"     "~s" -> ";"
{
    var sb = new StringBuilder(str.Length);
    bool isQuoted = false; // true when the previous char was an unconsumed quoteChar
    foreach (char c in str)
    {
        if (isQuoted)
        {
            if (c == otherChar)
                sb.Append(separator);
            else
                sb.Append(c);
            isQuoted = false;
        }
        else
        {
            if (c == quoteChar)
                isQuoted = true;
            else
                sb.Append(c);
        }
    }
    if (isQuoted)
        throw new ArgumentException("input string is not correctly quoted");
    return sb.ToString(); // ";" are restored
}
/// <summary>
/// Encodes the given strings as a single string.
/// </summary>
/// <param name="input">The strings.</param>
/// <param name="separator">The separator.</param>
/// <param name="quoteChar">The quote char.</param>
/// <param name="otherChar">The other char.</param>
/// <returns>The quoted strings, joined by the separator.</returns>
// Extension method: must be declared inside a static class.
public static string QuoteAndJoin(this IEnumerable<string> input,
    char separator = ';', char quoteChar = '~', char otherChar = 's')
{
    // CommonHelper is the author's own helper class (not shown here).
    CommonHelper.CheckNullReference(input, "input");
    if (separator == quoteChar || quoteChar == otherChar || separator == otherChar)
        throw new ArgumentException("cannot quote: ambiguous format");
    return string.Join(new string(separator, 1),
        (from str in input select QuoteSeparator(str, separator, quoteChar, otherChar)).ToArray());
}
/// <summary>
/// Decodes the strings encoded in a single string.
/// </summary>
/// <param name="encoded">The encoded.</param>
/// <param name="separator">The separator.</param>
/// <param name="quoteChar">The quote char.</param>
/// <param name="otherChar">The other char.</param>
/// <returns>The decoded strings, in their original order.</returns>
// Extension method: must be declared inside a static class.
public static IEnumerable<string> SplitAndUnquote(this string encoded,
    char separator = ';', char quoteChar = '~', char otherChar = 's')
{
    // CommonHelper is the author's own helper class (not shown here).
    CommonHelper.CheckNullReference(encoded, "encoded");
    if (separator == quoteChar || quoteChar == otherChar || separator == otherChar)
        throw new ArgumentException("cannot unquote: ambiguous format");
    return from s in encoded.Split(separator)
           select UnquoteSeparator(s, separator, quoteChar, otherChar);
}
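For comparison, the same quoting scheme can be expressed with one regular-expression substitution per direction, which is also a single pass. This is only a sketch in Python, not a port of the C# code above: the names quote_and_join and split_and_unquote are made up for illustration, the constants mirror the C# defaults, and unlike the C# version it rejects any character other than the quote char or the other char after a quote char.

```python
import re

SEP, QUOTE, OTHER = ";", "~", "s"  # same defaults as the C# parameters

# Matches every character that needs quoting.
_QUOTE_RE = re.compile("[" + re.escape(QUOTE + SEP) + "]")
# Matches a quote char plus the following character (if any), newlines included.
_UNQUOTE_RE = re.compile(re.escape(QUOTE) + "(.?)", re.DOTALL)

def _quote(s):
    table = {QUOTE: QUOTE + QUOTE, SEP: QUOTE + OTHER}
    return _QUOTE_RE.sub(lambda m: table[m.group()], s)

def _unquote(s):
    def repl(m):
        nxt = m.group(1)
        if nxt == QUOTE:
            return QUOTE
        if nxt == OTHER:
            return SEP
        # Covers a trailing quote char and unknown two-character sequences.
        raise ValueError("input string is not correctly quoted")
    return _UNQUOTE_RE.sub(repl, s)

def quote_and_join(items):
    """Quote each item, then join with the separator."""
    return SEP.join(_quote(s) for s in items)

def split_and_unquote(encoded):
    """Split on the separator, then unquote each part."""
    return [_unquote(part) for part in encoded.split(SEP)]

items = ["a;b", "~s", "", "x~;y"]
encoded = quote_and_join(items)
decoded = split_and_unquote(encoded)
```

Here decoded equals items again. One caveat that applies to any join/split design of this kind: an empty list and a list holding a single empty string both encode to the empty string, so that case has to be handled separately if it matters.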