
Determine if a string contains a base64 string inside of it

I'm trying to figure out a way to parse out a base64 string from within a larger string.

I have the string "Hello <base64 content> World" and I want to be able to parse out the base64 content and convert it back to a string, giving "Hello Awesome World".

Answers in C# preferred.

Edit: Updated with a more realistic example.

--abcdef
\n
Content-Type: Text/Plain;
Content-Transfer-Encoding: base64
\n
<base64 content>
\n
--abcdef--

This is taken from one sample. The problem is that the Content.... vary quite a bit from one record to the next.

Adam asked Oct 04 '10 18:10




2 Answers

There is no reliable way to do it. How would you know that, for instance, "Hello" is not a base64 string? OK, that's a bad example, because base64 is supposed to be padded so that the length is a multiple of 4, but what about "overflow"? It's 8 characters long and it is a valid base64 string (it would decode to "¢÷«~Z0"), even though it's obviously a normal word to a human reader. There's just no way you can tell for sure whether a word is a normal word or base64-encoded text.
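
To make the ambiguity concrete, here is a minimal sketch showing that an ordinary word passes a base64 validity check. The class name AmbiguityDemo and the helper LooksLikeBase64 are made up for illustration. The decoded bytes are printed as hex; read as Latin-1, they are the characters quoted above.

```csharp
using System;

public static class AmbiguityDemo
{
    // Hypothetical helper: true when Convert.FromBase64String accepts the input.
    public static bool LooksLikeBase64(string candidate)
    {
        try
        {
            Convert.FromBase64String(candidate);
            return true;
        }
        catch (FormatException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // "Hello" has 5 characters, so it is rejected (length is not a multiple of 4).
        Console.WriteLine(LooksLikeBase64("Hello"));     // False

        // "overflow" has 8 characters, all from the base64 alphabet, so it is accepted.
        Console.WriteLine(LooksLikeBase64("overflow"));  // True

        // The decoded bytes are not meaningful text, but nothing flags that.
        byte[] bytes = Convert.FromBase64String("overflow");
        Console.WriteLine(BitConverter.ToString(bytes)); // A2-F7-AB-7E-5A-30
    }
}
```

A successful decode therefore only says the token could be base64, not that it was meant to be.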

The fact that you have base64-encoded text embedded in normal text is clearly a design mistake. I suggest you do something about it rather than trying to do something impossible...

Thomas Levesque answered Oct 23 '22 07:10


In short form you could:

  • split the string on any chars that are not valid base64 data or padding
  • try to convert each token
  • if the conversion succeeds, call replace on the original string to switch the token with the converted value

In code:

// Requires: using System; using System.Text; using System.Text.RegularExpressions;
// Split on runs of characters that cannot appear in base64 data or padding.
var possibles = Regex.Split(value, @"[^A-Za-z0-9+/=]+");

foreach (var match in possibles)
{
    try
    {
        var converted = Convert.FromBase64String(match);
        var text = Encoding.UTF8.GetString(converted);
        if (!string.IsNullOrEmpty(text))
        {
            // Note: Replace swaps every occurrence of the token, not just this one.
            value = value.Replace(match, text);
        }
    }
    catch (FormatException)
    {
        // Not valid base64 (wrong length or misplaced padding); leave the token as-is.
    }
}

Without a delimiter, though, you can end up converting non-base64 text that happens to also be valid as base64-encoded text.

Looking at your example of trying to convert "Hello QXdlc29tZQ== World" to "Hello Awesome World" the above algorithm could easily generate something like "ée¡Ý•Í½µ”¢¹]" by trying to convert the whole string from base64 since there is no delimiter between plain and encoded text.
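
The false-positive risk can be seen with a small self-contained sketch. It is a variant of the idea above rather than the same code: it uses Regex.Replace with a callback so that only the matched token is swapped, and the names SplitAndDecodeDemo and DecodeEmbedded are hypothetical. When whitespace happens to separate the encoded token, the result looks right, but any ordinary word whose length is a multiple of 4 is decoded as well.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class SplitAndDecodeDemo
{
    // Hypothetical helper: tries to decode every base64-looking token in place.
    public static string DecodeEmbedded(string value)
    {
        // Each match is a run of base64 characters with up to two '=' of padding.
        return Regex.Replace(value, @"[A-Za-z0-9+/]+={0,2}", m =>
        {
            try
            {
                byte[] bytes = Convert.FromBase64String(m.Value);
                return Encoding.UTF8.GetString(bytes);
            }
            catch (FormatException)
            {
                return m.Value; // not valid base64: keep the original token
            }
        });
    }

    public static void Main()
    {
        // "Hello" and "World" have 5 characters each, so they fail to decode and are kept.
        Console.WriteLine(DecodeEmbedded("Hello QXdlc29tZQ== World")); // Hello Awesome World

        // "data" and "overflow" are plain words but also valid base64, so they get mangled.
        Console.WriteLine(DecodeEmbedded("data overflow") == "data overflow"); // False
    }
}
```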

Update (based on comments):

If there are no '\n's in the base64 content and it is always preceded by "Content-Transfer-Encoding: base64\n", then there is a way:

  • split the string on '\n'
  • iterate over all the tokens until a token ends in "Content-Transfer-Encoding: base64"
  • the next token (if there are any) should be decoded (if possible) and then the replacement should be made in the original string
  • return to iterating until out of tokens

In code:

// Requires: using System;
private string ConvertMixedUpTextAndBase64(string value)
{
    var delimiters = new char[] { '\n' };
    var possibles = value.Split(delimiters, 
                                StringSplitOptions.RemoveEmptyEntries);

    for (int i = 0; i < possibles.Length - 1; i++)
    {
        // A header line announces that the next non-empty line is base64.
        if (possibles[i].EndsWith("Content-Transfer-Encoding: base64", StringComparison.Ordinal))
        {
            var nextTokenPlain = DecodeBase64(possibles[i + 1]);
            if (!string.IsNullOrEmpty(nextTokenPlain))
            {
                // Note: Replace swaps every occurrence of the encoded line in the input.
                value = value.Replace(possibles[i + 1], nextTokenPlain);
                i++; // skip the line that was just decoded
            }
        }                
    }
    return value;
}

// Returns the decoded UTF-8 text, or null when the input is null or not valid base64.
private string DecodeBase64(string text)
{
    string result = null;
    try
    {
        var converted = Convert.FromBase64String(text);
        result = System.Text.Encoding.UTF8.GetString(converted);
    }
    catch (System.ArgumentNullException)
    {
        // text was null; result stays null
    }
    catch (System.FormatException)
    {
        // text was not valid base64; result stays null
    }
    return result;
}
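
The approach above relies on the base64 content sitting on a single line. If the content may be wrapped across several lines, one possible variation is sketched below. It rests on assumptions that are not in the question: a line starting with "--" ends the body, "\r\n" line endings may occur, and blank lines around the body can be dropped. The names MimePartDemo and DecodeBase64Bodies are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MimePartDemo
{
    // Hypothetical helper: decodes the body that follows a base64 header line,
    // reading until a line that starts with "--" (assumed to be a boundary).
    public static string DecodeBase64Bodies(string value)
    {
        string[] lines = value.Replace("\r\n", "\n").Split('\n');
        var output = new List<string>();
        int i = 0;
        while (i < lines.Length)
        {
            output.Add(lines[i]);
            bool isHeader = lines[i].TrimEnd().EndsWith(
                "Content-Transfer-Encoding: base64", StringComparison.OrdinalIgnoreCase);
            i++;
            if (!isHeader) continue;

            // Gather the body lines up to the next boundary, dropping blank lines.
            var body = new StringBuilder();
            int j = i;
            while (j < lines.Length && !lines[j].StartsWith("--", StringComparison.Ordinal))
            {
                body.Append(lines[j].Trim());
                j++;
            }

            try
            {
                byte[] bytes = Convert.FromBase64String(body.ToString());
                output.Add(Encoding.UTF8.GetString(bytes));
                i = j; // skip the encoded lines that were just decoded
            }
            catch (FormatException)
            {
                // Body was not valid base64: the loop copies those lines unchanged.
            }
        }
        return string.Join("\n", output);
    }

    public static void Main()
    {
        string input = "--abcdef\nContent-Type: Text/Plain;\n" +
                       "Content-Transfer-Encoding: base64\n\nQXdlc29t\nZQ==\n\n--abcdef--";
        Console.WriteLine(DecodeBase64Bodies(input));
    }
}
```

If other headers follow the Content-Transfer-Encoding line, the gathered text will not decode and the part is left untouched, so this is a starting point rather than a MIME parser.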

jball answered Oct 23 '22 08:10