
How to normalize a URL in Java?

URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.

Strategies include adding trailing slashes, mapping https to http, and so on. The Wikipedia page on URL normalization lists many more.

Got a favorite method of doing this in Java? Perhaps a library (Nutch?), but I'm open. Smaller and fewer dependencies is better.

I'll handcode something for now and keep an eye on this question.

EDIT: I want to normalize aggressively, so that URLs count as the same if they refer to the same content. For example, I ignore the parameters utm_source, utm_medium, and utm_campaign, and I ignore the subdomain if the page title is the same.

dfrankow asked Jun 07 '10 22:06




7 Answers

Have you taken a look at the URI class?

http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()
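
As a rough illustration of what that method covers, here is a minimal sketch. The class name UriNormalizeSketch and the helper normalizePath are made up for this example. It assumes an absolute URL and shows that URI.normalize() resolves "." and ".." path segments while leaving host case and query order alone.

```java
import java.net.URI;

class UriNormalizeSketch {
    /** Applies only the syntactic path normalization offered by java.net.URI. */
    public static String normalizePath(String raw) {
        // URI.create throws IllegalArgumentException if the input is malformed.
        return URI.create(raw).normalize().toString();
    }

    public static void main(String[] args) {
        // "." and ".." segments are resolved.
        System.out.println(normalizePath("http://example.com/a/./b/../c"));
        // Host case and query parameter order are left as they are.
        System.out.println(normalizePath("http://EXAMPLE.com/x?b=2&a=1"));
    }
}
```

So it is a useful first step, but lowercasing, parameter filtering and similar rules still have to be added on top.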

Nitrodist answered Sep 30 '22 11:09


I found this question last night, but none of the answers was what I was looking for, so I wrote my own. Here it is in case somebody wants it in the future:

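import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * - Convert the scheme and host to lowercase.
 * - Normalize the path (done by java.net.URI).
 * - Keep the port number only if it is not the default.
 * - Remove the fragment (the part after the #).
 * - Remove trailing slash.
 * - Sort the query string params.
 * - Remove some query string params like "utm_*" and "*session*".
 */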
public class NormalizeURL
{
    public static String normalize(final String taintedURL) throws MalformedURLException
    {
        final URL url;
        try
        {
            url = new URI(taintedURL).normalize().toURL();
        }
        catch (URISyntaxException e) {
            throw new MalformedURLException(e.getMessage());
        }

        // replaceAll takes a regex; String.replace would search for the literal text "/$".
        final String path = url.getPath().replaceAll("/$", "");
        final SortedMap<String, String> params = createParameterMap(url.getQuery());
        final int port = url.getPort();
        final String queryString;

        if (params != null)
        {
            // Some params are only relevant for user tracking, so remove the most common ones.
            for (Iterator<String> i = params.keySet().iterator(); i.hasNext();)
            {
                final String key = i.next();
                if (key.startsWith("utm_") || key.contains("session"))
                {
                    i.remove();
                }
            }
            // Avoid a dangling "?" when every parameter was removed.
            queryString = params.isEmpty() ? "" : "?" + canonicalize(params);
        }
        else
        {
            queryString = "";
        }

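        // Lowercase the host explicitly rather than relying on java.net.URL to do it,
        // and drop the port when it is the default for the protocol (80 for http, 443 for https).
        return url.getProtocol() + "://" + url.getHost().toLowerCase(Locale.ROOT)
            + (port != -1 && port != url.getDefaultPort() ? ":" + port : "")
            + path + queryString;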
    }

    /**
     * Takes a query string, separates the constituent name-value pairs, and
     * stores them in a SortedMap ordered by lexicographical order.
     * @return Null if there is no query string.
     */
    private static SortedMap<String, String> createParameterMap(final String queryString)
    {
        if (queryString == null || queryString.isEmpty())
        {
            return null;
        }

        final String[] pairs = queryString.split("&");
        final Map<String, String> params = new HashMap<String, String>(pairs.length);

        for (final String pair : pairs)
        {
            if (pair.length() < 1)
            {
                continue;
            }

            String[] tokens = pair.split("=", 2);
            for (int j = 0; j < tokens.length; j++)
            {
                try
                {
                    tokens[j] = URLDecoder.decode(tokens[j], "UTF-8");
                }
                catch (UnsupportedEncodingException ex)
                {
                    ex.printStackTrace();
                }
            }
            switch (tokens.length)
            {
                case 1:
                {
                    if (pair.charAt(0) == '=')
                    {
                        params.put("", tokens[0]);
                    }
                    else
                    {
                        params.put(tokens[0], "");
                    }
                    break;
                }
                case 2:
                {
                    params.put(tokens[0], tokens[1]);
                    break;
                }
            }
        }

        return new TreeMap<String, String>(params);
    }

    /**
     * Canonicalize the query string.
     *
     * @param sortedParamMap Parameter name-value pairs in lexicographical order.
     * @return Canonical form of query string.
     */
    private static String canonicalize(final SortedMap<String, String> sortedParamMap)
    {
        if (sortedParamMap == null || sortedParamMap.isEmpty())
        {
            return "";
        }

        final StringBuilder sb = new StringBuilder(350);
        final Iterator<Map.Entry<String, String>> iter = sortedParamMap.entrySet().iterator();

        while (iter.hasNext())
        {
            final Map.Entry<String, String> pair = iter.next();
            sb.append(percentEncodeRfc3986(pair.getKey()));
            sb.append('=');
            sb.append(percentEncodeRfc3986(pair.getValue()));
            if (iter.hasNext())
            {
                sb.append('&');
            }
        }

        return sb.toString();
    }

    /**
     * Percent-encode values according to RFC 3986. The built-in Java URLEncoder does not encode
     * according to the RFC, so we make the extra replacements.
     *
     * @param string Decoded string.
     * @return Encoded string per RFC 3986.
     */
    private static String percentEncodeRfc3986(final String string)
    {
        try
        {
            return URLEncoder.encode(string, "UTF-8").replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
        }
        catch (UnsupportedEncodingException e)
        {
            return string;
        }
    }
}
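
One caveat with a map-based approach like the one above: a key that appears more than once (for example a=1&a=2) keeps only its last value. If repeated parameters matter, a possible variation is to filter and sort the raw pairs as a list instead. The sketch below is only an illustration; QueryFilterSketch and filterAndSort are hypothetical names, and it works on the still-encoded pairs without decoding them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class QueryFilterSketch {
    /**
     * Drops tracking parameters and sorts the remaining raw "key=value" pairs.
     * Repeated keys are kept. Returns "" when nothing is left, so the caller
     * can decide whether to emit a "?" at all.
     */
    public static String filterAndSort(String rawQuery) {
        if (rawQuery == null || rawQuery.isEmpty()) {
            return "";
        }
        List<String> kept = new ArrayList<>();
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue; // skip empty pieces such as in "a=1&&b=2"
            }
            String key = pair.split("=", 2)[0];
            // Same filter rules as the answer: utm_* and anything containing "session".
            if (key.startsWith("utm_") || key.contains("session")) {
                continue;
            }
            kept.add(pair);
        }
        Collections.sort(kept);
        return String.join("&", kept);
    }

    public static void main(String[] args) {
        System.out.println(filterAndSort("b=2&utm_source=x&a=1&a=0")); // prints a=0&a=1&b=2
    }
}
```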

Amy B answered Sep 30 '22 13:09


Because you also want to identify URLs that refer to the same content, you may find this WWW 2007 paper interesting: Do Not Crawl in the DUST: Different URLs with Similar Text. It offers a nice theoretical approach.

H6. answered Sep 30 '22 11:09


No, there is nothing in the standard libraries to do this. Canonicalization includes things like decoding unnecessarily encoded characters, converting hostnames to lowercase, etc.

e.g. http://ACME.com/./foo%26bar becomes:

http://acme.com/foo&bar

URI's normalize() does not do this; it only removes redundant "." and ".." path segments.
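
The lowercasing part, at least, is easy to layer on top of java.net.URI by hand. The following is a minimal sketch under stated assumptions: the input is an absolute URL with a host, user-info and fragment are simply dropped, and percent-escapes are left exactly as they were (so %26 stays %26). The names LowercaseHostSketch and lowerCaseSchemeAndHost are invented for this example.

```java
import java.net.URI;
import java.util.Locale;

class LowercaseHostSketch {
    /**
     * Resolves dot-segments with URI.normalize() and lowercases scheme and host.
     * Assumption: absolute, hierarchical URL; user-info and fragment are dropped.
     */
    public static String lowerCaseSchemeAndHost(String raw) {
        URI u = URI.create(raw).normalize();
        if (u.getScheme() == null || u.getHost() == null) {
            throw new IllegalArgumentException("Not an absolute URL with a host: " + raw);
        }
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append("://");
        sb.append(u.getHost().toLowerCase(Locale.ROOT));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        // Raw (still-encoded) path and query are used so nothing is decoded or re-encoded.
        sb.append(u.getRawPath());
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints http://acme.com/foo%26bar
        System.out.println(lowerCaseSchemeAndHost("HTTP://ACME.com/./foo%26bar"));
    }
}
```

Decoding unnecessarily encoded characters would need a further step that this sketch deliberately leaves out.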

Randy Hudson answered Sep 30 '22 11:09


The RL library (https://github.com/backchatio/rl) goes quite a way beyond java.net.URI.normalize(). It's written in Scala, but I imagine it should be usable from Java.

pdxleif answered Sep 30 '22 12:09


You can do this with the Restlet framework using Reference.normalize(). You should also be able to remove the elements you don't need quite conveniently with this class.

Bruno answered Sep 30 '22 13:09


In Java, normalize parts of a URL

Example of a URL: https://i0.wp.com:55/lplresearch.com/wp-content/feb.png?ssl=1&myvar=2#myfragment

protocol:        https
domain name:     i0.wp.com
subdomain:       i0
port:            55
path:            /lplresearch.com/wp-content/feb.png
query:           ?ssl=1&myvar=2
parameters:      ssl=1, myvar=2
fragment:        #myfragment

Code to do the URL parsing:

import java.util.regex.*;
public class regex {
    // Each getter returns capture group 1, or null if the URL does not match.
    // matches() must run before group(); otherwise group() throws IllegalStateException.
    public static String getProtocol(String the_url){
        Pattern p = Pattern.compile("^(http|https|smtp|ftp|file|pop)://.*");
        Matcher m = p.matcher(the_url);
        return m.matches() ? m.group(1) : null;
    }
    public static String getParameters(String the_url){
        Pattern p = Pattern.compile(".*(\\?[-a-zA-Z0-9_.@!$&''()*+,;=]+)(#.*)*$");
        Matcher m = p.matcher(the_url);
        return m.matches() ? m.group(1) : null;
    }
    public static String getFragment(String the_url){
        Pattern p = Pattern.compile(".*(#.*)$");
        Matcher m = p.matcher(the_url);
        return m.matches() ? m.group(1) : null;
    }
    public static void main(String[] args){
        String the_url =
            "https://i0.wp.com:55/lplresearch.com/" +
            "wp-content/feb.png?ssl=1&myvar=2#myfragment";
        System.out.println(getProtocol(the_url));
        System.out.println(getFragment(the_url));
        System.out.println(getParameters(the_url));
    }
}

Prints

https
#myfragment
?ssl=1&myvar=2

You can then push and pull on the parts of the URL until they pass muster.
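
If regular expressions feel brittle for this, the same parts can also be read with the getters on java.net.URI. This is just a sketch of that alternative, not part of the approach above; UriPartsSketch and parts are hypothetical names, and the keys of the returned map are chosen for this example only.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class UriPartsSketch {
    /** Splits an absolute URL into its parts using java.net.URI getters. */
    public static Map<String, String> parts(String raw) {
        URI u = URI.create(raw); // throws IllegalArgumentException if malformed
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("protocol", u.getScheme());
        parts.put("host", u.getHost());
        parts.put("port", String.valueOf(u.getPort())); // -1 when no port is given
        parts.put("path", u.getPath());
        parts.put("query", u.getQuery());               // without the leading '?'
        parts.put("fragment", u.getFragment());         // without the leading '#'
        return parts;
    }

    public static void main(String[] args) {
        String theUrl = "https://i0.wp.com:55/lplresearch.com/"
            + "wp-content/feb.png?ssl=1&myvar=2#myfragment";
        parts(theUrl).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

Note that, unlike the regex version, the query and fragment come back without their leading "?" and "#".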

Eric Leschinski answered Sep 30 '22 12:09