You can do this with the Restlet framework using Reference.normalize(). You should also be able to remove the elements you don't need quite conveniently with this class.

How to normalize a URL in Java?

Tags:

java

url-rewriting

URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.

Strategies include adding trailing slashes, https => http, etc. The Wikipedia page lists many.

Got a favorite method of doing this in Java? Perhaps a library (Nutch?), but I'm open. Smaller and fewer dependencies is better.

I'll handcode something for now and keep an eye on this question.

EDIT: I want to aggressively normalize to count URLs as the same if they refer to the same content. For example, I ignore the parameters utm_source, utm_medium, utm_campaign. For example, I ignore subdomain if the title is the same.


asked Jun 07 '10 22:06

dfrankow

7 Answers

Have you taken a look at the URI class?

http://docs.oracle.com/javase/7/docs/api/java/net/URI.html#normalize()
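To see what that method covers, here is a minimal sketch (the class name UriNormalizeDemo and the example URLs are made up for illustration). As the JDK documentation describes it, URI.normalize() only removes "." and ".." path segments; host case and percent-encoding are left as they are.

```java
import java.net.URI;

class UriNormalizeDemo {
    /** Returns the string form of URI.normalize() applied to the input. */
    static String normalize(String raw) {
        return URI.create(raw).normalize().toString();
    }

    public static void main(String[] args) {
        // Dot segments in the path are resolved.
        System.out.println(normalize("http://example.com/a/./b/../c")); // http://example.com/a/c
        // Host case and percent-escapes are not touched.
        System.out.println(normalize("http://EXAMPLE.com/%7Euser"));    // http://EXAMPLE.com/%7Euser
    }
}
```

So this is probably enough if path cleanup is all you need, but not for the more aggressive normalization the question asks about.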


answered Sep 30 '22 11:09

Nitrodist

I found this question last night, but none of the answers was what I was looking for, so I wrote my own. Here it is in case somebody wants it in the future:

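import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * - Convert the scheme and host to lowercase.
 * - Normalize the path (done by java.net.URI).
 * - Remove the default port number.
 * - Remove the fragment (the part after the #).
 * - Remove trailing slash.
 * - Sort the query string params (a repeated name keeps only its last value).
 * - Remove some query string params like "utm_*" and "*session*".
 */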
public class NormalizeURL
{
    public static String normalize(final String taintedURL) throws MalformedURLException
    {
        final URL url;
        try
        {
            url = new URI(taintedURL).normalize().toURL();
        }
        catch (URISyntaxException e) {
            throw new MalformedURLException(e.getMessage());
        }

        // replaceAll takes a regex; String.replace would look for the literal text "/$".
        final String path = url.getPath().replaceAll("/$", "");
        final SortedMap<String, String> params = createParameterMap(url.getQuery());
        final int port = url.getPort();
        final String queryString;

        if (params != null)
        {
            // Some params are only relevant for user tracking, so remove the most common ones.
            for (Iterator<String> i = params.keySet().iterator(); i.hasNext();)
            {
                final String key = i.next();
                if (key.startsWith("utm_") || key.contains("session"))
                {
                    i.remove();
                }
            }
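            // Avoid a dangling "?" when every parameter was filtered out.
            final String canonical = canonicalize(params);
            queryString = canonical.isEmpty() ? "" : "?" + canonical;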
        }
        else
        {
            queryString = "";
        }

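        // Lowercase scheme and host explicitly, and drop the port when it is the scheme's default.
        return url.getProtocol().toLowerCase(Locale.ROOT) + "://" + url.getHost().toLowerCase(Locale.ROOT)
            + (port != -1 && port != url.getDefaultPort() ? ":" + port : "")
            + path + queryString;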
    }

    /**
     * Takes a query string, separates the constituent name-value pairs, and
     * stores them in a SortedMap ordered by lexicographical order.
     * @return Null if there is no query string.
     */
    private static SortedMap<String, String> createParameterMap(final String queryString)
    {
        if (queryString == null || queryString.isEmpty())
        {
            return null;
        }

        final String[] pairs = queryString.split("&");
        final Map<String, String> params = new HashMap<String, String>(pairs.length);

        for (final String pair : pairs)
        {
            if (pair.length() < 1)
            {
                continue;
            }

            String[] tokens = pair.split("=", 2);
            for (int j = 0; j < tokens.length; j++)
            {
                try
                {
                    tokens[j] = URLDecoder.decode(tokens[j], "UTF-8");
                }
                catch (UnsupportedEncodingException ex)
                {
                    ex.printStackTrace();
                }
            }
            switch (tokens.length)
            {
                case 1:
                {
                    if (pair.charAt(0) == '=')
                    {
                        params.put("", tokens[0]);
                    }
                    else
                    {
                        params.put(tokens[0], "");
                    }
                    break;
                }
                case 2:
                {
                    params.put(tokens[0], tokens[1]);
                    break;
                }
            }
        }

        return new TreeMap<String, String>(params);
    }

    /**
     * Canonicalize the query string.
     *
     * @param sortedParamMap Parameter name-value pairs in lexicographical order.
     * @return Canonical form of query string.
     */
    private static String canonicalize(final SortedMap<String, String> sortedParamMap)
    {
        if (sortedParamMap == null || sortedParamMap.isEmpty())
        {
            return "";
        }

        final StringBuilder sb = new StringBuilder(350);
        final Iterator<Map.Entry<String, String>> iter = sortedParamMap.entrySet().iterator();

        while (iter.hasNext())
        {
            final Map.Entry<String, String> pair = iter.next();
            sb.append(percentEncodeRfc3986(pair.getKey()));
            sb.append('=');
            sb.append(percentEncodeRfc3986(pair.getValue()));
            if (iter.hasNext())
            {
                sb.append('&');
            }
        }

        return sb.toString();
    }

    /**
     * Percent-encode values according to RFC 3986. The built-in Java URLEncoder does not encode
     * according to the RFC, so we make the extra replacements.
     *
     * @param string Decoded string.
     * @return Encoded string per RFC 3986.
     */
    private static String percentEncodeRfc3986(final String string)
    {
        try
        {
            return URLEncoder.encode(string, "UTF-8").replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
        }
        catch (UnsupportedEncodingException e)
        {
            return string;
        }
    }
}

answered Sep 30 '22 13:09

Amy B

Because you also want to identify URLs that refer to the same content, you may find this WWW2007 paper interesting: Do Not Crawl in the DUST: Different URLs with Similar Text. It gives a nice theoretical approach to the problem.

answered Sep 30 '22 11:09

H6.

No, there is nothing in the standard libraries to do this. Canonicalization includes things like decoding unnecessarily encoded characters, converting hostnames to lowercase, etc.

e.g. http://ACME.com/./foo%26bar becomes:

http://acme.com/foo&bar

URI's normalize() does not do this.
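A hand-rolled version of those two steps could look roughly like the sketch below. The class name SyntaxNormalizer and its helpers are hypothetical, and it assumes an absolute URL with a plain host, no user info, and drops the fragment. One deliberate difference from the example above: it only decodes escapes of unreserved characters (letters, digits, "-", ".", "_", "~"), which is what RFC 3986 treats as equivalence-preserving, so %26 stays encoded. Decoding reserved characters such as & would be a more aggressive choice.

```java
import java.net.URI;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SyntaxNormalizer {
    private static final Pattern ESCAPE = Pattern.compile("%([0-9A-Fa-f]{2})");

    /** Decodes escapes of unreserved ASCII characters; uppercases the hex digits of all other escapes. */
    static String normalizeEscapes(String s) {
        Matcher m = ESCAPE.matcher(s);
        // StringBuffer keeps appendReplacement usable on Java 8 as well.
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            boolean unreserved = (c < 128 && Character.isLetterOrDigit(c)) || "-._~".indexOf(c) >= 0;
            String replacement = unreserved
                    ? String.valueOf(c)
                    : "%" + m.group(1).toUpperCase(Locale.ROOT);
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Lowercases scheme and host, resolves dot segments, and normalizes escapes in path and query. */
    static String normalize(String raw) {
        URI u = URI.create(raw).normalize();
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);
        String port = u.getPort() == -1 ? "" : ":" + u.getPort();
        String path = normalizeEscapes(u.getRawPath());
        String query = u.getRawQuery() == null ? "" : "?" + normalizeEscapes(u.getRawQuery());
        return scheme + "://" + host + port + path + query;
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://ACME.com/./foo%26bar"));  // http://acme.com/foo%26bar
        System.out.println(normalize("HTTP://ACME.com/%7Euser/%2f"));  // http://acme.com/~user/%2F
    }
}
```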

answered Sep 30 '22 11:09

protocol:        https
domain name:     i0.wp.com
subdomain:       i0
port:            55
path:            /lplresearch.com/wp-content/uploads/2019/01/feb.png
query:           ?ssl=1
parameters:      &myvar=2
fragment:        #myfragment

Code to do the URL parsing:

import java.util.*; 
import java.util.regex.*; 
public class regex { 
    // Each helper returns the first capture group, or null when the URL does not match.
    public static String getProtocol(String the_url){
        Pattern p = Pattern.compile("^(http|https|smtp|ftp|file|pop)://.*");
        Matcher m = p.matcher(the_url);
        // matches() must run before group(); otherwise group() throws IllegalStateException.
        return m.matches() ? m.group(1) : null;
    }
    public static String getParameters(String the_url){
        Pattern p = Pattern.compile(".*(\\?[-a-zA-Z0-9_.@!$&''()*+,;=]+)(#.*)*$");
        Matcher m = p.matcher(the_url);
        return m.matches() ? m.group(1) : null;
    }
    public static String getFragment(String the_url){
        Pattern p = Pattern.compile(".*(#.*)$");
        Matcher m = p.matcher(the_url);
        return m.matches() ? m.group(1) : null;
    }
    public static void main(String[] args){ 
        String the_url = 
            "https://i0.wp.com:55/lplresearch.com/" + 
            "wp-content/feb.png?ssl=1&myvar=2#myfragment"; 
        System.out.println(getProtocol(the_url)); 
        System.out.println(getFragment(the_url)); 
        System.out.println(getParameters(the_url)); 
    }   
}

Prints

https
#myfragment
?ssl=1&myvar=2

You can then push and pull on the parts of the URL until they pass muster.
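If you would rather not maintain the regexes yourself, the same split can probably be had from java.net.URI's getters. This is only a sketch: the class name UriParts and the map keys are made up for illustration, and the URL is the one from the example above.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

class UriParts {
    /** Splits a URL into its components using java.net.URI; keys are illustrative names. */
    static Map<String, String> parts(String url) {
        URI uri = URI.create(url);
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("scheme", uri.getScheme());
        parts.put("host", uri.getHost());
        parts.put("port", String.valueOf(uri.getPort())); // -1 when no port is given
        parts.put("path", uri.getPath());
        parts.put("query", uri.getQuery());
        parts.put("fragment", uri.getFragment());
        return parts;
    }

    public static void main(String[] args) {
        String theUrl = "https://i0.wp.com:55/lplresearch.com/"
                + "wp-content/feb.png?ssl=1&myvar=2#myfragment";
        parts(theUrl).forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

Note that the getters return the query and fragment without the leading "?" and "#".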

answered Sep 30 '22 12:09

Eric Leschinski
