HowToNormalizeURL

Normalizing a URL (or URI), also called URL canonicalization, is often important when you want to store and query a database containing URLs. The main question is how URL normalization should be defined. I'm just trying to gather the various definitions available.

From RFC2396bis

http://gbiv.com/protocols/uri/rev-2002/rfc2396bis.html#comparison

From URI Perl module

" Returns a normalized version of the URI. The rules for normalization are scheme-dependent. They usually involve lowercasing the scheme and Internet host name components, removing the explicit port specification if it matches the default port, uppercasing all escape sequences, and unescaping octets that can be better represented as plain characters. "
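The rules quoted above can be sketched in a few lines of Python. This is only an illustration of those four rules, not the URI module's actual implementation; the function name `normalize_url`, the helper `_fix_escapes` and the small `DEFAULT_PORTS` table are assumptions made for the example.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Default ports for a few common schemes (illustrative, not exhaustive).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

# Characters that RFC 3986 calls "unreserved": better stored unescaped.
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escapes(text):
    """Uppercase %xx escapes and decode those that encode unreserved characters."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()
    return _ESCAPE.sub(repl, text)


def normalize_url(url):
    """Hypothetical helper applying the four rules quoted above."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if ":" in host:
        # urlsplit strips the brackets from IPv6 literals; put them back.
        host = "[" + host + "]"
    # Preserve any "user:password@" prefix untouched.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = userinfo + at + host
    # Keep the port only when it differs from the scheme default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc += ":" + str(parts.port)
    return urlunsplit((
        scheme,
        netloc,
        _fix_escapes(parts.path),
        _fix_escapes(parts.query),
        _fix_escapes(parts.fragment),
    ))


print(normalize_url("HTTP://Example.COM:80/%7euser/a%2fb?q=%c3%a9"))
```

Note that `%2f` is only uppercased, not decoded: an escaped slash is a reserved character, and decoding it would change the path structure.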

From Axis API Java

" normalize

public static java.net.URL normalize(java.net.URL url)

If the url points to a file then make sure we cleanup ".." "." etc. "
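The ".." and "." cleanup described above can be illustrated as follows. This is a rough Python sketch, not the Axis code; the names `remove_dot_segments` and `normalize_file_url` are made up for the example, and the sketch assumes an absolute path and also collapses empty segments ("//"), which is reasonable for file paths but not necessarily for other schemes.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path):
    """Collapse "." and ".." segments in an absolute, slash-separated path."""
    stack = []
    segments = path.split("/")
    for segment in segments:
        if segment == "..":
            # Go up one level, but never above the root.
            if stack:
                stack.pop()
        elif segment not in (".", ""):
            stack.append(segment)
    result = "/" + "/".join(stack)
    # Keep a trailing slash when the input ended in a directory reference.
    if stack and (path.endswith("/") or segments[-1] in (".", "..")):
        result += "/"
    return result


def normalize_file_url(url):
    """Clean up the path of file: URLs; leave every other URL untouched."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        return url
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))


print(normalize_file_url("file:///tmp/a/../b/./c.txt"))
```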

From a PHP class

http://www.phpclasses.org/browse/package/1844.html

" Normalization consists in making the pages be served under an URL without any query parameters that usually follow the question mark in the original URLs. The normalized URLs make the query parameters appear as if they are directory path names of site page virtual files. "

From a Go package (Normalize-URL)

https://github.com/apphacker/Normalize-URL

" A Go package to normalize a URL as per

http://en.wikipedia.org/wiki/URL_normalization

This package is not overly aggressive and errs on the side of preserving a working URL. "

Open Question

Currently, the rating of a web page applies to the page as a whole. The lookup mechanism also returns only an aggregated rating for the complete page, so there is not (yet) any point in processing anchor links (the fragment after the "#"). In short, we just remove them for now.
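Removing the anchor part is straightforward with Python's standard library. The following is a minimal sketch; the function name `strip_fragment` and the example URLs are only illustrative.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Return the URL without its "#anchor" part, if it has one."""
    return urldefrag(url).url


print(strip_fragment("http://example.com/page.html#section-2"))
print(strip_fragment("http://example.com/p?x=1#top"))
```

With this, two links that differ only in their anchor map to the same key, which matches the idea of a single rating per page.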