
Extract Absolute Links From A Page Using HTMLParser

I'm using the following snippet to extract all the links on a page using HTMLParser. I get quite a few relative URLs. How can I convert these to absolute URLs for a domain e.g. www

Solution 1:

You want

urlparse.urljoin(base, url[, allow_fragments])

http://docs.python.org/library/urlparse.html#urlparse.urljoin

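This resolves a relative URL against a base URL (typically the address of the page the link was found on) and returns the absolute result. It follows the standard relative-reference rules rather than plain string concatenation, so links such as /about, news/today.html, or ../img/logo.png all resolve correctly, and a link that is already absolute replaces the base instead of being appended to it. In Python 3 the same function lives at urllib.parse.urljoin.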
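The question's original snippet is not shown here, so the following is only a minimal sketch of how urljoin might be combined with HTMLParser. It is written for Python 3, where the modules are html.parser and urllib.parse (in Python 2 the equivalents are HTMLParser and urlparse). The names LinkExtractor and extract_links and the example.com URLs are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # URL of the page being parsed
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative hrefs are resolved against the page URL;
                # absolute hrefs pass through with the base ignored.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all links found in `html`, resolved against `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = (
        '<a href="/about">About</a>'
        '<a href="news/today.html">News</a>'
        '<a href="http://other.example.org/x">Elsewhere</a>'
    )
    for link in extract_links(page, "http://www.example.com/docs/index.html"):
        print(link)
```

Passing the full page URL as the base, rather than just the domain, matters for links like news/today.html, which are resolved relative to the page's directory (/docs/ in this example).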
