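How can I normalize a URL in Python?

I'd like to know how to normalize a URL in Python.

For example, if I have a URL string like: "http://www.example.com/foo goo/bar.html"

I need a Python library that will transform the extra space (or any other non-normalized character) into a proper URL.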

Have a look at this module: werkzeug.utils. (It now lives in werkzeug.urls.)

The function you are looking for is called "url_fix" and works like this:

    >>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
    'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

It is implemented in Werkzeug as follows:

    import urllib
    import urlparse

    def url_fix(s, charset='utf-8'):
        """Sometimes you get an URL by a user that just isn't a real
        URL because it contains unsafe characters like ' ' and so on.  This
        function can fix some of the problems in a similar way browsers
        handle data entered by the user:

        >>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
        'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

        :param charset: The target charset for the URL if the url was
                        given as unicode string.
        """
        if isinstance(s, unicode):
            s = s.encode(charset, 'ignore')
        scheme, netloc, path, qs, anchor = urlparse.urlsplit(s)
        path = urllib.quote(path, '/%')
        qs = urllib.quote_plus(qs, ':&=')
        return urlparse.urlunsplit((scheme, netloc, path, qs, anchor))
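The code above is Python 2. On Python 3, where urllib and urlparse were merged into urllib.parse and there is no separate unicode type, the same idea can be sketched roughly as follows. This is an adaptation for illustration, not Werkzeug's own code, and the name url_fix_py3 is made up here.

```python
from urllib.parse import quote, quote_plus, urlsplit, urlunsplit


def url_fix_py3(s):
    """Percent-encode unsafe characters in the path and query of a URL.

    Hypothetical Python 3 adaptation of the url_fix idea shown above.
    """
    scheme, netloc, path, qs, anchor = urlsplit(s)
    # Keep '/' and existing '%' escapes in the path untouched.
    path = quote(path, safe='/%')
    # In the query string, keep the separators ':', '&' and '='.
    qs = quote_plus(qs, safe=':&=')
    return urlunsplit((scheme, netloc, path, qs, anchor))


print(url_fix_py3('http://www.example.com/foo goo/bar.html'))
# http://www.example.com/foo%20goo/bar.html
```

Only the path and query are touched, so the scheme and host survive as they are.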

The real fix for this problem in Python 2.7

The right solution was:

    # percent encode url, fixing lame server errors for e.g., like space
    # within url paths.
    fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]")
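A small sketch of what that safe list does, written for Python 3 (urllib.parse.quote instead of the Python 2 quote above). The helper name quote_whole_url and the sample URLs are made up for illustration. Because "%" is in the safe list, escapes that are already present are left alone instead of being encoded a second time.

```python
from urllib.parse import quote

# Characters that are left as-is; copied from the snippet above.
SAFE = "%/:=&?~#+!$,;'@()*[]"


def quote_whole_url(fullurl):
    """Percent-encode everything outside SAFE (spaces, for instance)."""
    return quote(fullurl, safe=SAFE)


print(quote_whole_url("http://www.example.com/foo goo/bar.html?x=1#top"))
# http://www.example.com/foo%20goo/bar.html?x=1#top
print(quote_whole_url("http://www.example.com/a%20b c"))
# http://www.example.com/a%20b%20c
```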

For more information, see Issue 918368: "urllib doesn't correct server returned urls"

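Use urllib.quote or urllib.quote_plus

From the urllib documentation:

quote(string[, safe])

Replace special characters in string using the "%xx" escape. Letters, digits, and the characters "_.-" are never quoted. The optional safe parameter specifies additional characters that should not be quoted; its default value is '/'.

Example: quote('/~connolly/') yields '/%7econnolly/'.

quote_plus(string[, safe])

Like quote(), but also replaces spaces with plus signs, as required for quoting HTML form values. Plus signs in the original string are escaped unless they are included in safe. It also does not have safe default to '/'.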
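To make the difference concrete, here is a short comparison using the Python 3 equivalents, urllib.parse.quote and urllib.parse.quote_plus; the input strings are arbitrary examples. Note that the documentation quoted above is for Python 2; since Python 3.7 the "~" character is also left unquoted.

```python
from urllib.parse import quote, quote_plus

# quote() keeps '/' by default and turns a space into %20.
print(quote("foo goo/bar.html"))       # foo%20goo/bar.html

# quote_plus() turns a space into '+' and, having no default safe
# characters, also escapes '/'.
print(quote_plus("foo goo/bar.html"))  # foo+goo%2Fbar.html

# A literal '+' is escaped unless it is listed in safe.
print(quote_plus("a+b c"))             # a%2Bb+c
print(quote_plus("a+b c", safe="+"))   # a+b+c
```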

EDIT: Using urllib.quote or urllib.quote_plus on the whole URL will mangle it, as @ΤΖΩΤΖΙΟΥ points out:

    >>> quoted_url = urllib.quote('http://www.example.com/foo goo/bar.html')
    >>> quoted_url
    'http%3A//www.example.com/foo%20goo/bar.html'
    >>> urllib2.urlopen(quoted_url)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "c:\python25\lib\urllib2.py", line 124, in urlopen
        return _opener.open(url, data)
      File "c:\python25\lib\urllib2.py", line 373, in open
        protocol = req.get_type()
      File "c:\python25\lib\urllib2.py", line 244, in get_type
        raise ValueError, "unknown url type: %s" % self.__original
    ValueError: unknown url type: http%3A//www.example.com/foo%20goo/bar.html

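@ΤΖΩΤΖΙΟΥ provides a function that uses urlparse.urlparse and urlparse.urlunparse to parse the URL and encode only the path. That may be more useful for you. However, if you are building the URL from a known protocol and host but with a suspect path, you could probably do just as well by skipping urlparse, quoting only the suspect part of the URL, and concatenating it with the known-safe parts.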
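A minimal sketch of that last suggestion, in Python 3 syntax. BASE and build_url are hypothetical names, and the host is just the example from the question. The first print shows the mangling that happens when the whole URL is quoted; the second quotes only the untrusted path.

```python
from urllib.parse import quote

# Trusted scheme and host (example value, assumed to need no escaping).
BASE = "http://www.example.com"


def build_url(untrusted_path):
    """Quote only the suspect path and append it to the trusted prefix."""
    return BASE + quote(untrusted_path)


# Quoting the entire URL escapes the ':' after the scheme.
print(quote("http://www.example.com/foo goo/bar.html"))
# http%3A//www.example.com/foo%20goo/bar.html

print(build_url("/foo goo/bar.html"))
# http://www.example.com/foo%20goo/bar.html
```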

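Because this page is a top result for Google searches on the topic, I think it's worth mentioning some work on URL normalization with Python that goes beyond urlencoding space characters: for example, dealing with default ports, character case, missing trailing slashes, and so on.

When the Atom syndication format was being developed, there was some discussion on how to normalize URLs into a canonical form; this is documented in the article PaceCanonicalIds on the Atom/Pie wiki. That article provides some good test cases.

I believe one result of this discussion was Mark Nottingham's urlnorm.py library, which I have used on a couple of projects. That script doesn't work with the URL given in this question, however. So a better choice might be Sam Ruby's version of urlnorm.py, which handles that URL as well as all of the aforementioned test cases from the Atom wiki.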
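To give a feel for the kinds of rules mentioned above (default ports, character case, missing trailing slash), here is a rough Python 3 sketch. It is not urlnorm.py and covers only a handful of cases; normalize_url and DEFAULT_PORTS are names invented for this example, and it deliberately ignores user info and IPv6 hosts.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Ports that can be dropped for a given scheme (illustrative subset).
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Apply a few simple normalization rules to an http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    netloc = host
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"
    # Escape unsafe path characters; an empty path becomes '/'.
    path = quote(parts.path, safe="/%") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(normalize_url("HTTP://WWW.Example.COM:80/foo goo/bar.html"))
# http://www.example.com/foo%20goo/bar.html
print(normalize_url("http://example.com"))
# http://example.com/
```

A real normalizer has to handle many more cases, which is why a tested library is usually the better choice.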

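    import urlparse, urllib

    def myquote(url):
        parts = urlparse.urlparse(url)
        # parts[:2] and parts[3:] are tuples, so the quoted path must be
        # wrapped in a one-element tuple before concatenating
        return urlparse.urlunparse(parts[:2] + (urllib.quote(parts[2]),) + parts[3:])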

This quotes only the path component.

Otherwise, you could do: urllib.quote(url, safe=":/")
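For comparison, a Python 3 sketch of both options; quote_path_only is a hypothetical name and the query string was added to the example URL to show where the two differ. With safe=":/" the "?" and "=" of a query string get escaped too, so that shortcut is only suitable for URLs without a query.

```python
from urllib.parse import quote, urlparse


def quote_path_only(url):
    """Quote just the path component and reassemble the URL."""
    parts = urlparse(url)
    return parts._replace(path=quote(parts.path)).geturl()


url = "http://www.example.com/foo goo/bar.html?a=1"

print(quote_path_only(url))
# http://www.example.com/foo%20goo/bar.html?a=1

print(quote(url, safe=":/"))
# http://www.example.com/foo%20goo/bar.html%3Fa%3D1
```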

Just FYI, urlnorm has moved to GitHub: http://gist.github.com/246089

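I ran into this problem myself: I only needed to quote the space.

fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'@()*[]") does help, but it's too complicated.

So I used a simple approach: url = url.replace(' ', '%20'). It's not perfect, but it's the simplest way and it works for this situation.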

Valid for Python 3.5:

    import urllib.parse
    # raw string: same characters as "\./_-:" without the invalid "\." escape
    urllib.parse.quote(your_url, r"\./_-:")

Example:

    import urllib.parse
    print(urllib.parse.quote("http://www.example.com/foo goo/bar.html", r"\./_-:"))

The output will be http://www.example.com/foo%20goo/bar.html

Source: https://docs.python.org/3.5/library/urllib.parse.html?highlight=quote#urllib.parse.quote
