Getting parts of a URL (Regex)

Given the URL (single line):
http://test.example.com/dir/subdir/file.html

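How can I extract the following parts using regular expressions:

The subdomain (test)
The domain (example.com)
The path without the file (/dir/subdir/)
The file (file.html)
The path with the file (/dir/subdir/file.html)
The URL without the path (http://test.example.com)
(add any other that you think would be useful)

The regex should work correctly even if I enter the following URL: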
http://example.example.com/example/example/example.html

Thanks.

A single regex to parse and break up a full URL, including query parameters and anchors, e.g.

https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash

^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$

RegExp positions:

url: RegExp['$&'],
protocol: RegExp.$2,
host: RegExp.$3,
path: RegExp.$4,
file: RegExp.$6,
query: RegExp.$7,
hash: RegExp.$8

You could then further parse the host ('.'-delimited) quite easily.
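As a rough illustration of the group positions listed above, here is a minimal Python sketch that applies the same pattern and then splits the host on '.'. The names `URL_RE` and `parse` are made up for this example; only the pattern itself comes from the answer.

```python
import re

# The pattern from the answer above, unchanged.
URL_RE = re.compile(
    r'^((http[s]?|ftp):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(.*)?(#[\w\-]+)?$'
)

def parse(url):
    """Return the parts by group position, or None if the URL does not match."""
    m = URL_RE.match(url)
    if m is None:
        return None
    host = m.group(3)
    return {
        'protocol': m.group(2),
        'host': host,
        'host_labels': host.split('.'),  # the "'.'-delimited" step
        'path': m.group(4),
        'file': m.group(6),
        'query': m.group(7),
        'hash': m.group(8),
    }

parts = parse('https://www.google.com/dir/1/2/search.html?arg=0-a&arg1=1-b&arg3-c#hash')
print(parts)
```

Note that with this pattern the greedy `(.*)` in group 7 also swallows `#hash`, so group 8 stays empty for this input.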

What I would do is use something like this:

    /*
        ^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$
    */
    proto    $1
    host     $2
    port     $3
    the-rest $4

Then parse "the rest" further, as specifically as you need. Doing it all in one regex is, well, a bit crazy.
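One way to read this two-stage idea, sketched in Python: the coarse pattern above does the first pass, and plain string operations handle "the rest". The function name `split_url` and the choice of `partition`/`rpartition` for the second pass are assumptions of this sketch, not part of the answer.

```python
import re

# First pass: the coarse pattern from the answer above.
FIRST_PASS = re.compile(r'^(.*:)//([A-Za-z0-9\-\.]+)(:[0-9]+)?(.*)$')

def split_url(url):
    m = FIRST_PASS.match(url)
    if m is None:
        return None
    proto, host, port, rest = m.groups()
    # Second pass on "the rest", using plain string operations.
    before_fragment, _, fragment = rest.partition('#')
    path, _, query = before_fragment.partition('?')
    directory, slash, filename = path.rpartition('/')
    return {
        'proto': proto,            # includes the trailing ':'
        'host': host,
        'port': port,              # includes the leading ':', or None
        'directory': directory + slash,
        'file': filename,
        'query': query,
        'fragment': fragment,
    }

result = split_url('http://www.example.com:123/foo/bar.html?fox=trot#foo')
print(result)
```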

I realize I'm late to the party, but there is a simple way to let the browser parse a URL for you without a regex:

    var a = document.createElement('a');
    a.href = 'http://www.example.com:123/foo/bar.html?fox=trot#foo';

    ['href','protocol','host','hostname','port','pathname','search','hash'].forEach(function(k) {
        console.log(k + ':', a[k]);
    });

    /* Output:
    href: http://www.example.com:123/foo/bar.html?fox=trot#foo
    protocol: http:
    host: www.example.com:123
    hostname: www.example.com
    port: 123
    pathname: /foo/bar.html
    search: ?fox=trot
    hash: #foo
    */

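I'm a few years late to the party, but I'm surprised no one has mentioned that the Uniform Resource Identifier specification has a section on parsing URIs with a regular expression. The regular expression, written by Berners-Lee et al., is:

    ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
     12            3  4          5       6  7        8 9

The numbers in the second line above are only there to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis). We refer to the value matched for subexpression <n> as $<n>. For example, matching the above expression to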

http://www.ics.uci.edu/pub/ietf/uri/#Related

results in the following subexpression matches:

    $1 = http:
    $2 = http
    $3 = //www.ics.uci.edu
    $4 = www.ics.uci.edu
    $5 = /pub/ietf/uri/
    $6 = <undefined>
    $7 = <undefined>
    $8 = #Related
    $9 = Related
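The match listed above can be reproduced with a few lines of Python. This is only a sketch: `RFC3986_RE` is a name chosen here, and Python reports the unmatched groups ($6 and $7) as `None` rather than `<undefined>`.

```python
import re

# The regular expression quoted above from the URI specification.
RFC3986_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

m = RFC3986_RE.match('http://www.ics.uci.edu/pub/ietf/uri/#Related')
groups = m.groups()

# Print $1..$9 in the same layout as the listing above.
for number, value in enumerate(groups, start=1):
    print('$%d = %s' % (number, value))

scheme, authority, path, query, fragment = (m.group(i) for i in (2, 4, 5, 7, 9))
```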

For what it's worth, I found that I had to escape the forward slashes in JavaScript:

^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?

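I found that the highest-voted answer (hometoast's answer) doesn't work perfectly for me. Two problems:

It cannot handle port numbers.
The hash part is broken.

The following is a modified version: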

 ^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$

The positions of the parts are as follows:

 int SCHEMA = 2, DOMAIN = 3, PORT = 5, PATH = 6, FILE = 8, QUERYSTRING = 9, HASH = 12
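To make the group positions concrete, here is a small Python sketch that applies the modified pattern and reads the groups by those indices. The constant names mirror the listing above; `URL_RE` and the test URL are choices made for this example only.

```python
import re

# The modified pattern from the answer above, unchanged.
URL_RE = re.compile(
    r'^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/\w+)*\/)'
    r'([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$'
)

# Group indices as given in the answer.
SCHEMA, DOMAIN, PORT, PATH, FILE, QUERYSTRING, HASH = 2, 3, 5, 6, 8, 9, 12

m = URL_RE.match('http://www.example.com:123/foo/bar.html?fox=trot#foo')
found = {
    'schema': m.group(SCHEMA),
    'domain': m.group(DOMAIN),
    'port': m.group(PORT),
    'path': m.group(PATH),
    'file': m.group(FILE),
    'querystring': m.group(QUERYSTRING),  # keeps the leading '?'
    'hash': m.group(HASH),                # without the leading '#'
}
print(found)
```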

Edit posted by an anonymous user:

    function getFileName(path) {
        return path.match(/^((http[s]?|ftp):\/)?\/?([^:\/\s]+)(:([^\/]*))?((\/[\w\/-]+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?$/i)[8];
    }

I needed a regex to match all URLs and came up with this one:

 /(?:([^\:]*)\:\/\/)?(?:([^\:\@]*)(?:\:([^\@]*))?\@)?(?:([^\/\:]*)\.(?=[^\.\/\:]*\.[^\.\/\:]*))?([^\.\/\:]*)(?:\.([^\/\.\:]*))?(?:\:([0-9]*))?(\/[^\?#]*(?=.*?\/)\/)?([^\?#]*)?(?:\?([^#]*))?(?:#(.*))?/

It matches all URLs, with any protocol, even URLs like

 ftp://user:pass@www.cs.server.com:8080/dir1/dir2/file.php?param1=value1#hashtag

The result (in JavaScript) looks like this:

 ["ftp", "user", "pass", "www.cs", "server", "com", "8080", "/dir1/dir2/", "file.php", "param1=value1", "hashtag"]

A URL like

 mailto://admin@www.cs.server.com

would look like this:

 ["mailto", "admin", undefined, "www.cs", "server", "com", undefined, undefined, undefined, undefined, undefined]

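This is not a direct answer, but most web libraries have a function that accomplishes this task. The function is often called something similar to CrackUrl. If such a function exists, use it; it is almost guaranteed to be more reliable and more efficient than any hand-crafted code.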
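As one example of such a library function, Python's standard library ships `urllib.parse.urlsplit`. The sketch below only shows which attributes it exposes; the URL is the one used elsewhere in this thread and the `summary` dict is just for display.

```python
from urllib.parse import urlsplit, parse_qs

parts = urlsplit('http://www.example.com:123/foo/bar.html?fox=trot#foo')

summary = {
    'scheme': parts.scheme,
    'netloc': parts.netloc,          # host plus port
    'hostname': parts.hostname,
    'port': parts.port,              # already converted to int
    'path': parts.path,
    'query': parse_qs(parts.query),  # query string decoded into a dict of lists
    'fragment': parts.fragment,
}
print(summary)
```

Splitting the hostname further into subdomain, domain and TLD is not something `urlsplit` attempts; that still needs a suffix list, as discussed further down.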

I tried to solve this in JavaScript, where it should be handled by:

 var url = new URL('http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang');

since (in Chrome, at least) it parses to:

    {
      "hash": "#foobar/bing/bo@ng?bang",
      "search": "?foo=bar&bingobang=&king=kong@kong.com",
      "pathname": "/path/wah@t/foo.js",
      "port": "890",
      "hostname": "example.com",
      "host": "example.com:890",
      "password": "b",
      "username": "a",
      "protocol": "http:",
      "origin": "http://example.com:890",
      "href": "http://a:b@example.com:890/path/wah@t/foo.js?foo=bar&bingobang=&king=kong@kong.com#foobar/bing/bo@ng?bang"
    }

However, this isn't cross-browser (https://developer.mozilla.org/en-US/docs/Web/API/URL), so I cobbled this together to pull out the same parts as above:

 ^(?:(?:(([^:\/#\?]+:)?(?:(?:\/\/)(?:(?:(?:([^:@\/#\?]+)(?:\:([^:@\/#\?]*))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((?:\/?(?:[^\/\?#]+\/+)*)(?:[^\?#]*)))?(\?[^#]+)?)(#.*)?

Credit for this regex goes to https://gist.github.com/rpflorence, who posted this jsperf, http://jsperf.com/url-parsing (originally found here: https://gist.github.com/jlong/2428561#comment-310066), and who came up with the regex this was originally based on.

The parts are in this order:

    var keys = [
        "href",     // http://user:pass@host.com:81/directory/file.ext?query=1#anchor
        "origin",   // http://user:pass@host.com:81
        "protocol", // http:
        "username", // user
        "password", // pass
        "host",     // host.com:81
        "hostname", // host.com
        "port",     // 81
        "pathname", // /directory/file.ext
        "search",   // ?query=1
        "hash"      // #anchor
    ];

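There is also a small library that wraps it and provides query params:

https://github.com/sadams/lite-url (also available on bower)

If you have an improvement, please create a pull request with more tests and I will accept and merge it, with thanks.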

Subdomains and domains are difficult because the subdomain can have several parts, and so can the top-level domain, e.g. http://sub1.sub2.domain.co.uk/

    the path without the file : http://[^/]+/((?:[^/]+/)*)(?:[^/]+$)?
    the file                  : http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$
    the path with the file    : http://[^/]+/(.*)
    the URL without the path  : (http://[^/]+/)
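A quick Python sketch that runs four such patterns against the URL from the question. The dictionary keys and the `extract` helper are names invented here; the first pattern is written with the optional file name outside the capture group so that group 1 really is the directory part. None of the captures include the leading slash.

```python
import re

# One pattern per part; each part of interest is capture group 1.
PATTERNS = {
    'path_without_file': r'http://[^/]+/((?:[^/]+/)*)(?:[^/]+$)?',
    'file': r'http://[^/]+/(?:[^/]+/)*((?:[^/.]+\.)+[^/.]+)$',
    'path_with_file': r'http://[^/]+/(.*)',
    'url_without_path': r'(http://[^/]+/)',
}

def extract(url):
    """Apply every pattern and collect group 1 (None where a pattern fails)."""
    result = {}
    for name, pattern in PATTERNS.items():
        m = re.match(pattern, url)
        result[name] = m.group(1) if m else None
    return result

extracted = extract('http://test.example.com/dir/subdir/file.html')
print(extracted)
```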

(Markdown isn't very friendly to regexes)

This improved version should work as reliably as a parser.

    // Applies to URI, not just URL or URN:
    //    http://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Relationship_to_URL_and_URN
    //
    // http://labs.apache.org/webarch/uri/rfc/rfc3986.html#regexp
    //
    // (?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?
    //
    // http://en.wikipedia.org/wiki/URI_scheme#Generic_syntax
    //
    // $@ matches the entire uri
    // $1 matches scheme (ftp, http, mailto, mshelp, ymsgr, etc)
    // $2 matches authority (host, user:pwd@host, etc)
    // $3 matches path
    // $4 matches query (http GET REST api, etc)
    // $5 matches fragment (html anchor, etc)
    //
    // Match specific schemes, non-optional authority, disallow white-space so can delimit in text, and allow 'www.' w/o scheme
    // Note the schemes must match ^[^\s|:/?#]+(?:\|[^\s|:/?#]+)*$
    //
    // (?:()(www\.[^\s/?#]+\.[^\s/?#]+)|(schemes)://([^\s/?#]*))([^\s?#]*)(?:\?([^\s#]*))?(#(\S*))?
    //
    // Validate the authority with an orthogonal RegExp, so the RegExp above won't fail to match any valid urls.
    function uriRegExp( flags, schemes/* = null*/, noSubMatches/* = false*/ )
    {
        if( !schemes )
            schemes = '[^\\s:\/?#]+'
        else if( !RegExp( /^[^\s|:\/?#]+(?:\|[^\s|:\/?#]+)*$/ ).test( schemes ) )
            throw TypeError( 'expected URI schemes' )
        return noSubMatches ? new RegExp( '(?:www\\.[^\\s/?#]+\\.[^\\s/?#]+|' + schemes + '://[^\\s/?#]*)[^\\s?#]*(?:\\?[^\\s#]*)?(?:#\\S*)?', flags ) :
            new RegExp( '(?:()(www\\.[^\\s/?#]+\\.[^\\s/?#]+)|(' + schemes + ')://([^\\s/?#]*))([^\\s?#]*)(?:\\?([^\\s#]*))?(?:#(\\S*))?', flags )
    }

    // http://en.wikipedia.org/wiki/URI_scheme#Official_IANA-registered_schemes
    function uriSchemesRegExp()
    {
        return 'about|callto|ftp|gtalk|http|https|irc|ircs|javascript|mailto|mshelp|sftp|ssh|steam|tel|view-source|ymsgr'
    }

Try the following:

 ^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]{1}([\w\-]+\.)+([\w]{2,5}))(:[\d]{1,5})?((/?\w+/)+|/?)(\w+\.[\w]{3,4})?((\?\w+=\w+)?(&\w+=\w+)*)?

It supports HTTP/FTP, subdomains, folders, files, etc.

I found it with a quick Google search:

http://geekswithblogs.net/casualjim/archive/2005/12/01/61722.aspx

 /^((?P<scheme>https?|ftp):\/)?\/?((?P<username>.*?)(:(?P<password>.*?)|)@)?(?P<hostname>[^:\/\s]+)(?P<port>:([^\/]*))?(?P<path>(\/\w+)*\/)(?P<filename>[-\w.]+[^#?\s]*)?(?P<query>\?([^#]*))?(?P<fragment>#(.*))?$/

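From my answer to a similar question. It works better than some of the others mentioned because they had some bugs (such as not supporting username/password, not supporting single-character filenames, and broken fragment identifiers).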

Proposing a much more readable solution (in Python, but it applies to any regex):

    import re

    def url_path_to_dict(path):
        # Each line of the pattern handles one optional component.
        pattern = (r'^'
                   r'((?P<schema>.+?)://)?'
                   r'((?P<user>.+?)(:(?P<password>.*?))?@)?'
                   r'(?P<host>.*?)'
                   r'(:(?P<port>\d+?))?'
                   r'(?P<path>/.*?)?'
                   r'(?P<query>[?].*?)?'
                   r'$'
                   )
        regex = re.compile(pattern)
        m = regex.match(path)
        d = m.groupdict() if m is not None else None

        return d

    def main():
        print(url_path_to_dict('http://example.example.com/example/example/example.html'))

    main()

Prints:

    {
    'host': 'example.example.com',
    'user': None,
    'path': '/example/example/example.html',
    'query': None,
    'password': None,
    'port': None,
    'schema': 'http'
    }

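You can get the scheme (http/https), host, port, path and query by using the Uri object in .NET. The only difficult task is breaking the host down into subdomain, domain name and TLD.

There is no standard for doing so, and you can't simply use string parsing or a regex to produce the correct result. At first I was using a regex function, but not every URL could be split into subdomains correctly. The practical way is to use a list of TLDs. Once the TLD of a URL is identified, the label to its left is the domain and whatever remains is the subdomain.

However, the list needs maintenance because new TLDs keep appearing. As far as I know, publicsuffix.org maintains the latest list, and you can use the domainname-parser tools from Google Code to parse the public suffix list and easily get the subdomain, domain and TLD through the DomainName object: domainName.SubDomain, domainName.Domain and domainName.TLD.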
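The suffix-list approach can be sketched in a few lines of Python. Everything here is illustrative: `split_host` is a hypothetical helper, and `SUFFIXES` is a tiny hand-written stand-in for the real list from publicsuffix.org, which is far longer and also has wildcard and exception rules that this sketch ignores.

```python
# Tiny stand-in for the public suffix list (assumption: the real list is loaded elsewhere).
SUFFIXES = {'com', 'org', 'uk', 'co.uk'}

def split_host(host, suffixes=SUFFIXES):
    """Return (subdomain, domain, tld), or None if no known suffix matches."""
    labels = host.lower().split('.')
    # Walk from the longest candidate suffix to the shortest, so 'co.uk' wins over 'uk'.
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                return ('', '', candidate)   # the host is itself a bare suffix
            subdomain = '.'.join(labels[:i - 1])
            domain = labels[i - 1]
            return (subdomain, domain, candidate)
    return None

print(split_host('sub1.sub2.domain.co.uk'))
print(split_host('test.example.com'))
```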

This answer is also helpful: Get the subdomain from a URL

CaLLMeLaNN

I would recommend not using a regex. An API call like WinHttpCrackUrl() is less error-prone.

http://msdn.microsoft.com/en-us/library/aa384092%28VS.85%29.aspx

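Sadly, this doesn't work with some URLs. Take, for example, this one: http://www.example.org/&value=329

Nor does it work without &value=329

Or even with no parameters at all (a simple URL)!

I understand that the regex is expecting some seriously complex/long URL, but it should be able to work on simple ones as well, right?

None of the above worked for me. Here's what I ended up using: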

 /^(?:((?:https?|s?ftp):)\/\/)([^:\/\s]+)(?::(\d*))?(?:\/([^\s?#]+)?([?][^?#]*)?(#.*)?)?/

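I like the regex that was published in "JavaScript: The Good Parts". It's not too short and not too complex. This page on GitHub also has JavaScript code that uses it, but it can be adapted to any language. https://gist.github.com/voodooGQ/4057330

Java offers a URL class that will do this. Query URL Objects.

On a side note, PHP offers parse_url().

Here is one that is complete and doesn't rely on any protocol.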

    function getServerURL(url) {
        var m = url.match("(^(?:(?:.*?)?//)?[^/?#;]*)");
        console.log(m[1]) // Remove this
        return m[1];
    }

    getServerURL("http://dev.test.se")
    getServerURL("http://dev.test.se/")
    getServerURL("//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js")
    getServerURL("//")
    getServerURL("www.dev.test.se/sdas/dsads")
    getServerURL("www.dev.test.se/")
    getServerURL("www.dev.test.se?abc=32")
    getServerURL("www.dev.test.se#abc")
    getServerURL("//dev.test.se?sads")
    getServerURL("http://www.dev.test.se#321")
    getServerURL("http://localhost:8080/sads")
    getServerURL("https://localhost:8080?sdsa")

Prints

    http://dev.test.se
    http://dev.test.se
    //ajax.googleapis.com
    //
    www.dev.test.se
    www.dev.test.se
    www.dev.test.se
    www.dev.test.se
    //dev.test.se
    http://www.dev.test.se
    http://localhost:8080
    https://localhost:8080

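Using http://www.fileformat.info/tool/regex.htm, hometoast's regex works great.

But here is the deal: I want to use different regex patterns in different situations in my program.

For example, I have this URL, and I have an enumeration that lists all the URLs my program supports. Each object in the enumeration has a getRegexPattern method that returns a regex pattern, which is then compared against a URL. If a particular regex pattern matches, I know that the URL is supported by my program. So each enumeration member has its own regex, depending on where it should look inside the URL.

Hometoast's suggestion is great, but in my case I think it wouldn't help (unless I copy-paste the same regex into every enumeration member).

That is why I wanted the answer to give a separate regex for each situation. Although +1 for hometoast. ;)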
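The design described above (one pattern per enumeration member, each exposing a getRegexPattern-style method) might look roughly like this in Python. The member names `HTML_PAGE` and `FTP_ANY`, their patterns, and the `find_supported` helper are all hypothetical; the post does not say which URL kinds the program supports.

```python
import re
from enum import Enum

class SupportedUrl(Enum):
    # Hypothetical members: each value is the regex for one supported kind of URL.
    HTML_PAGE = r'^https?://[^/\s]+/(?:[^/\s]+/)*[^/\s?#]+\.html$'
    FTP_ANY = r'^ftp://[^/\s]+(?:/\S*)?$'

    def get_regex_pattern(self):
        """Return the compiled pattern belonging to this member."""
        return re.compile(self.value)

def find_supported(url):
    """Return the first member whose pattern matches, or None if unsupported."""
    for kind in SupportedUrl:
        if kind.get_regex_pattern().match(url):
            return kind
    return None

print(find_supported('http://test.example.com/dir/subdir/file.html'))
print(find_supported('mailto:someone@example.com'))
```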

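I know you're claiming this is language-agnostic, but can you tell us what you're using, just so we know what regex capabilities you have?

If you have support for non-capturing groups, you can modify hometoast's expression so that the subexpressions you aren't interested in capturing are set up like this: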

(?:SOMESTUFF)

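You'd still have to copy and paste the regex into multiple places, but this makes sense: you're not just checking whether the subexpression exists, but whether it exists as part of a URL. Using the non-capturing modifier for subexpressions can give you what you need and nothing more, which, if I'm reading you correctly, is what you want.

Just as a small note, hometoast's expression doesn't need brackets around the 's' in 'https', since there is only one character in there. Quantifiers apply to the single character (or character class, or subexpression) directly preceding them. So: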

https?

would match 'http' or 'https' just fine.
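Both points (non-capturing groups and `https?`) can be checked with a short Python sketch. The two patterns are simplified stand-ins written for this example, not hometoast's full expression.

```python
import re

url = 'http://test.example.com/dir/subdir/file.html'

# Capturing scheme group, with the redundant brackets around 's'.
capturing = re.compile(r'^(http[s]?|ftp)://([^/:\s]+)')
# Same match, but the scheme group is non-capturing and uses plain 'https?'.
non_capturing = re.compile(r'^(?:https?|ftp)://([^/:\s]+)')

with_scheme = capturing.match(url).groups()
host_only = non_capturing.match(url).groups()

print(with_scheme)  # the scheme occupies group 1, the host group 2
print(host_only)    # only the host is captured
```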

Regex to get the URL path without the file.

    url = 'http://domain/dir1/dir2/somefile'
    url.scan(/^(http:\/\/[^\/]+)((?:\/[^\/]+)+(?=\/))?\/(?:[^\/]+)?$/i).to_s

This can be useful for appending a relative path to the URL.

    String s = "https://www.thomas-bayer.com/axis2/services/BLZService?wsdl";
    String regex = "(^http.?://)(.*?)([/\\?]{1,})(.*)";
    System.out.println("1: " + s.replaceAll(regex, "$1"));
    System.out.println("2: " + s.replaceAll(regex, "$2"));
    System.out.println("3: " + s.replaceAll(regex, "$3"));
    System.out.println("4: " + s.replaceAll(regex, "$4"));

This will produce the following output:

    1: https://
    2: www.thomas-bayer.com
    3: /
    4: axis2/services/BLZService?wsdl

If you change the URL to

    String s = "https://www.thomas-bayer.com?wsdl=qwerwer&ttt=888";

the output will be the following:

    1: https://
    2: www.thomas-bayer.com
    3: ?
    4: wsdl=qwerwer&ttt=888

Enjoy..
Yosi Lev

The regex to do full parsing is quite horrendous. I've included named backreferences for legibility, and broken each part onto a separate line, but it still looks like this:

    ^(?:(?P<protocol>\w+(?=:\/\/))(?::\/\/))?
    (?:(?P<host>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::(?P<port>[0-9]+))?)\/)?
    (?:(?P<path>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?
    (?P<file>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)
    (?:\?(?P<querystring>(?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?
    (?:#(?P<fragment>.*))?$

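The reason it has to be so verbose is that, except for the protocol and the port, any of the parts can contain HTML entities, which makes delineating the fragment quite tricky. So in the last few cases (the host, path, file, querystring and fragment) we allow either any HTML entity or any character that isn't a ? or #. The regex for an HTML entity looks like this: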

 $htmlentity = "&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);"

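When that is extracted (I used mustache syntax to represent it), it becomes a bit more legible:

    ^(?:(?P<protocol>(?:ht|f)tps?|\w+(?=:\/\/))(?::\/\/))?
    (?:(?P<host>(?:{{htmlentity}}|[^\/?#:])+(?::(?P<port>[0-9]+))?)\/)?
    (?:(?P<path>(?:{{htmlentity}}|[^?#])+)\/)?
    (?P<file>(?:{{htmlentity}}|[^?#])+)
    (?:\?(?P<querystring>(?:{{htmlentity}}|[^#])+))?
    (?:#(?P<fragment>.*))?$

In JavaScript, which had no named groups when this was written (ES2018 later added them, with the (?<name>...) syntax rather than (?P<name>...)), the regex becomes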

 ^(?:(\w+(?=:\/\/))(?::\/\/))?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^\/?#:]+)(?::([0-9]+))?)\/)?(?:((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)\/)?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^?#])+)(?:\?((?:(?:&(?:amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|\#[1-9][0-9]{1,3}|[A-Za-z][0-9A-Za-z]+);)|[^#])+))?(?:#(.*))?$

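In each match, the protocol is \1, the host is \2, the port is \3, the path is \4, the file is \5, the querystring is \6, and the fragment is \7.

I tried a few of these and none covered my needs, especially the highest-voted one, which doesn't match a URL without a path (http://example.com/).

The lack of group names also made it unusable for me (or perhaps my jinja2 skills are lacking).

So this is my version, slightly modified from the highest-voted version here: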

 ^((?P<protocol>http[s]?|ftp):\/)?\/?(?P<host>[^:\/\s]+)(?P<path>((\/\w+)*\/)([\w\-\.]+[^#?\s]+))*(.*)?(#[\w\-]+)?$

    //USING REGEX
    /**
     * Parse URL to get information
     *
     * @param   url     the URL string to parse
     * @return  parsed  the URL parsed or null
     */
    var UrlParser = function (url) {
        "use strict";

        var regx = /^(((([^:\/#\?]+:)?(?:(\/\/)((?:(([^:@\/#\?]+)(?:\:([^:@\/#\?]+))?)@)?(([^:\/#\?\]\[]+|\[[^\/\]@#?]+\])(?:\:([0-9]+))?))?)?)?((\/?(?:[^\/\?#]+\/+)*)([^\?#]*)))?(\?[^#]+)?)(#.*)?/,
            matches = regx.exec(url),
            parser = null;

        if (null !== matches) {
            parser = {
                href              : matches[0],
                withoutHash       : matches[1],
                url               : matches[2],
                origin            : matches[3],
                protocol          : matches[4],
                protocolseparator : matches[5],
                credhost          : matches[6],
                cred              : matches[7],
                user              : matches[8],
                pass              : matches[9],
                host              : matches[10],
                hostname          : matches[11],
                port              : matches[12],
                pathname          : matches[13],
                segment1          : matches[14],
                segment2          : matches[15],
                search            : matches[16],
                hash              : matches[17]
            };
        }

        return parser;
    };

    // `url` is assumed to hold the URL string to parse
    var parsedURL = UrlParser(url);
    console.log(parsedURL);
