Get the title and meta tags of an external site

I want to figure out how to fetch the

    <title>A common title</title>
    <meta name="keywords" content="Keywords blabla" />
    <meta name="description" content="This is the description" />

even if they appear in any order. I have heard of the PHP Simple HTML DOM Parser, but I don't really want to use it. Is there a solution that does not rely on the PHP Simple HTML DOM Parser?

Won't preg_match fail if the HTML is invalid?

Can cURL do something like this together with preg_match?

Facebook does something like this, but it is done properly by using:

 <meta property="og:description" content="Description blabla" />

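I want something like this, so that when someone posts a link, the title and meta tags are retrieved. If there are no meta tags, they are simply ignored, or the user can set them manually (but I will handle that part later myself).

This is the way it should be done: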

    function file_get_contents_curl($url)
    {
        $ch = curl_init();

        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

        $data = curl_exec($ch);
        curl_close($ch);

        return $data;
    }

    $html = file_get_contents_curl("http://example.com/");

    // parsing begins here:
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $nodes = $doc->getElementsByTagName('title');

    // get and display what you need:
    // (defaults avoid undefined-variable notices when a tag is missing)
    $title = $nodes->length > 0 ? $nodes->item(0)->nodeValue : '';
    $description = '';
    $keywords = '';

    $metas = $doc->getElementsByTagName('meta');

    for ($i = 0; $i < $metas->length; $i++) {
        $meta = $metas->item($i);
        if ($meta->getAttribute('name') == 'description') {
            $description = $meta->getAttribute('content');
        }
        if ($meta->getAttribute('name') == 'keywords') {
            $keywords = $meta->getAttribute('content');
        }
    }

    echo "Title: $title" . '<br/><br/>';
    echo "Description: $description" . '<br/><br/>';
    echo "Keywords: $keywords";

    <?php
    // Assuming the above tags are at www.example.com
    $tags = get_meta_tags('http://www.example.com/');

    // Notice how the keys are all lowercase now, and
    // how . was replaced by _ in the key.
    echo $tags['author'];       // name
    echo $tags['keywords'];     // php documentation
    echo $tags['description'];  // a php manual
    echo $tags['geo_position']; // 49.33;-86.59
    ?>

PHP's native function: get_meta_tags()

http://php.net/manual/en/function.get-meta-tags.php

get_meta_tags will help you with everything except the title. To get the title, just use a regular expression.

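    $url = 'http://some.url.com';
    $title = null;
    // Only read $matches when the pattern actually matched.
    if (preg_match("/<title>(.+)<\/title>/siU", file_get_contents($url), $matches)) {
        $title = $matches[1];
    }

Hope that helps.

Your best bet is to bite the bullet and use a DOM parser: it is the "right way" to do it. In the long run it will save you more time than it takes to learn how. Parsing HTML with regular expressions is unreliable and intolerant of special cases.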
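
To illustrate the point, here is a minimal sketch (the sample HTML is made up for this example) in which a pattern written for one exact tag layout misses a perfectly valid tag, while DOMDocument still finds it:

```php
<?php
// Hypothetical sample page: the attributes are reordered and single-quoted,
// which is valid HTML but defeats a pattern written for one exact layout.
$html = <<<'HTML'
<html><head>
<TITLE>A common title</TITLE>
<meta content='This is the description' name="description">
</head><body></body></html>
HTML;

// A regex that expects name="..." first and double quotes finds nothing here.
$regexHit = preg_match('/<meta name="description" content="(.*?)"/i', $html, $m);

// DOMDocument copes with tag case, quoting style and attribute order.
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$titleNode = $doc->getElementsByTagName('title')->item(0);
$title = $titleNode ? trim($titleNode->textContent) : null;

$description = null;
foreach ($doc->getElementsByTagName('meta') as $meta) {
    if (strtolower($meta->getAttribute('name')) === 'description') {
        $description = $meta->getAttribute('content');
    }
}

var_dump($regexHit);     // int(0): the regex did not match
echo $title, "\n";       // A common title
echo $description, "\n"; // This is the description
```

The regex could of course be extended to cover this case, but each new variation (extra attributes, line breaks inside the tag, unquoted values) needs another patch, which is the maintenance cost the answer is warning about.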

get_meta_tags did not work for the title.

Only meta tags with a name attribute, such as

    <meta name="description" content="the description">

will be parsed.
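
A quick way to check this behaviour locally, without hitting a real site, is to run get_meta_tags() on a temporary file. The sample markup below is made up for the test:

```php
<?php
// Hypothetical page with one name-based tag and one property-based (Open Graph) tag.
$html = '<html><head><title>A common title</title>'
      . '<meta name="description" content="the description">'
      . '<meta property="og:title" content="OG title">'
      . '</head><body></body></html>';

// get_meta_tags() accepts a local file path as well as a URL.
$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, $html);

$tags = get_meta_tags($file);
unlink($file);

// Only the name-based tag should show up; the <title> and the
// property-only Open Graph tag are not part of the result.
print_r($tags);
```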

We use Apache Tika via PHP (as a command line utility), with -j for JSON output:

http://tika.apache.org/

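    <?php
    // shell_exec() returns the command's output, here Tika's JSON metadata.
    $json = shell_exec('java -jar tika-app-1.4.jar -j http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying');
    ?>

This is sample output for a random Guardian article: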

    {
      "Content-Encoding": "UTF-8",
      "Content-Length": 205599,
      "Content-Type": "text/html; charset\u003dUTF-8",
      "DC.date.issued": "2013-07-21",
      "X-UA-Compatible": "IE\u003dEdge,chrome\u003d1",
      "application-name": "The Guardian",
      "article:author": "http://www.guardian.co.uk/profile/nicholaswatt",
      "article:modified_time": "2013-07-21T22:42:21+01:00",
      "article:published_time": "2013-07-21T22:00:03+01:00",
      "article:section": "Politics",
      "article:tag": [
        "Lynton Crosby", "Health policy", "NHS", "Health", "Healthcare industry",
        "Society", "Public services policy", "Lobbying", "Conservatives",
        "David Cameron", "Politics", "UK news", "Business"
      ],
      "content-id": "/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
      "dc:title": "Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
      "description": "Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
      "fb:app_id": 180444840287,
      "keywords": "Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
      "msapplication-TileColor": "#004983",
      "msapplication-TileImage": "http://static.guim.co.uk/static/a314d63c616d4a06f5ec28ab4fa878a11a692a2a/common/images/favicons/windows_tile_144_b.png",
      "news_keywords": "Lynton Crosby,Health policy,NHS,Health,Healthcare industry,Society,Public services policy,Lobbying,Conservatives,David Cameron,Politics,UK news,Business,Politics",
      "og:description": "Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
      "og:image": "https://static-secure.guim.co.uk/sys-images/Guardian/Pix/pixies/2013/7/21/1374433351329/Lynton-Crosby-008.jpg",
      "og:site_name": "the Guardian",
      "og:title": "Tory strategist Lynton Crosby in new lobbying row",
      "og:type": "article",
      "og:url": "http://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
      "resourceName": "tory-strategist-lynton-crosby-lobbying",
      "title": "Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian",
      "twitter:app:id:googleplay": "com.guardian",
      "twitter:app:id:iphone": 409128287,
      "twitter:app:name:googleplay": "The Guardian",
      "twitter:app:name:iphone": "The Guardian",
      "twitter:app:url:googleplay": "guardian://www.guardian.co.uk/politics/2013/jul/21/tory-strategist-lynton-crosby-lobbying",
      "twitter:card": "summary_large_image",
      "twitter:site": "@guardian"
    }

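Once the JSON is in a PHP string, reading the fields is a matter of json_decode(). The sketch below uses a trimmed-down, hard-coded sample in the shape of the output above rather than a live Tika call; the fallback order (og:title before title) is just one reasonable choice, not something Tika prescribes:

```php
<?php
// Trimmed-down sample in the shape of the Tika output above (not a live response).
$json = '{
  "article:tag": ["Lynton Crosby", "Health policy", "NHS"],
  "description": "Exclusive: Firm he founded, Crosby Textor, advised private healthcare providers how to exploit NHS \u0027failings\u0027",
  "og:title": "Tory strategist Lynton Crosby in new lobbying row",
  "title": "Tory strategist Lynton Crosby in new lobbying row | Politics | The Guardian"
}';

// true => decode objects as associative arrays, so keys like "og:title" are easy to read.
$meta = json_decode($json, true);
if (!is_array($meta)) {
    exit('Could not decode Tika output: ' . json_last_error_msg() . "\n");
}

// Prefer the Open Graph title, fall back to the <title> text.
$title = $meta['og:title'] ?? $meta['title'] ?? null;
$description = $meta['description'] ?? null;
// The (array) cast keeps this working if the value is a single string instead of a list.
$tags = (array) ($meta['article:tag'] ?? []);

echo $title, "\n";
echo $description, "\n";
echo implode(', ', $tags), "\n";
```
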
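Unfortunately, the built-in PHP function get_meta_tags() requires the name attribute, and certain sites, such as Twitter, leave it off in favor of the property attribute. This function, using a mix of regular expressions and DOMDocument, returns a keyed array of meta tags from a web page. It checks for the name attribute first, then the property attribute. This has been tested on Instagram, Pinterest and Twitter.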

    /**
     * Extract metatags from a webpage
     */
    function extract_tags_from_url($url)
    {
        $tags = array();

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);

        $contents = curl_exec($ch);
        curl_close($ch);

        if (empty($contents)) {
            return $tags;
        }

        if (preg_match_all('/<meta([^>]+)content="([^>]+)>/', $contents, $matches)) {
            $doc = new DOMDocument();
            $doc->loadHTML('<?xml encoding="utf-8" ?>' . implode($matches[0]));
            $tags = array();
            foreach ($doc->getElementsByTagName('meta') as $metaTag) {
                if ($metaTag->getAttribute('name') != "") {
                    $tags[$metaTag->getAttribute('name')] = $metaTag->getAttribute('content');
                } elseif ($metaTag->getAttribute('property') != "") {
                    $tags[$metaTag->getAttribute('property')] = $metaTag->getAttribute('content');
                }
            }
        }

        return $tags;
    }
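
A keyed array like the one this function returns could then feed a link preview. The following is only a sketch: pick_first() and build_preview() are hypothetical helpers written for this example, and the order in which keys are tried is an assumption, not a standard:

```php
<?php
/**
 * Return the first non-empty value found under any of the given keys.
 * Hypothetical helper, not part of any library.
 */
function pick_first(array $tags, array $keys, $default = null)
{
    foreach ($keys as $key) {
        if (isset($tags[$key]) && trim($tags[$key]) !== '') {
            return trim($tags[$key]);
        }
    }
    return $default;
}

/**
 * Build link-preview fields from a keyed meta array.
 * Fields with no matching tag stay null, so the caller can ignore them
 * or let the user fill them in.
 */
function build_preview(array $tags)
{
    return array(
        'title'       => pick_first($tags, array('og:title', 'twitter:title')),
        'description' => pick_first($tags, array('description', 'og:description', 'twitter:description')),
        'image'       => pick_first($tags, array('og:image', 'twitter:image')),
    );
}

// Sample input standing in for the result of extract_tags_from_url($url).
$preview = build_preview(array(
    'og:title'       => 'A common title',
    'og:description' => 'Description blabla',
));
print_r($preview);
```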

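Get meta tags from a URL, PHP function example:

    // Named get_page_meta() because PHP already defines a built-in get_meta_tags().
    // load_content() is the author's own helper (not shown here); it is assumed to
    // return an array with "content" and "msg" keys.
    function get_page_meta($url)
    {
        $html = load_content($url, false, "");

        preg_match_all("/<title>(.*?)<\/title>/is", $html["content"], $title);
        preg_match_all("/<meta name=\"description\" content=\"(.*?)\"\s*\/?>/i", $html["content"], $description);
        preg_match_all("/<meta name=\"keywords\" content=\"(.*?)\"\s*\/?>/i", $html["content"], $keywords);

        $res = array();
        $res["content"] = array(
            "title"       => isset($title[1][0]) ? $title[1][0] : null,
            "description" => isset($description[1][0]) ? $description[1][0] : null,
            "keywords"    => isset($keywords[1][0]) ? $keywords[1][0] : null,
        );
        $res["msg"] = $html["msg"];

        return $res;
    }

Example:

    print_r(get_page_meta("bing.com"));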

Get meta tags with PHP

Easy, and built into PHP.

http://php.net/manual/en/function.get-meta-tags.php

    <?php
    // ------------------------------------------------------
    function curl_get_contents($url)
    {
        $timeout = 5;
        $useragent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0';

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);

        $data = curl_exec($ch);
        curl_close($ch);

        return $data;
    }

    // ------------------------------------------------------
    function fetch_meta_tags($url)
    {
        $html = curl_get_contents($url);
        $mdata = array();

        $doc = new DOMDocument();
        // Real-world HTML is rarely valid; suppress the parser warnings.
        @$doc->loadHTML($html);

        // Guard against pages without a <title> element.
        $titlenode = $doc->getElementsByTagName('title');
        $title = $titlenode->length > 0 ? $titlenode->item(0)->nodeValue : '';

        $metanodes = $doc->getElementsByTagName('meta');
        foreach ($metanodes as $node) {
            $key = $node->getAttribute('name');
            $val = $node->getAttribute('content');
            if (!empty($key)) {
                $mdata[$key] = $val;
            }
        }

        $res = array($url, $title, $mdata);

        return $res;
    }
    // ------------------------------------------------------
    ?>

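If you are working with PHP, check out the Pear packages at pear.php.net and see if you find anything useful. I have used the RSS packages effectively, and they save a lot of time, provided you can follow how they implement their code from their examples.

Specifically, take a look at Sax 3 and see if it meets your needs. Sax 3 is no longer updated, but it might be sufficient.

As has already been said, this can handle the problem: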

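    $url = 'http://stackoverflow.com/questions/3711357/get-title-and-meta-tags-of-external-site/4640613';
    $meta = get_meta_tags($url);
    // get_meta_tags() only reads <meta name="..."> tags, so this relies on the
    // page providing a <meta name="title"> tag.
    echo $title = $meta['title']; // php - Get Title and Meta Tags of External site - Stack Overflow

I made this small Composer PHP package based on the top answer: https://github.com/diversen/get-meta-tags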

 composer require diversen/get-meta-tags

Then:

    use diversen\meta;

    $m = new meta();

    // Simple usage, gets title, description, and keywords by default
    $ary = $m->getMeta('https://github.com/diversen/get-meta-tags');
    print_r($ary);

    // With more params
    $ary = $m->getMeta('https://github.com/diversen/get-meta-tags', array('description', 'keywords'), $timeout = 10);
    print_r($ary);

It requires cURL and DOMDocument, like the top answer, and is built along the same lines, but it has an option for setting the cURL timeout (and for getting all kinds of meta tags).

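Nowadays, most sites, such as news or blog sites, add meta tags to their pages to provide information about the site or about a particular article page.

I have created a Meta API which provides the required metadata across formats such as OpenGraph, Schema.Org, etc.

Check it out: https://api.sakiv.com/docs

With the PHP Simple HTML DOM class, a few lines of code are enough to get the page's meta details.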

    $html = file_get_html($link);
    $meta_description = $html->find('head meta[name=description]', 0)->content;
    $meta_keywords = $html->find('head meta[name=keywords]', 0)->content;
