Python code to remove HTML tags from a string

I have text like this:

text = """<div> <h1>Title</h1> <p>A long text........ </p> <a href=""> a link </a> </div>"""

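Using pure Python, with no external modules, I want this:

    >>> print remove_tags(text)
    Title A long text..... a link

I know I can do this with lxml.html.fromstring(text).text_content(), but I need to achieve the same in pure Python, using built-ins or the standard library, for 2.6+.

How can I do that?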

Using a regular expression

With a regular expression you can strip everything inside <>:

    import re

    def cleanhtml(raw_html):
        cleanr = re.compile('<.*?>')
        cleantext = re.sub(cleanr, '', raw_html)
        return cleantext
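The output the question asks for also has its whitespace collapsed, which the regex alone does not do. The following is a minimal sketch of one way to get there, not part of the original answer: the name `strip_and_collapse` is hypothetical, and the `re.DOTALL` flag is an assumption added so that tags spanning several lines are matched too (without it, `.` does not match a newline).

```python
import re

# Non-greedy match of anything between '<' and the next '>'.
# re.DOTALL is an assumption: it lets a tag span several lines.
TAG_RE = re.compile(r'<.*?>', re.DOTALL)

def strip_and_collapse(raw_html):
    """Remove tags, then collapse runs of whitespace to single spaces."""
    without_tags = TAG_RE.sub('', raw_html)
    # str.split() with no argument splits on any whitespace run
    # and drops leading/trailing whitespace.
    return ' '.join(without_tags.split())

text = """<div> <h1>Title</h1> <p>A long text........ </p> <a href=""> a link </a> </div>"""
print(strip_and_collapse(text))
```

The same caveat as for any regex approach applies: a `>` inside an attribute value ends the match early.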

Using BeautifulSoup

You could also use BeautifulSoup to extract all the raw text:

    from bs4 import BeautifulSoup

    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    cleantext = BeautifulSoup(raw_html, "html.parser").text

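However, this still depends on an external library, so I recommend the first solution.

Python has several XML modules built in. For the case where you already have a string containing the complete HTML, the simplest is xml.etree, which works much like the lxml example you mention: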

    import xml.etree.ElementTree

    def remove_tags(text):
        return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
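As a hedged illustration of how the xml.etree approach behaves: it parses the input as XML, so it only works when the markup is well-formed. The sample strings below are made up for the demonstration, and the sketch is written for Python 3 (as far as I know, `itertext()` and `ParseError` need Python 2.7 or later, so this may not cover the 2.6 case from the question).

```python
import xml.etree.ElementTree as ET

def remove_tags(text):
    # Parse as XML and join every text node in document order.
    return ''.join(ET.fromstring(text).itertext())

# Well-formed markup: every tag is closed, one root element.
well_formed = '<div> <h1>Title</h1> <p>Some text</p> </div>'
print(remove_tags(well_formed))

# Typical HTML with an unclosed <br> is not well-formed XML,
# so the parser raises instead of returning text.
try:
    remove_tags('<div>line one<br>line two</div>')
except ET.ParseError as exc:
    print('not well-formed:', exc)
```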

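Note that the regex below isn't perfect: it breaks on something like <a title=">">. Still, it is about the closest you will get in library-free Python without a really complex function: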

    import re

    TAG_RE = re.compile(r'<[^>]+>')

    def remove_tags(text):
        return TAG_RE.sub('', text)
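To make the limitation concrete, here is a small demonstration using the same pattern. The two input strings are made up for illustration. In the second one the regex stops at the `>` inside the attribute value, so part of the tag leaks into the output.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

# Ordinary markup is stripped as expected.
print(remove_tags('<p>plain <b>markup</b></p>'))   # plain markup

# A '>' inside a quoted attribute ends the match too early:
# the regex consumes '<a title=">' and leaves the rest behind.
print(remove_tags('<a title=">">link</a>'))        # ">link
```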

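However, as lvc mentions, xml.etree is available in the Python standard library, so you could probably just adapt it to work like your existing lxml version: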

    import xml.etree.ElementTree

    def remove_tags(text):
        return ''.join(xml.etree.ElementTree.fromstring(text).itertext())
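Because xml.etree expects well-formed XML, HTML with unclosed tags may fail to parse. Another standard-library option, which none of the answers here mention, is the event-based HTML parser. The sketch below is an assumption-laden alternative, not anyone's posted solution: `TextExtractor` is a hypothetical name, and the import is the Python 3 spelling (in Python 2 the module is called `HTMLParser`).

```python
from html.parser import HTMLParser  # Python 3 module name

class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text outside of a tag.
        self.chunks.append(data)

def remove_tags(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)

# Unclosed tags such as <br> do not raise here.
print(remove_tags('<div>line one<br>line two</div>'))
```

Note that this keeps the text exactly as it appears, so adjacent pieces are joined without any added whitespace.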

There is a simple way to do this that works in any C-like language. The style is not Pythonic, but it works in pure Python:

    def remove_html_markup(s):
        tag = False    # True while inside <...>
        quote = False  # True while inside a quoted attribute value
        out = ""

        for c in s:
            if c == '<' and not quote:
                tag = True
            elif c == '>' and not quote:
                tag = False
            elif (c == '"' or c == "'") and tag:
                quote = not quote
            elif not tag:
                out = out + c

        return out

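It is based on the idea of a simple finite-state machine, which is explained in detail here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS: If you are interested in the class (about smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!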
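One thing worth noting about the state machine above: it toggles a single flag on either quote character, so an apostrophe inside a double-quoted attribute value (for example `title="it's"`) can leave it stuck in the quoted state. The following is a hedged sketch of a variant, not the author's code, that remembers which character opened the value. It reuses the name `remove_html_markup` only for easy comparison.

```python
def remove_html_markup(s):
    tag = False    # True while inside <...>
    quote = None   # the character that opened an attribute value, or None
    out = []       # collected characters; joined once at the end

    for c in s:
        if tag:
            if quote:
                # Only the matching quote character closes the value.
                if c == quote:
                    quote = None
            elif c in ('"', "'"):
                quote = c
            elif c == '>':
                tag = False
        elif c == '<':
            tag = True
        else:
            out.append(c)

    return ''.join(out)

print(remove_html_markup('<a title=">">link</a>'))      # link
print(remove_html_markup('<a title="it\'s">x</a>'))     # x
```

Collecting characters in a list and joining once is a small design choice that avoids repeated string concatenation on long inputs.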

    def remove_strings(text, sep=' '):
        # Collect the text that sits outside of <...> and join it with sep.
        pieces = []
        while text:
            start = text.find('<')
            if start == -1:
                # No tag left: keep the remaining text.
                pieces.append(text)
                break
            if start > 0:
                pieces.append(text[:start])
            end = text.find('>', start)
            if end == -1:
                # Unterminated tag: drop the rest.
                break
            text = text[end + 1:]
        return sep.join(pieces)
