Extracting text from an HTML file using Python

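I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in the HTML source to be converted to an apostrophe in the text, just as if I'd pasted the browser content into Notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown, which would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.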

Related questions:

Filter out HTML tags and resolve entities in Python
Convert XML/HTML entities into a Unicode string in Python

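html2text is a Python program that does a pretty good job at this.

Note: NLTK no longer supports the clean_html function.

Original answer below, and an alternative in the comments section.

Use NLTK

I wasted 4-5 hours fixing issues with html2text. Luckily I came across NLTK.
It works magically.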

    import nltk
    from urllib import urlopen

    url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
    html = urlopen(url).read()
    raw = nltk.clean_html(html)
    print(raw)

The best piece of code I found for extracting text without getting JavaScript or other unwanted things:

    from urllib.request import urlopen  # Python 3; Python 2 used urllib.urlopen
    from bs4 import BeautifulSoup

    url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (split on two spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    print(text)

You just have to install BeautifulSoup first:

 pip install beautifulsoup4
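
The whitespace clean-up at the end of that snippet does not depend on BeautifulSoup, so it can be tried on any string. Below is a minimal sketch, assuming the text has already been extracted and that separate headlines on one line are divided by two spaces. The function name `tidy_text` and the sample string are made up for illustration.

```python
def tidy_text(text):
    """Strip each line, split on double spaces, and drop empty chunks."""
    # strip leading and trailing whitespace on each line
    lines = (line.strip() for line in text.splitlines())
    # a run of two spaces is treated as a boundary between separate phrases
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # keep only the non-empty chunks, one per output line
    return "\n".join(chunk for chunk in chunks if chunk)


sample = "  Health  \n\n\n   Some headline  Another headline \n\t\n Body text. \n"
print(tidy_text(sample))
```

Because every step is a generator, the text is only walked once, when `join` consumes the chunks.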

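Found myself facing just the same problem today. I wrote a very simple HTML parser to strip incoming content of all markup, returning the remaining text with only a minimum of formatting.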

    from html.parser import HTMLParser  # Python 3; the module was HTMLParser in Python 2
    from re import sub
    from sys import stderr
    from traceback import print_exc


    class _DeHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.__text = []

        def handle_data(self, data):
            text = data.strip()
            if len(text) > 0:
                # collapse runs of whitespace into a single space
                text = sub('[ \t\r\n]+', ' ', text)
                self.__text.append(text + ' ')

        def handle_starttag(self, tag, attrs):
            if tag == 'p':
                self.__text.append('\n\n')
            elif tag == 'br':
                self.__text.append('\n')

        def handle_startendtag(self, tag, attrs):
            if tag == 'br':
                self.__text.append('\n\n')

        def text(self):
            return ''.join(self.__text).strip()


    def dehtml(text):
        try:
            parser = _DeHTMLParser()
            parser.feed(text)
            parser.close()
            return parser.text()
        except Exception:
            # on any parsing problem, log it and fall back to the raw input
            print_exc(file=stderr)
            return text


    def main():
        text = r'''
            <html>
                <body>
                    <b>Project:</b> DeHTML<br>
                    <b>Description</b>:<br>
                    This small script is intended to allow conversion from HTML markup to
                    plain text.
                </body>
            </html>
        '''
        print(dehtml(text))


    if __name__ == '__main__':
        main()

Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates charrefs (e.g. &#38;) and HTML entities (e.g. &amp;).

It also includes a trivial plain-text-to-HTML inverse converter.

    """
    HTML <-> text conversions.
    """
    # Python 3 module names; Python 2 used HTMLParser, htmlentitydefs and unichr
    from html.parser import HTMLParser
    from html.entities import name2codepoint
    import re


    class _HTMLToText(HTMLParser):
        def __init__(self):
            # convert_charrefs=False so handle_entityref/handle_charref are still called
            HTMLParser.__init__(self, convert_charrefs=False)
            self._buf = []
            self.hide_output = False

        def handle_starttag(self, tag, attrs):
            if tag in ('p', 'br') and not self.hide_output:
                self._buf.append('\n')
            elif tag in ('script', 'style'):
                self.hide_output = True

        def handle_startendtag(self, tag, attrs):
            if tag == 'br':
                self._buf.append('\n')

        def handle_endtag(self, tag):
            if tag == 'p':
                self._buf.append('\n')
            elif tag in ('script', 'style'):
                self.hide_output = False

        def handle_data(self, text):
            if text and not self.hide_output:
                self._buf.append(re.sub(r'\s+', ' ', text))

        def handle_entityref(self, name):
            if name in name2codepoint and not self.hide_output:
                c = chr(name2codepoint[name])
                self._buf.append(c)

        def handle_charref(self, name):
            if not self.hide_output:
                n = int(name[1:], 16) if name.startswith('x') else int(name)
                self._buf.append(chr(n))

        def get_text(self):
            return re.sub(r' +', ' ', ''.join(self._buf))


    def html_to_text(html):
        """
        Given a piece of HTML, return the plain text it contains.
        This handles entities and char refs, but not javascript and stylesheets.
        """
        parser = _HTMLToText()
        # HTMLParseError was removed in Python 3.5, so there is nothing to catch here
        parser.feed(html)
        parser.close()
        return parser.get_text()


    def text_to_html(text):
        """
        Convert the given text to html, wrapping what looks like URLs with <a> tags,
        converting newlines to <br> tags and converting confusing chars into html
        entities.
        """
        def f(mo):
            t = mo.group()
            if len(t) == 1:
                return {'&':'&amp;', "'":'&#39;', '"':'&quot;', '<':'&lt;', '>':'&gt;'}.get(t)
            return '<a href="%s">%s</a>' % (t, t)
        return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)

You can also use the html2text method from the stripogram library.

    from stripogram import html2text
    text = html2text(your_html_string)

To install stripogram, run sudo easy_install stripogram

There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide which tags to keep:

    from pattern.web import URL, plaintext

    s = URL('http://www.clips.ua.ac.be').download()
    s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
    print(s)

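PyParsing does a great job. Paul McGuire has several scripts on the pyparsing wiki that are easy to adapt for various uses (http://pyparsing.wikispaces.com/Examples). One reason for investing a little time in pyparsing is that he has also written a very brief, well-organized O'Reilly Short Cut manual that is also inexpensive.

Having said that, I use BeautifulSoup a lot, and it is not that hard to deal with the entities issue; you can convert them before you run BeautifulSoup.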
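
For the entity conversion itself, the Python 3 standard library may be enough. Here is a small sketch, assuming Python 3.4 or later, where `html.unescape` is available. The sample string is invented for illustration.

```python
import html

# named and numeric references mixed in one string
raw = "Tom&#39;s caf&eacute; &amp; bar&nbsp;(open)"

# html.unescape replaces both kinds of reference with the matching character
decoded = html.unescape(raw)
print(decoded)
```

Note that `&nbsp;` becomes a non-breaking space (U+00A0), not an ordinary space. Decoding before parsing would also turn `&lt;` into a literal `<`, so it is probably safer to apply this to text that has already been extracted.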

Good luck

http://pypi.python.org/pypi/webstemmer/0.5.0

http://atropine.sourceforge.net/documentation.html

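Alternatively, I think you can drive l... from Python; search for that.

This isn't exactly a Python solution, but it will convert the text that JavaScript would generate into text, which I think is important (e.g. google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like: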

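    import subprocess
    import tempfile

    # html_source holds the HTML as a str; write it to a temporary file for links
    with tempfile.NamedTemporaryFile('w', suffix='.html') as f:
        f.write(html_source)
        f.flush()
        proc = subprocess.Popen(['links', '-dump', f.name], stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL)
        text = proc.stdout.read()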

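Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in terms of getting rid of JavaScript and CSS. You could make a derived class, adding methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console: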

    # Python 2 only: htmllib was removed in Python 3
    from htmllib import HTMLParser, HTMLParseError
    from formatter import AbstractFormatter, DumbWriter

    p = HTMLParser(AbstractFormatter(DumbWriter()))
    try:
        p.feed('hello<br>there')
        p.close()  # calling close is not usually needed, but let's play it safe
    except HTMLParseError:
        print ':('  # the html is badly malformed (or you found a bug)

In Python 3.x you can do it in a very easy way by importing the 'imaplib' and 'email' packages. Although this is an older post, maybe my answer can help newcomers to this post.

    import email

    # self.imap (an imaplib connection) and num (a message number) are assumed
    # to be set up elsewhere
    status, data = self.imap.fetch(num, '(RFC822)')
    email_msg = email.message_from_bytes(data[0][1])  # email.message_from_string(data[0][1])

    # If message is multi part we only want the text version of the body,
    # this walks the message and gets the body.
    if email_msg.is_multipart():
        for part in email_msg.walk():
            if part.get_content_type() == "text/plain":
                # to control automatic email-style MIME decoding
                # (eg, Base64, uuencode, quoted-printable)
                body = part.get_payload(decode=True)
                body = body.decode()
            elif part.get_content_type() == "text/html":
                continue

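Now you can print the body variable and it will be in plaintext format :) If it is good enough for you, it would be nice to select it as the accepted answer.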
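
The walk over the message parts can be tried without an IMAP server. This is a minimal sketch, assuming Python 3.6 or later. The message built here is a made-up stand-in for the bytes that imaplib would return.

```python
import email
from email.message import EmailMessage

# Build a small multipart/alternative message as a stand-in for fetched mail
msg = EmailMessage()
msg["Subject"] = "Demo"
msg.set_content("Hello in plain text.\n")
msg.add_alternative(
    "<html><body><p>Hello in <b>HTML</b>.</p></body></html>", subtype="html"
)

# Round-trip through bytes, which is the form a fetch would hand back
email_msg = email.message_from_bytes(msg.as_bytes())

body = None
if email_msg.is_multipart():
    for part in email_msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes the transfer encoding and returns bytes
            body = part.get_payload(decode=True).decode()
            break
else:
    body = email_msg.get_payload(decode=True).decode()

print(body)
```

This only picks up a text/plain part that the sender included. It does not convert the HTML part, so a message that carries HTML alone would leave `body` as `None` here.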

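Beautiful Soup does convert HTML entities. It's probably your best bet, considering that HTML is often buggy and filled with Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text: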

    # BeautifulSoup 3 API (Python 2)
    import copy
    import re
    import BeautifulSoup

    def getsoup(data, to_unicode=False):
        data = data.replace("&nbsp;", " ")
        # Fixes for bad markup I've seen in the wild.  Remove if not applicable.
        masssage_bad_comments = [
            (re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1)),
            (re.compile('<!WWWAnswer T[=\w\d\s]*>'), lambda match: '<!--' + match.group(0) + '-->'),
        ]
        myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
        myNewMassage.extend(masssage_bad_comments)
        return BeautifulSoup.BeautifulSoup(
            data, markupMassage=myNewMassage,
            convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES if to_unicode else None)

    remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

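I recommend a Python package called goose-extractor. Goose will try to extract the following information:

Main text of an article
Main image of the article
Any YouTube/Vimeo movies embedded in the article
Meta description
Meta tags

More: https://pypi.python.org/pypi/goose-extractor/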

I know there are a lot of answers already, but this is the most effective and Pythonic solution I have found.

    from bs4 import BeautifulSoup

    # naming the parser explicitly avoids bs4's "no parser specified" warning
    text = ''.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))

Install html2text using

    pip install html2text

then,

    >>> import html2text
    >>>
    >>> h = html2text.HTML2Text()
    >>> # Ignore converting links from HTML
    >>> h.ignore_links = True
    >>> print(h.handle("<p>Hello, <a href='http://earth.google.com/'>world</a>!"))
    Hello, world!

Another option is to run the HTML through a text-based web browser and dump it. For example (using Lynx):

 lynx -dump html_to_convert.html > converted_html.txt

This can be done within a Python script as follows:

    import subprocess

    with open('converted_html.txt', 'w') as outputFile:
        subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.
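
The same call can also hand the text back as a string instead of writing a file. This is a rough sketch, assuming Python 3.5 or later. The helper name `dump_with_text_browser` is made up, and `-dump` is the only browser option it relies on.

```python
import shutil
import subprocess


def dump_with_text_browser(html_path, browser="lynx"):
    """Return the browser's -dump output for html_path, or None if unavailable."""
    # shutil.which returns None when the executable is not on PATH
    if shutil.which(browser) is None:
        return None
    result = subprocess.run(
        [browser, "-dump", html_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,  # decode the captured output to str
    )
    return result.stdout


text = dump_with_text_browser("html_to_convert.html")
if text is None:
    print("lynx is not installed")
```

Checking for the executable first keeps the script from raising `FileNotFoundError` on machines without a text browser.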

Another non-Python solution: LibreOffice:

 soffice --headless --invisible --convert-to txt input1.html

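The reason I prefer this one over the other alternatives is that every HTML paragraph gets converted into a single line of text (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, LibreOffice can be used to convert from all sorts of formats…

If you need more speed and accuracy, you could use raw lxml.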

    import lxml.html as lh
    from lxml.html.clean import clean_html

    def lxml_to_text(html):
        doc = lh.fromstring(html)
        doc = clean_html(doc)
        return doc.text_content()

In a simple way

    import re

    html_text = open('html_file.html').read()
    text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds every part of html_text that starts with '<' and ends with '>' and replaces each one with an empty string.
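
To see what this pattern does and does not handle, it can be run on a small inline document. This sketch uses the same regular expression on an invented sample that contains a script and an entity.

```python
import re

# Invented sample: a script containing "<" and a numeric character reference
html_text = (
    "<html><head><script>var x = 1 < 2;</script></head>"
    "<body><p>Tom&#39;s page</p></body></html>"
)

# Same substitution as above: drop everything between "<" and the next ">"
text_filtered = re.sub(r'<(.*?)>', '', html_text)
print(text_filtered)
```

The tags are gone, but part of the script source survives and `&#39;` is left undecoded, which are the two problems raised in the question.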

I'm achieving it with something like this.

    >>> import requests
    >>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
    >>> res = requests.get(url)
    >>> text = res.text

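@PeYoTIL's answer using BeautifulSoup and eliminating style and script content didn't work for me. I tried using decompose instead of extract, but it still didn't work. So I created my own, which also formats the text using the <p> tags and replaces <a> tags with the href link. It also copes with links inside text. Available at this gist with a test doc embedded.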

    from bs4 import BeautifulSoup, NavigableString

    def html_to_text(html):
        "Creates a formatted text email message as a string from a rendered html template (page)"
        soup = BeautifulSoup(html, 'html.parser')
        # Ignore anything in head
        body, text = soup.body, []
        for element in body.descendants:
            # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
            if type(element) == NavigableString:
                # We use the assumption that other tags can't be inside a script or style
                if element.parent.name in ('script', 'style'):
                    continue

                # remove any multiple and leading/trailing whitespace
                string = ' '.join(element.string.split())
                if string:
                    if element.parent.name == 'a':
                        a_tag = element.parent
                        # replace link text with the link
                        string = a_tag['href']
                        # concatenate with any non-empty immediately previous string
                        if (type(a_tag.previous_sibling) == NavigableString and
                                a_tag.previous_sibling.string.strip()):
                            text[-1] = text[-1] + ' ' + string
                            continue
                    elif element.previous_sibling and element.previous_sibling.name == 'a':
                        text[-1] = text[-1] + ' ' + string
                        continue
                    elif element.parent.name == 'p':
                        # Add extra paragraph formatting newline
                        string = '\n' + string
                    text += [string]
        doc = '\n'.join(text)
        return doc

Has anyone tried bleach.clean(html,tags=[],strip=True) from bleach? It works for me.

Here's the code I use on a regular basis.

    from bs4 import BeautifulSoup
    import urllib.request

    def processText(webpage):

        # EMPTY LIST TO STORE PROCESSED TEXT
        proc_text = []

        try:
            news_open = urllib.request.urlopen(webpage.group())
            news_soup = BeautifulSoup(news_open, "lxml")
            news_para = news_soup.find_all("p", text = True)

            for item in news_para:
                # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
                para_text = (' ').join((item.text).split())

                # COMBINE LINES/PARAGRAPHS INTO A LIST
                proc_text.append(para_text)

        except urllib.error.HTTPError:
            pass

        return proc_text

I hope that helps.
