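Extracting text from a PDF file using PDFMiner in Python?

Python version 2.7

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python.

It looks like PDFMiner updated its API, and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make extracting text from a PDF file easier use the old PDFMiner syntax, so I am not sure how to do this.

As it is, I am just reading the source code to see if I can figure it out.

Here is a working example of extracting text from a PDF file using the current version of PDFMiner (September 2016):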

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from cStringIO import StringIO

    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()  # in-memory buffer that receives the extracted text
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = open(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0     # 0 means no page limit
        caching = True
        pagenos = set()  # empty set means all pages

        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()

        fp.close()
        device.close()
        retstr.close()
        return text

PDFMiner's structure changed recently, so this should work for extracting text from PDF files.

Edit: still working as of February 1, 2017.
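The `pagenos` and `maxpages` values in the example above control which pages get processed. As far as I can tell from reading the PDFMiner source, `pagenos` holds zero-based page indices, an empty set means every page, and `maxpages = 0` means no limit; treat that as an assumption and check it against your installed version. If you want to accept a human-friendly page spec such as "1-3,5", a small helper along these lines could build the set. The name `parse_page_ranges` is hypothetical and is not part of PDFMiner:

```python
def parse_page_ranges(spec):
    """Turn a 1-based page spec such as "1-3,5" into a set of 0-based indices.

    Hypothetical helper, not part of PDFMiner. An empty spec returns an
    empty set, which the example above treats as "all pages".
    """
    pagenos = set()
    for part in spec.split(','):
        part = part.strip()
        if not part:
            continue  # tolerate empty pieces such as a trailing comma
        if '-' in part:
            first, last = part.split('-', 1)
            first, last = int(first), int(last)
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError('invalid page range: %r' % part)
        # Shift from 1-based page numbers to 0-based indices.
        pagenos.update(range(first - 1, last))
    return pagenos
```

You could then replace `pagenos = set()` in the function with `pagenos = parse_page_ranges("1-3,5")`, or pass the spec in as an extra argument.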

Excellent answer from DuckPuncher. For Python 3, make sure you have pdfminer2 installed and use:

    import io

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage

    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()
        retstr = io.StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = open(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0
        caching = True
        pagenos = set()

        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()

        fp.close()
        device.close()
        retstr.close()
        return text
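If you need the text page by page rather than as one string, one option is to split the result afterwards. This sketch assumes that `TextConverter` writes a form feed character (`\f`) after each page, which is what I see in the converter source but which you should verify for your version. `split_pages` is a hypothetical helper name, not a PDFMiner function:

```python
def split_pages(text):
    """Split extracted text into a list of per-page strings.

    Assumes each page ends with a form feed ('\f'); this is an assumption
    about TextConverter's output, not a documented guarantee.
    """
    pages = text.split('\f')
    # A trailing form feed leaves one empty string at the end; drop it.
    if pages and pages[-1] == '':
        pages.pop()
    return pages
```

For example, `split_pages(convert_pdf_to_txt(path))` would give one list entry per page under that assumption.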
