How can I highlight text or words in a PDF file using iTextSharp?

I need to search for a word in an existing PDF file, highlight that text or word, and then save the PDF file.

I have an idea: using PdfAnnotation.CreateMarkup, we could find the position of the text and add a background color to it... but I don't know how to implement it :(

Please help me.

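This is one of those "sounds easy but is actually really complicated" things. See Mark's posts here and here. Ultimately you'll probably be pointed to LocationTextExtractionStrategy. Good luck! If you do find out how to do it, post it here; several people are wondering exactly what you are wondering!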

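I found out how to do this. In case someone needs to get words or sentences together with their locations (coordinates) from a PDF document, you will find an example project here. I used VB.NET 2010. Remember to add a reference to the iTextSharp DLL in this project.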

I added my own text extraction strategy class, based on the class LocationTextExtractionStrategy. I focused on the TextChunks, because they already have these coordinates.
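
For readers who cannot open the linked project, here is a minimal C# sketch of the general idea, not the project's actual class. It assumes iTextSharp 5.x; the names ChunkRectStrategy and ChunkBox are made up for illustration. It subclasses LocationTextExtractionStrategy and records a bounding rectangle for every text chunk as it is rendered:

```csharp
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.text.pdf.parser;

// Hypothetical helper type: one rendered chunk plus its bounding box.
public class ChunkBox
{
    public string Text;
    public Rectangle Rect;
}

// Hypothetical strategy name; NOT the class from the linked VB.NET project.
public class ChunkRectStrategy : LocationTextExtractionStrategy
{
    public readonly List<ChunkBox> Chunks = new List<ChunkBox>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        // Let the base class keep collecting text as usual.
        base.RenderText(renderInfo);

        // The descent line starts at the bottom-left of the chunk,
        // the ascent line ends at its top-right (unrotated text assumed).
        Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        Vector topRight = renderInfo.GetAscentLine().GetEndPoint();

        Rectangle rect = new Rectangle(
            bottomLeft[Vector.I1], bottomLeft[Vector.I2],
            topRight[Vector.I1], topRight[Vector.I2]);

        Chunks.Add(new ChunkBox { Text = renderInfo.GetText(), Rect = rect });
    }
}
```

The rectangles only get filled once the page is actually parsed, for example with PdfTextExtractor.GetTextFromPage(reader, page, strategy). A chunk is not necessarily a whole word, so matching a search term still needs extra work on top of this.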

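There are some known limitations:

  • No multiline search (phrases); only characters, words, or single-line sentences are allowed.
  • It won't work with rotated text.
  • I didn't test it on PDFs with landscape page orientation, but I assume some modifications may be needed for that.
  • In case you need to draw the highlight/rectangles over a watermark, you will need to add/modify some code, but only in the Form; this is not related to the text/location extraction process.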
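
Since the question asked about PdfAnnotation.CreateMarkup, here is a hedged C# sketch of that route for rectangles that have already been found. It assumes iTextSharp 5.x; HighlightSketch and AddHighlights are hypothetical names, and the quad point order shown is the one commonly reported to render correctly in Adobe Reader rather than something verified against every viewer:

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

// Hypothetical helper; not part of iTextSharp or of the linked project.
public static class HighlightSketch
{
    public static void AddHighlights(string sourceFile, string destinationFile,
                                     int page, IEnumerable<Rectangle> rects)
    {
        PdfReader reader = new PdfReader(sourceFile);
        PdfStamper stamper = new PdfStamper(
            reader, new FileStream(destinationFile, FileMode.Create));

        foreach (Rectangle rect in rects)
        {
            // Four corner points of the highlighted area
            // (bottom-left, bottom-right, top-left, top-right).
            float[] quad =
            {
                rect.Left, rect.Bottom,
                rect.Right, rect.Bottom,
                rect.Left, rect.Top,
                rect.Right, rect.Top
            };

            PdfAnnotation highlight = PdfAnnotation.CreateMarkup(
                stamper.Writer, rect, string.Empty,
                PdfAnnotation.MARKUP_HIGHLIGHT, quad);
            highlight.Color = BaseColor.YELLOW;

            stamper.AddAnnotation(highlight, page);
        }

        // Close the stamper before the reader.
        stamper.Close();
        reader.Close();
    }
}
```

Unlike a filled rectangle drawn into the page content, a markup annotation stays a separate object that a viewer can select or delete, which may or may not be what you want.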

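@Jcis, I actually managed a solution for handling multiple searches using your example as a starting point. I use your project as a reference in a C# project and altered what it does. Instead of just highlighting, I draw a white rectangle around the search term and then use the rectangle coordinates to place a form field. I also had to switch the content writing mode to GetOverContent so that the searched text is completely covered. What I actually did was create a string array of search terms, and then, using a for loop, I create as many different text fields as I need.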

    Test.Form1 formBuilder = new Test.Form1();

    string[] fields = new string[] {
        "%AccountNumber%", "%MeterNumber%", "%EmailFieldHolder%",
        "%AddressFieldHolder%", "%EmptyFieldHolder%", "%CityStateZipFieldHolder%",
        "%emptyFieldHolder1%", "%emptyFieldHolder2%", "%emptyFieldHolder3%",
        "%emptyFieldHolder4%", "%emptyFieldHolder5%", "%emptyFieldHolder6%",
        "%emptyFieldHolder7%", "%emptyFieldHolder8%", "%SiteNameFieldHolder%",
        "%SiteNameFieldHolderWithExtraSpace%"
    };

    //int a = 0;
    for (int a = 0; a < fields.Length; )
    {
        // First pass: read htmlToPdf, write finalhtmlToPdf.
        string[] fieldNames = fields[a].Split('%');
        string[] fieldName = Regex.Split(fieldNames[1], "Field");
        formBuilder.PDFTextGetter(fields[a], StringComparison.CurrentCultureIgnoreCase, htmlToPdf, finalhtmlToPdf, fieldName[0]);
        File.Delete(htmlToPdf);
        System.Array.Clear(fieldNames, 0, 2);
        System.Array.Clear(fieldName, 0, 1);
        a++;
        if (a == fields.Length)
        {
            break;
        }

        // Second pass: read finalhtmlToPdf, write htmlToPdf.
        string[] fieldNames1 = fields[a].Split('%');
        string[] fieldName1 = Regex.Split(fieldNames1[1], "Field");
        formBuilder.PDFTextGetter(fields[a], StringComparison.CurrentCultureIgnoreCase, finalhtmlToPdf, htmlToPdf, fieldName1[0]);
        File.Delete(finalhtmlToPdf);
        System.Array.Clear(fieldNames1, 0, 2);
        System.Array.Clear(fieldName1, 0, 1);
        a++;
    }

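It bounces the PDFTextGetter function back and forth between two files until the finished product is reached. It works really well, and it would not have been possible without your initial project, so thank you. I also altered your VB to do the text field mapping like so: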

    For Each rect As iTextSharp.text.Rectangle In MatchesFound
        cb.Rectangle(rect.Left, rect.Bottom + 1, rect.Width, rect.Height + 4)
        Dim field As New TextField(stamper.Writer, rect, FieldName & Fields)
        Dim form = stamper.AcroFields
        Dim fieldKeys = form.Fields.Keys
        stamper.AddAnnotation(field.GetTextField(), page)
        Fields += 1
    Next

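Just thought I would share what I managed to do with your project as the backbone. It even increments the field names as I need them to. I also had to add a new parameter to your function, but that isn't worth listing here. Thank you again for this great head start.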
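
The back-and-forth between the two files can also be written as a single loop that swaps source and destination after every placeholder. The following is only a sketch under assumptions: PlaceholderLoopSketch, FieldNameFor and ProcessAll are hypothetical names, and the processOne delegate stands in for the modified PDFTextGetter (search term, source file, destination file, field name):

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper; processOne stands in for the modified PDFTextGetter.
public static class PlaceholderLoopSketch
{
    // "%EmailFieldHolder%" -> "Email", "%AccountNumber%" -> "AccountNumber"
    public static string FieldNameFor(string placeholder)
    {
        string inner = placeholder.Trim('%');
        return Regex.Split(inner, "Field")[0];
    }

    // Runs processOne once per placeholder, alternating between the two files.
    // Returns the path of the file that holds the final result.
    public static string ProcessAll(string[] placeholders, string fileA, string fileB,
                                    Action<string, string, string, string> processOne)
    {
        string source = fileA;
        string destination = fileB;

        foreach (string placeholder in placeholders)
        {
            processOne(placeholder, source, destination, FieldNameFor(placeholder));

            // Like the loop in this answer, the consumed input is deleted,
            // including the original file on the first pass.
            File.Delete(source);

            // The file just written becomes the input of the next pass.
            string previousSource = source;
            source = destination;
            destination = previousSource;
        }

        return source;
    }
}
```

A call could look like ProcessAll(fields, htmlToPdf, finalhtmlToPdf, (term, src, dst, name) => formBuilder.PDFTextGetter(term, StringComparison.CurrentCultureIgnoreCase, src, dst, name)). The return value says which of the two paths ended up holding the finished PDF, which depends on whether the number of placeholders is odd or even.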

Thanks, Jcis!

After a few hours of research and thinking, I found your solution, which helped me solve my problem.

There were 2 small bugs.

First: the stamper needs to be closed before the reader, otherwise an exception is thrown.

    Public Sub PDFTextGetter(ByVal pSearch As String, ByVal SC As StringComparison, ByVal SourceFile As String, ByVal DestinationFile As String)
        Dim stamper As iTextSharp.text.pdf.PdfStamper = Nothing
        Dim cb As iTextSharp.text.pdf.PdfContentByte = Nothing

        Me.Cursor = Cursors.WaitCursor
        If File.Exists(SourceFile) Then
            Dim pReader As New PdfReader(SourceFile)

            stamper = New iTextSharp.text.pdf.PdfStamper(pReader, New System.IO.FileStream(DestinationFile, FileMode.Create))
            PB.Value = 0 : PB.Maximum = pReader.NumberOfPages
            For page As Integer = 1 To pReader.NumberOfPages
                Dim strategy As myLocationTextExtractionStrategy = New myLocationTextExtractionStrategy

                'cb = stamper.GetUnderContent(page)
                cb = stamper.GetOverContent(page)
                Dim state As New PdfGState()
                state.FillOpacity = 0.3F
                cb.SetGState(state)

                'Send some data contained in PdfContentByte, looks like the first is always zero for me and the second 100,
                'but I'm not sure if this could change in some cases
                strategy.UndercontentCharacterSpacing = cb.CharacterSpacing
                strategy.UndercontentHorizontalScaling = cb.HorizontalScaling

                'It's not really needed to get the text back, but we have to call this line ALWAYS,
                'because it triggers the process that will get all chunks from PDF into our strategy Object
                Dim currentText As String = PdfTextExtractor.GetTextFromPage(pReader, page, strategy)

                'The real getter process starts in the following line
                Dim MatchesFound As List(Of iTextSharp.text.Rectangle) = strategy.GetTextLocations(pSearch, SC)

                'Set the fill color of the shapes, I don't use a border because it would make the rect bigger
                'but maybe using a thin border could be a solution if you see the current rect is not big enough to cover all the text it should cover
                cb.SetColorFill(BaseColor.PINK)

                'MatchesFound contains all text with locations, so do whatever you want with it.
                'Each match is filled here; the fill color is overridden to YELLOW inside the loop:
                For Each rect As iTextSharp.text.Rectangle In MatchesFound
                    ' cb.Rectangle(rect.Left, rect.Bottom, rect.Width, rect.Height)
                    cb.SaveState()
                    cb.SetColorFill(BaseColor.YELLOW)
                    cb.Rectangle(rect.Left, rect.Bottom, rect.Width, rect.Height)
                    cb.Fill()
                    cb.RestoreState()
                Next
                'cb.Fill()

                PB.Value = PB.Value + 1
            Next
            'Close the stamper first, then the reader
            stamper.Close()
            pReader.Close()
        End If
        Me.Cursor = Cursors.Default
    End Sub

Second: your solution doesn't work when the searched text is in the last line of the extracted text.

    Public Function GetTextLocations(ByVal pSearchString As String, ByVal pStrComp As System.StringComparison) As List(Of iTextSharp.text.Rectangle)
        Dim FoundMatches As New List(Of iTextSharp.text.Rectangle)
        Dim sb As New StringBuilder()
        Dim ThisLineChunks As List(Of TextChunk) = New List(Of TextChunk)
        Dim bStart As Boolean, bEnd As Boolean
        Dim FirstChunk As TextChunk = Nothing, LastChunk As TextChunk = Nothing
        Dim sTextInUsedChunks As String = vbNullString

        ' For Each chunk As TextChunk In locationalResult
        For j As Integer = 0 To locationalResult.Count - 1
            Dim chunk As TextChunk = locationalResult(j)

            If chunk.text.Contains(pSearchString) Then
                Thread.Sleep(1)
            End If

            If ThisLineChunks.Count > 0 AndAlso (Not chunk.SameLine(ThisLineChunks.Last) Or j = locationalResult.Count - 1) Then
                If sb.ToString.IndexOf(pSearchString, pStrComp) > -1 Then
                    Dim sLine As String = sb.ToString

                    'Check how many times the Search String is present in this line:
                    Dim iCount As Integer = 0
                    Dim lPos As Integer
                    lPos = sLine.IndexOf(pSearchString, 0, pStrComp)
                    Do While lPos > -1
                        iCount += 1
                        If lPos + pSearchString.Length > sLine.Length Then Exit Do Else lPos = lPos + pSearchString.Length
                        lPos = sLine.IndexOf(pSearchString, lPos, pStrComp)
                    Loop

                    'Process each match found in this Text line:
                    Dim curPos As Integer = 0
                    For i As Integer = 1 To iCount
                        Dim sCurrentText As String, iFromChar As Integer, iToChar As Integer

                        iFromChar = sLine.IndexOf(pSearchString, curPos, pStrComp)
                        curPos = iFromChar
                        iToChar = iFromChar + pSearchString.Length - 1
                        sCurrentText = vbNullString
                        sTextInUsedChunks = vbNullString
                        FirstChunk = Nothing
                        LastChunk = Nothing

                        'Get first and last Chunks corresponding to this match found, from all Chunks in this line
                        For Each chk As TextChunk In ThisLineChunks
                            sCurrentText = sCurrentText & chk.text

                            'Check if we entered the part where we had found a matching String then get this Chunk (First Chunk)
                            If Not bStart AndAlso sCurrentText.Length - 1 >= iFromChar Then
                                FirstChunk = chk
                                bStart = True
                            End If

                            'Keep getting Text from Chunks while we are in the part where the matching String had been found
                            If bStart And Not bEnd Then
                                sTextInUsedChunks = sTextInUsedChunks & chk.text
                            End If

                            'If we get out the matching String part then get this Chunk (last Chunk)
                            If Not bEnd AndAlso sCurrentText.Length - 1 >= iToChar Then
                                LastChunk = chk
                                bEnd = True
                            End If

                            'If we already have first and last Chunks enclosing the Text where our String pSearchString has been found
                            'then it's time to get the rectangle, GetRectangleFromText Function below this Function, there we extract the pSearchString locations
                            If bStart And bEnd Then
                                FoundMatches.Add(GetRectangleFromText(FirstChunk, LastChunk, pSearchString, sTextInUsedChunks, iFromChar, iToChar, pStrComp))
                                curPos = curPos + pSearchString.Length
                                bStart = False : bEnd = False
                                Exit For
                            End If
                        Next
                    Next
                End If
                sb.Clear()
                ThisLineChunks.Clear()
            End If
            ThisLineChunks.Add(chunk)
            sb.Append(chunk.text)
        Next

        Return FoundMatches
    End Function
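
One thing to watch in the version above: on the last index the line is flushed before the final chunk is appended, so, as far as I can tell, a match that sits in (or ends in) that very last chunk could still be skipped. A common alternative is to flush one more time after the loop. Here is a self-contained C# sketch of just that grouping pattern. SimpleChunk and GroupIntoLines are hypothetical stand-ins, and comparing baseline Y values is a simplification of what iTextSharp's TextChunk.SameLine does:

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for iTextSharp's TextChunk: just text plus a baseline.
public sealed class SimpleChunk
{
    public readonly string Text;
    public readonly float BaselineY;

    public SimpleChunk(string text, float baselineY)
    {
        Text = text;
        BaselineY = baselineY;
    }
}

public static class LineGroupingSketch
{
    // Concatenates consecutive chunks that share a baseline into one line each.
    public static List<string> GroupIntoLines(IEnumerable<SimpleChunk> chunks)
    {
        List<string> lines = new List<string>();
        StringBuilder current = new StringBuilder();
        SimpleChunk previous = null;

        foreach (SimpleChunk chunk in chunks)
        {
            // A change of baseline means the previous line is complete.
            if (previous != null && chunk.BaselineY != previous.BaselineY)
            {
                lines.Add(current.ToString());
                current.Length = 0;
            }

            current.Append(chunk.Text);
            previous = chunk;
        }

        // Final flush: without this the last line would never be emitted.
        if (previous != null)
        {
            lines.Add(current.ToString());
        }

        return lines;
    }
}
```

In GetTextLocations the equivalent change would be to move the per-line matching into a helper and call it once more after the For loop, instead of special-casing the last index inside the loop.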

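I converted Jcis's VB project into a C# WpfApplication (the files are on Google Drive) and even applied Boris's bug fixes, but the project doesn't run. If anyone understands the program's algorithm, a fix would be much appreciated.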