在Itextsharp中使用ITextExtractionStrategy和LocationTextExtractionStrategy获取string的坐标

我有一个PDF文件,我正在阅读string使用ITextExtractionStrategy.Now从string我正在采取一个子string像My name is XYZ ,需要从PDF文件中获取子string的直angular坐标,但不能做到这一点。我知道LocationTextExtractionStrategy但没有得到如何使用它来获得坐标。

这是代码

 ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy(); string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy); currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText))); text.Append(currentText); string getcoordinate="My name is XYZ"; 

我怎样才能得到这个子string使用ITEXTSHARP的直angular坐标..

请帮忙。

这是一个非常非常简单的实现版本。

在实施之前,了解PDF文件中“文字”,“段落”,“句子”等概念为零是非常重要的。另外,PDF中的文本不一定是从左到右和从上到下排列的,处理非LTR语言。 “Hello World”这个短语可以写成PDF格式,如下所示:

 Draw H at (10, 10) Draw ell at (20, 10) Draw rld at (90, 10) Draw o Wo at (50, 20) 

它也可以写成

 Draw Hello World at (10,10) 

您需要实现的ITextExtractionStrategy接口有一个名为RenderText的方法,该方法在PDF中为每个文本块调用一次。 注意我说“块”而不是“单词”。 在上面的第一个例子中,这两个单词将被调用四次。 在第二个例子中,这两个词将被调用一次。 这是理解的重要部分。 PDFs没有文字,正因为如此,iTextSharp也没有文字。 “单词”部分是100%由你来解决。

也正如我上面所说的那样,PDF文件没有段落。 要知道这一点的原因是因为PDF文件不能换行到一个新的行。 任何时候你看到的东西看起来像一个段落返回,你实际上看到一个全新的文本绘图命令,具有不同的y坐标作为前一行。 看到这个进一步的讨论 。

下面的代码是一个非常简单的实现。 为此,我已经实现了ITextExtractionStrategy子类化LocationTextExtractionStrategy 。 每次调用RenderText()我都会find当前块的矩形( 在这里使用Mark的代码 )并存储它以备后用。 我正在使用这个简单的帮助类来存储这些块和矩形:

 //Helper class that stores our rectangle and text public class RectAndText { public iTextSharp.text.Rectangle Rect; public String Text; public RectAndText(iTextSharp.text.Rectangle rect, String text) { this.Rect = rect; this.Text = text; } } 

这里是子类:

 public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy { //Hold each coordinate public List<RectAndText> myPoints = new List<RectAndText>(); //Automatically called for each chunk of text in the PDF public override void RenderText(TextRenderInfo renderInfo) { base.RenderText(renderInfo); //Get the bounding box for the chunk of text var bottomLeft = renderInfo.GetDescentLine().GetStartPoint(); var topRight = renderInfo.GetAscentLine().GetEndPoint(); //Create a rectangle from it var rect = new iTextSharp.text.Rectangle( bottomLeft[Vector.I1], bottomLeft[Vector.I2], topRight[Vector.I1], topRight[Vector.I2] ); //Add this to our main collection this.myPoints.Add(new RectAndText(rect, renderInfo.GetText())); } } 

最后是上面的实现:

 //Our test file var testFile = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "test.pdf"); //Create our test file, nothing special using (var fs = new FileStream(testFile, FileMode.Create, FileAccess.Write, FileShare.None)) { using (var doc = new Document()) { using (var writer = PdfWriter.GetInstance(doc, fs)) { doc.Open(); doc.Add(new Paragraph("This is my sample file")); doc.Close(); } } } //Create an instance of our strategy var t = new MyLocationTextExtractionStrategy(); //Parse page 1 of the document above using (var r = new PdfReader(testFile)) { var ex = PdfTextExtractor.GetTextFromPage(r, 1, t); } //Loop through each chunk found foreach (var p in t.myPoints) { Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom)); } 

我不能强调,上述考虑“文字”,这将取决于你。 传递给RenderTextTextRenderInfo对象有一个名为GetCharacterRenderInfos()的方法,您可能可以使用该方法获取更多信息。 如果您不关心字体中GetBaseline() instead of下拉字符,您可能还想使用GetBaseline() instead of GetDescentLine()。

编辑

(我吃了一顿丰盛的午餐,所以我感觉更有帮助。)

下面是MyLocationTextExtractionStrategy的更新版本,它可以做我下面的评论,也就是说,它需要一个string来search并search那个string的每个块。 由于列出的所有原因,这在某些/许多/大部分/全部情况下不起作用。 如果子string在一个块中存在多次,它也只会返回第一个实例。 连字和变音符号也可能混淆了这一点。

 public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy { //Hold each coordinate public List<RectAndText> myPoints = new List<RectAndText>(); //The string that we're searching for public String TextToSearchFor { get; set; } //How to compare strings public System.Globalization.CompareOptions CompareOptions { get; set; } public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None) { this.TextToSearchFor = textToSearchFor; this.CompareOptions = compareOptions; } //Automatically called for each chunk of text in the PDF public override void RenderText(TextRenderInfo renderInfo) { base.RenderText(renderInfo); //See if the current chunk contains the text var startPosition = System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions); //If not found bail if (startPosition < 0) { return; } //Grab the individual characters var chars = renderInfo.GetCharacterRenderInfos().Skip(startPosition).Take(this.TextToSearchFor.Length).ToList(); //Grab the first and last character var firstChar = chars.First(); var lastChar = chars.Last(); //Get the bounding box for the chunk of text var bottomLeft = firstChar.GetDescentLine().GetStartPoint(); var topRight = lastChar.GetAscentLine().GetEndPoint(); //Create a rectangle from it var rect = new iTextSharp.text.Rectangle( bottomLeft[Vector.I1], bottomLeft[Vector.I2], topRight[Vector.I1], topRight[Vector.I2] ); //Add this to our main collection this.myPoints.Add(new RectAndText(rect, this.TextToSearchFor)); } 

你会像以前一样使用它,但现在构造函数只有一个必需的参数:

 var t = new MyLocationTextExtractionStrategy("sample"); 

这是一个古老的问题,但我在这里留下我的回应,因为我无法在网上find正确的答案。

正如克里斯·哈斯(Chris Haas)所揭示的,iText处理大块文字并不容易。 克里斯在我的大部分testing中失败的代码,因为一个词通常被分成不同的块(他在这个post中提到了这个问题)。

为了解决这个问题,我使用了这个策略:

  1. 以字符分割块(实际上每个字符为textrenderinfo对象)
  2. 按行分组字符。 这不是直接的,因为你必须处理块alignment。
  3. search每行需要查找的单词

我在这里留下的代码。 我用几个文件testing它,它工作得很好,但在某些情况下可能会失败,因为这个块有点棘手 – >单词转换。

希望有助于某人。

  class LocationTextExtractionStrategyEx : LocationTextExtractionStrategy { private List<LocationTextExtractionStrategyEx.ExtendedTextChunk> m_DocChunks = new List<ExtendedTextChunk>(); private List<LocationTextExtractionStrategyEx.LineInfo> m_LinesTextInfo = new List<LineInfo>(); public List<SearchResult> m_SearchResultsList = new List<SearchResult>(); private String m_SearchText; public const float PDF_PX_TO_MM = 0.3528f; public float m_PageSizeY; public LocationTextExtractionStrategyEx(String sSearchText, float fPageSizeY) : base() { this.m_SearchText = sSearchText; this.m_PageSizeY = fPageSizeY; } private void searchText() { foreach (LineInfo aLineInfo in m_LinesTextInfo) { int iIndex = aLineInfo.m_Text.IndexOf(m_SearchText); if (iIndex != -1) { TextRenderInfo aFirstLetter = aLineInfo.m_LineCharsList.ElementAt(iIndex); SearchResult aSearchResult = new SearchResult(aFirstLetter, m_PageSizeY); this.m_SearchResultsList.Add(aSearchResult); } } } private void groupChunksbyLine() { LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk1 = null; LocationTextExtractionStrategyEx.LineInfo textInfo = null; foreach (LocationTextExtractionStrategyEx.ExtendedTextChunk textChunk2 in this.m_DocChunks) { if (textChunk1 == null) { textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2); this.m_LinesTextInfo.Add(textInfo); } else if (textChunk2.sameLine(textChunk1)) { textInfo.appendText(textChunk2); } else { textInfo = new LocationTextExtractionStrategyEx.LineInfo(textChunk2); this.m_LinesTextInfo.Add(textInfo); } textChunk1 = textChunk2; } } public override string GetResultantText() { groupChunksbyLine(); searchText(); //In this case the return value is not useful return ""; } public override void RenderText(TextRenderInfo renderInfo) { LineSegment baseline = renderInfo.GetBaseline(); //Create ExtendedChunk ExtendedTextChunk aExtendedChunk = new ExtendedTextChunk(renderInfo.GetText(), baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth(), renderInfo.GetCharacterRenderInfos().ToList()); this.m_DocChunks.Add(aExtendedChunk); } public class ExtendedTextChunk { public string m_text; private Vector m_startLocation; private Vector m_endLocation; private Vector m_orientationVector; private int m_orientationMagnitude; private int m_distPerpendicular; private float m_charSpaceWidth; public List<TextRenderInfo> m_ChunkChars; public ExtendedTextChunk(string txt, Vector startLoc, Vector endLoc, float charSpaceWidth,List<TextRenderInfo> chunkChars) { this.m_text = txt; this.m_startLocation = startLoc; this.m_endLocation = endLoc; this.m_charSpaceWidth = charSpaceWidth; this.m_orientationVector = this.m_endLocation.Subtract(this.m_startLocation).Normalize(); this.m_orientationMagnitude = (int)(Math.Atan2((double)this.m_orientationVector[1], (double)this.m_orientationVector[0]) * 1000.0); this.m_distPerpendicular = (int)this.m_startLocation.Subtract(new Vector(0.0f, 0.0f, 1f)).Cross(this.m_orientationVector)[2]; this.m_ChunkChars = chunkChars; } public bool sameLine(LocationTextExtractionStrategyEx.ExtendedTextChunk textChunkToCompare) { return this.m_orientationMagnitude == textChunkToCompare.m_orientationMagnitude && this.m_distPerpendicular == textChunkToCompare.m_distPerpendicular; } } public class SearchResult { public int iPosX; public int iPosY; public SearchResult(TextRenderInfo aCharcter, float fPageSizeY) { //Get position of upperLeft coordinate Vector vTopLeft = aCharcter.GetAscentLine().GetStartPoint(); //PosX float fPosX = vTopLeft[Vector.I1]; //PosY float fPosY = vTopLeft[Vector.I2]; //Transform to mm and get y from top of page iPosX = Convert.ToInt32(fPosX * PDF_PX_TO_MM); iPosY = Convert.ToInt32((fPageSizeY - fPosY) * PDF_PX_TO_MM); } } public class LineInfo { public string m_Text; public List<TextRenderInfo> m_LineCharsList; public LineInfo(LocationTextExtractionStrategyEx.ExtendedTextChunk initialTextChunk) { this.m_Text = initialTextChunk.m_text; this.m_LineCharsList = initialTextChunk.m_ChunkChars; } public void appendText(LocationTextExtractionStrategyEx.ExtendedTextChunk additionalTextChunk) { m_LineCharsList.AddRange(additionalTextChunk.m_ChunkChars); this.m_Text += additionalTextChunk.m_text; } } } 

我知道这是一个非常古老的问题,但下面是我最终做的。 只是在这里张贴它希望这将是有用的别人。

以下代码将告诉您包含search文本的行的起始坐标。 不应该很难将其修改为给位置的文字。 注意。 我在itextsharp 5.5.11.0上testing了这个版本,并且在一些老版本上不能运行

如上所述,PDF不具有文字/行或段落的概念。 但是我发现LocationTextExtractionStrategy在分割线条和文字方面做得非常好。 所以我的解决scheme就是基于这个。

免责声明:

此解决scheme基于https://github.com/itext/itextsharp/blob/develop/src/core/iTextSharp/text/pdf/parser/LTextTextExtractionStrategy.cs ,该文件有一条评论,说这是一个开发预览。 所以这可能不会在未来。

无论如何,这是代码。

 using System.Collections.Generic; using iTextSharp.text.pdf.parser; namespace Logic { public class LocationTextExtractionStrategyWithPosition : LocationTextExtractionStrategy { private readonly List<TextChunk> locationalResult = new List<TextChunk>(); private readonly ITextChunkLocationStrategy tclStrat; public LocationTextExtractionStrategyWithPosition() : this(new TextChunkLocationStrategyDefaultImp()) { } /** * Creates a new text extraction renderer, with a custom strategy for * creating new TextChunkLocation objects based on the input of the * TextRenderInfo. * @param strat the custom strategy */ public LocationTextExtractionStrategyWithPosition(ITextChunkLocationStrategy strat) { tclStrat = strat; } private bool StartsWithSpace(string str) { if (str.Length == 0) return false; return str[0] == ' '; } private bool EndsWithSpace(string str) { if (str.Length == 0) return false; return str[str.Length - 1] == ' '; } /** * Filters the provided list with the provided filter * @param textChunks a list of all TextChunks that this strategy found during processing * @param filter the filter to apply. If null, filtering will be skipped. * @return the filtered list * @since 5.3.3 */ private List<TextChunk> filterTextChunks(List<TextChunk> textChunks, ITextChunkFilter filter) { if (filter == null) { return textChunks; } var filtered = new List<TextChunk>(); foreach (var textChunk in textChunks) { if (filter.Accept(textChunk)) { filtered.Add(textChunk); } } return filtered; } public override void RenderText(TextRenderInfo renderInfo) { LineSegment segment = renderInfo.GetBaseline(); if (renderInfo.GetRise() != 0) { // remove the rise from the baseline - we do this because the text from a super/subscript render operations should probably be considered as part of the baseline of the text the super/sub is relative to Matrix riseOffsetTransform = new Matrix(0, -renderInfo.GetRise()); segment = segment.TransformBy(riseOffsetTransform); } TextChunk tc = new TextChunk(renderInfo.GetText(), tclStrat.CreateLocation(renderInfo, segment)); locationalResult.Add(tc); } public IList<TextLocation> GetLocations() { var filteredTextChunks = filterTextChunks(locationalResult, null); filteredTextChunks.Sort(); TextChunk lastChunk = null; var textLocations = new List<TextLocation>(); foreach (var chunk in filteredTextChunks) { if (lastChunk == null) { //initial textLocations.Add(new TextLocation { Text = chunk.Text, X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]), Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1]) }); } else { if (chunk.SameLine(lastChunk)) { var text = ""; // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space if (IsChunkAtWordBoundary(chunk, lastChunk) && !StartsWithSpace(chunk.Text) && !EndsWithSpace(lastChunk.Text)) text += ' '; text += chunk.Text; textLocations[textLocations.Count - 1].Text += text; } else { textLocations.Add(new TextLocation { Text = chunk.Text, X = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[0]), Y = iTextSharp.text.Utilities.PointsToMillimeters(chunk.Location.StartLocation[1]) }); } } lastChunk = chunk; } //now find the location(s) with the given texts return textLocations; } } public class TextLocation { public float X { get; set; } public float Y { get; set; } public string Text { get; set; } } } 

如何调用该方法:

  using (var reader = new PdfReader(inputPdf)) { var parser = new PdfReaderContentParser(reader); var strategy = parser.ProcessContent(pageNumber, new LocationTextExtractionStrategyWithPosition()); var res = strategy.GetLocations(); reader.Close(); } var searchResult = res.Where(p => p.Text.Contains(searchText)).OrderBy(p => pY).Reverse().ToList(); inputPdf is a byte[] that has the pdf data pageNumber is the page where you want to search in