PHP sentence boundary detection

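I want to split text into sentences in PHP. I'm currently using a regular expression, which gives about 95% accuracy, and I'd like to improve on that with a better approach. I've seen NLP tools that do this in Perl, Java, and C, but nothing that fits PHP. Do you know of such a tool?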

Enhanced regex solution

Assuming you care about handling abbreviations such as Mr., Mrs., etc., the following single-regex solution works quite well:

    <?php // test.php Rev:20160820_1800
    $split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
        # Split sentences on whitespace between them.
        # See: http://stackoverflow.com/a/5844564/433790
        (?<=          # Sentence split location preceded by
          [.!?]       # either an end of sentence punct,
        | [.!?][\'"]  # or end of sentence punct and quote.
        )             # End positive lookbehind.
        (?<!          # But don\'t split after these:
          Mr\.        # Either "Mr."
        | Mrs\.       # Or "Mrs."
        | Ms\.        # Or "Ms."
        | Jr\.        # Or "Jr."
        | Dr\.        # Or "Dr."
        | Prof\.      # Or "Prof."
        | Sr\.        # Or "Sr."
        | T\.V\.A\.   # Or "TVA"
                      # Or... (you get the idea).
        )             # End negative lookbehind.
        \s+           # Split on whitespace between sentences,
        (?=\S)        # (but not at end of string).
        %xi';  // End $split_sentences.

    $text = 'This is sentence one. Sentence two! Sentence thr'.
            'ee? Sentence "four". Sentence "five"! Sentence "'.
            'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
            'Jones said: "Mrs. Smith you have a lovely daught'.
            'er!" The TVA is a big project! '; // Note ws at end.

    $sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($sentences); ++$i) {
        printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
    }
    ?>

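Note that you can easily add abbreviations to the expression or remove them from it. Given the following test paragraph:

This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The TVA is a big project!

Here is the output from the script: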

Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The TVA is a big project!]

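Basic regex solution

The author of the question commented that the solution above "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the expression above is about as clean and simple as you can get. Here it is: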

    $re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
    $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

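Note that both solutions correctly identify sentences that end with a quotation mark after the closing punctuation. If you don't care about matching sentences that end in a quotation mark, the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/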
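To see what the quote handling buys, here is a minimal sketch that runs both basic patterns over the same input. The sample text is made up for illustration and is not part of the original answer.

```php
<?php
// Illustrative input (an assumption, not from the original answer).
$text = 'He said "Stop!" Then he left. Done?';

// Basic pattern: split after [.!?], optionally followed by a closing quote.
$with_quotes = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
// Simplified pattern: split only when [.!?] comes directly before the whitespace.
$without_quotes = '/(?<=[.!?])\s+(?=\S)/';

$a = preg_split($with_quotes, $text, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($without_quotes, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($a); // three pieces: the quoted sentence is split off
print_r($b); // two pieces: the quoted sentence stays glued to the next one
```

The simplified pattern misses the boundary after `"Stop!"` because the character right before the whitespace is the quote, not the punctuation mark.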

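Edit: 20130820_1000 Added TVA (another word to be ignored) to the regex and the test string. (In answer to PapyRef's comment question.)

Edit: 20130820_1800 Tidied and renamed the regex and added a shebang. Also fixed the regex to prevent splitting the text on trailing whitespace.

A slight improvement on someone else's work: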

    $re = '/# Split sentences on whitespace between them.
        (?<=                # Begin positive lookbehind.
          [.!?]             # Either an end of sentence punct,
        | [.!?][\'"]        # or end of sentence punct and quote.
        )                   # End positive lookbehind.
        (?<!                # Begin negative lookbehind.
          Mr\.              # Skip either "Mr."
        | Mrs\.             # or "Mrs.",
        | Ms\.              # or "Ms.",
        | Jr\.              # or "Jr.",
        | Dr\.              # or "Dr.",
        | Prof\.            # or "Prof.",
        | Sr\.              # or "Sr.",
        | \s[A-Z]\.         # or initials ex: "George W. Bush",
                            # or... (you get the idea).
        )                   # End negative lookbehind.
        \s+                 # Split on whitespace between sentences.
        /ix';
    $sentences = preg_split($re, $story, -1, PREG_SPLIT_NO_EMPTY);

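As a low-tech approach, you might consider using a series of explode calls in a loop, with ., ! and ? as your needles. This would be very memory- and processor-intensive (as most text processing is). You would have a bunch of temporary arrays and one master array in which all the sentences found are numerically indexed in the right order.

You would also have to check for common exceptions (such as the . in titles like Mr. and Dr.), but with everything in an array, these kinds of checks shouldn't be that bad.

I'm not sure this is any better than regex in terms of speed and scaling, but it would be worth a shot. How many sentences do these blocks of text need to be split into?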
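The explode-based idea above could be sketched roughly as follows. The function name `explode_sentences`, the abbreviation list, and the sample text are all hypothetical; the answer describes the approach only in prose, so treat this as one possible reading of it rather than the author's code.

```php
<?php
// Hypothetical helper: split on ., ! and ? with explode(), then glue back
// any piece that ends in a known abbreviation such as "Dr.".
function explode_sentences(string $text, array $abbreviations): array
{
    $pieces = [$text];
    foreach (['.', '!', '?'] as $needle) {
        $next = []; // temporary array for this pass
        foreach ($pieces as $piece) {
            $parts = explode($needle, $piece);
            $last = count($parts) - 1;
            foreach ($parts as $i => $part) {
                // explode() drops the needle, so re-attach it to all but the last part.
                $next[] = $i < $last ? $part . $needle : $part;
            }
        }
        $pieces = $next;
    }

    $sentences = []; // master array, in original order
    $buffer = '';
    foreach ($pieces as $piece) {
        $buffer .= $piece;
        $words = preg_split('/\s+/', trim($buffer));
        if (in_array(end($words), $abbreviations, true)) {
            continue; // ends in an abbreviation: keep accumulating
        }
        if (trim($buffer) !== '') {
            $sentences[] = trim($buffer);
        }
        $buffer = '';
    }
    if (trim($buffer) !== '') {
        $sentences[] = trim($buffer);
    }
    return $sentences;
}

$result = explode_sentences(
    'Dr. Jones is here. Is Mrs. Smith in? Yes!',
    ['Mr.', 'Mrs.', 'Dr.']
);
print_r($result);
```

This sketch still splits decimals such as 3.14 and ignores closing quotes, so it would need more exception checks before it could compete with the regex solutions.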

I'm using this regex:

    preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

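It won't work on a sentence that starts with a number, but it should produce very few false positives as well. Of course, what you are doing matters too. My program now uses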

 explode('.',$text); 

because I decided that speed was more important than accuracy.
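To make the trade-off concrete, here is a small sketch that applies both calls to the same input. The sample text is an assumption chosen to include a decimal number and a sentence that starts with a digit.

```php
<?php
// Illustrative input (an assumption, not from the original post).
$text = 'It costs 3.50 today. 20 people came. Really? Yes.';

// Regex: split on one whitespace character after [.?!], but only
// before a capital letter or a quote.
$by_regex = preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

// explode: split on every period, wherever it occurs.
$by_explode = explode('.', $text);

print_r($by_regex);
print_r($by_explode);
```

The regex misses the boundary before "20 people came." because that sentence starts with a digit, while explode cuts "3.50" in half, drops the periods, leaves "Really? Yes" unsplit, and returns a trailing empty string.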

Build an array of abbreviations like this

    $skip_array = array('Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr'); // etc.

Compile them into an expression

    $skip = '';
    foreach ($skip_array as $abbr) {
        $skip = $skip . (empty($skip) ? '' : '|') . '\s{1}' . $abbr . '[.!?]';
    }

Finally, run this preg_split to break the text into sentences.

    $lines = preg_split("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/", $txt, -1, PREG_SPLIT_NO_EMPTY);
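Put together, the three steps above might look like the sketch below. Using `implode` with `array_map` and `preg_quote` in place of the manual loop is my substitution, not the original answer's code, and the sample text is invented for illustration.

```php
<?php
$skip_array = ['Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr'];

// One lookbehind alternative per abbreviation: a whitespace character,
// the abbreviation itself, then the punctuation mark.
// preg_quote() is a precaution in case an abbreviation contains regex metacharacters.
$alternatives = array_map(
    function ($abbr) {
        return '\s' . preg_quote($abbr, '/') . '[.!?]';
    },
    $skip_array
);
$skip = implode('|', $alternatives);

// Illustrative input (an assumption, not from the original post).
$txt = 'Yesterday Dr. Jones met Prof. Smith. They talked. then left!';

// Split after [.?!] unless an abbreviation precedes it,
// and only when the next character is not a lowercase letter.
$lines = preg_split("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/", $txt, -1, PREG_SPLIT_NO_EMPTY);

print_r($lines);
```

Because every alternative starts with `\s`, an abbreviation at the very start of the text is not protected by the lookbehind, so a text that begins with "Dr. Jones" would still be split after "Dr.".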

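If you are processing HTML, watch out for tags being stripped, which removes the whitespace between sentences. If you end up with situations.Like this where.They stick together, the text becomes very difficult to parse.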
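One way to avoid the glued-together sentences described above is to replace each tag with a space before splitting. This is a minimal sketch with a made-up HTML fragment. The tag regex is deliberately naive and is not a real HTML parser.

```php
<?php
// Illustrative input (an assumption, not from the original post).
$html = '<p>First sentence.</p><p>Second sentence.</p>';

// strip_tags() alone removes the tags and leaves no whitespace behind.
$glued = strip_tags($html);

// Replace every tag with a space, then collapse runs of whitespace.
$spaced = preg_replace('/<[^>]+>/', ' ', $html);
$spaced = trim(preg_replace('/\s+/', ' ', $spaced));

$sentences = preg_split('/(?<=[.!?])\s+(?=\S)/', $spaced, -1, PREG_SPLIT_NO_EMPTY);

echo $glued, "\n";
print_r($sentences);
```

The catch is that inline tags inside a word, such as `<b>` wrapped around part of a word, would also turn into spaces, so real input may need block-level tags treated differently from inline ones.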

@ridgerunner I ported your PHP code to C#.

I get 2 sentences as a result:

  • Mr. J. Dujardin régle sa T.V.
  • A. en esp. uniquement

The correct result should be a single sentence: Mr. J. Dujardin régle sa T.V.A. en esp. uniquement

And with our test paragraph

    string sText = "This is sentence one. Sentence two! Sentence three? Sentence \"four\". Sentence \"five\"! Sentence \"six\"? Sentence \"seven.\" Sentence 'eight!' Dr. Jones said: \"Mrs. Smith you have a lovely daughter!\" The T.V.A. is a big project!";

the result is

    index: 0 sentence: This is sentence one.
    index: 22 sentence: Sentence two!
    index: 36 sentence: Sentence three?
    index: 52 sentence: Sentence "four".
    index: 69 sentence: Sentence "five"!
    index: 86 sentence: Sentence "six"?
    index: 102 sentence: Sentence "seven.
    index: 118 sentence: " Sentence 'eight!'
    index: 136 sentence: ' Dr. Jones said: "Mrs. Smith you have a lovely daughter!
    index: 193 sentence: " The T.V.
    index: 203 sentence: A. is a big project!

C# code:

    using System;
    using System.Text.RegularExpressions;

    string sText = "Mr. J. Dujardin régle sa T.V.A. en esp. uniquement";
    Regex rx = new Regex(@"(\S.+?
          [.!?]         # Either an end of sentence punct,
        | [.!?]['""]    # or end of sentence punct and quote.
        )
        (?<!            # Begin negative lookbehind.
          Mr.           # Skip either Mr.
        | Mrs.          # or Mrs.,
        | Ms.           # or Ms.,
        | Jr.           # or Jr.,
        | Dr.           # or Dr.,
        | Prof.         # or Prof.,
        | Sr.           # or Sr.,
        | \s[A-Z].      # or initials ex: George W. Bush,
        | T\.V\.A\.     # or T.V.A.
        )               # End negative lookbehind.
        (?=|\s+|$)",
        RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
    foreach (Match match in rx.Matches(sText))
    {
        Console.WriteLine("index: {0} sentence: {1}", match.Index, match.Value);
    }