Split a string at uppercase letters

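What is the pythonic way to split a string before the occurrences of a given set of characters?

For example, I want to split 'TheLongAndWindingRoad' at any occurrence of an uppercase letter (possibly except the first) and obtain ['The', 'Long', 'And', 'Winding', 'Road'].

Edit: It should also split single occurrences, i.e. from 'ABC' I'd like to obtain ['A', 'B', 'C'].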

Unfortunately, before Python 3.7 re.split could not split on a zero-width match. You can use re.findall instead:

    >>> import re
    >>> re.findall('[A-Z][^A-Z]*', 'TheLongAndWindingRoad')
    ['The', 'Long', 'And', 'Winding', 'Road']
    >>> re.findall('[A-Z][^A-Z]*', 'ABC')
    ['A', 'B', 'C']

    >>> import re
    >>> re.findall('[A-Z][a-z]*', 'TheLongAndWindingRoad')
    ['The', 'Long', 'And', 'Winding', 'Road']
    >>> re.findall('[A-Z][a-z]*', 'SplitAString')
    ['Split', 'A', 'String']
    >>> re.findall('[A-Z][a-z]*', 'ABC')
    ['A', 'B', 'C']

If you want "It'sATest" to split into ["It's", 'A', 'Test'], change the regex to "[A-Z][a-z']*".
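On Python 3.7 or later, re.split does accept patterns that can match an empty string, so a lookahead-based split may also work. The following is a minimal sketch rather than part of the answers above. The helper name split_before_upper is made up for illustration, and it treats only ASCII A-Z as uppercase.

```python
import re

def split_before_upper(s):
    """Split s before every ASCII uppercase letter except one at index 0.

    Assumes Python 3.7+, where re.split can split on zero-width matches.
    """
    # (?<!^)    : do not split at the very start of the string
    # (?=[A-Z]) : the next character is an uppercase letter (not consumed)
    return re.split(r'(?<!^)(?=[A-Z])', s)

print(split_before_upper('TheLongAndWindingRoad'))
print(split_before_upper('ABC'))
```

Unlike the re.findall patterns, this keeps a leading lowercase run, so 'theLong' yields both 'the' and 'Long'.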

Here is an alternative regex solution. The problem can be rephrased as "how do I insert a space before each uppercase letter, and then split":

    >>> import re
    >>> s = "TheLongAndWindingRoad ABC A123B45"
    >>> re.sub(r"([A-Z])", r" \1", s).split()
    ['The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']

This has the advantage of preserving all non-whitespace characters, which most of the other solutions do not.
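To illustrate the point about preserved characters, here is a small comparison sketch. The input string is made up for the example; it starts with a lowercase word and contains digits, which a findall pattern of the form [A-Z][a-z]* never matches.

```python
import re

s = "theLongRoad A123B45"  # hypothetical input chosen for illustration

# findall only returns what the pattern matches, so the leading
# lowercase word and the digits are silently dropped.
matched_only = re.findall(r"[A-Z][a-z]*", s)

# Inserting a space before each capital and splitting on whitespace
# keeps every non-whitespace character of the input.
space_split = re.sub(r"([A-Z])", r" \1", s).split()

print(matched_only)  # ['Long', 'Road', 'A', 'B']
print(space_split)   # ['the', 'Long', 'Road', 'A123', 'B45']
```

The trade-off is that existing whitespace in the input also acts as a separator, which may or may not be what you want.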

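    import re
    # The capturing group keeps the matched pieces; filter drops the empty strings.
    # list() is needed on Python 3, where filter returns an iterator.
    list(filter(None, re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad")))

or

    [s for s in re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad") if s]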

A variation on @ChristopheD's solution:

    s = 'TheLongAndWindingRoad'

    # Appending 'A' adds a sentinel position just past the end of s
    pos = [i for i, e in enumerate(s + 'A') if e.isupper()]
    parts = [s[pos[j]:pos[j + 1]] for j in range(len(pos) - 1)]

    print(parts)

Alternative solution (if you dislike explicit regexes):

    s = 'TheLongAndWindingRoad'

    # Indices of all uppercase letters
    pos = [i for i, e in enumerate(s) if e.isupper()]

    parts = []
    for j in range(len(pos)):
        try:
            parts.append(s[pos[j]:pos[j + 1]])
        except IndexError:
            # Last uppercase letter: take everything up to the end
            parts.append(s[pos[j]:])

    print(parts)

    src = 'TheLongAndWindingRoad'
    glue = ' '

    # Put the glue before each uppercase letter, then split on it
    result = ''.join(glue + x if x.isupper() else x for x in src).strip(glue).split(glue)

Another one without regex, with the option to keep contiguous uppercase letters together if desired:

    def split_on_uppercase(s, keep_contiguous=False):
        """
        Args:
            s (str): string
            keep_contiguous (bool): flag to indicate we want to
                                    keep contiguous uppercase chars together

        Returns:
            list of str: the parts of s
        """
        string_length = len(s)
        # True if the character before or after position i is lowercase
        is_lower_around = (lambda: s[i-1].islower() or
                           string_length > (i + 1) and s[i + 1].islower())

        start = 0
        parts = []
        for i in range(1, string_length):
            if s[i].isupper() and (not keep_contiguous or is_lower_around()):
                parts.append(s[start: i])
                start = i
        parts.append(s[start:])

        return parts

    >>> split_on_uppercase('theLongWindingRoad')
    ['the', 'Long', 'Winding', 'Road']
    >>> split_on_uppercase('TheLongWindingRoad')
    ['The', 'Long', 'Winding', 'Road']
    >>> split_on_uppercase('TheLongWINDINGRoadT', True)
    ['The', 'Long', 'WINDING', 'Road', 'T']
    >>> split_on_uppercase('ABC')
    ['A', 'B', 'C']
    >>> split_on_uppercase('ABCD', True)
    ['ABCD']
    >>> split_on_uppercase('')
    ['']
    >>> split_on_uppercase('hello world')
    ['hello world']

Another way without using regex or enumerate:

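    word = 'TheLongAndWindingRoad'
    chars = list(word)  # avoid shadowing the built-in name list

    # Start at index 1 so the first character never gets a leading space.
    # Working by position also handles repeated letters correctly.
    for i in range(1, len(chars)):
        if chars[i].isupper():
            chars[i] = ' ' + chars[i]

    fin_list = ''.join(chars).split(' ')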

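I think this approach is simpler and clearer: there is no need to chain many methods or to write a long list comprehension that is hard to read.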

An alternative using enumerate and isupper()

Code:

    strs = 'TheLongAndWindingRoad'
    ind = 0
    new_lst = []
    # Start at index 1 so the first letter never triggers a split
    for index, val in enumerate(strs[1:], 1):
        if val.isupper():
            new_lst.append(strs[ind:index])
            ind = index
    if ind < len(strs):
        new_lst.append(strs[ind:])
    print(new_lst)

Output:

    ['The', 'Long', 'And', 'Winding', 'Road']

This can be done with the more_itertools.split_before tool.

    import more_itertools as mit

    iterable = "TheLongAndWindingRoad"
    ["".join(i) for i in mit.split_before(iterable, lambda s: s.isupper())]
    # ['The', 'Long', 'And', 'Winding', 'Road']

It should also split single occurrences, i.e. from 'ABC' I'd like to obtain ['A', 'B', 'C'].

    iterable = "ABC"
    ["".join(i) for i in mit.split_before(iterable, lambda s: s.isupper())]
    # ['A', 'B', 'C']

more_itertools is a third-party package with more than 60 useful tools, including implementations of all the original itertools recipes, which saves you from writing them by hand.
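If installing a third-party package is not an option, the same idea can be approximated with a short generator. This is only a sketch of the behaviour described above, not the library's actual implementation, and the name split_before_sketch is hypothetical.

```python
def split_before_sketch(iterable, pred):
    """Yield lists of items, starting a new list just before each item
    for which pred(item) is true. Illustrative stand-in only."""
    chunk = []
    for item in iterable:
        # Close the current chunk before an item that matches,
        # unless the chunk is still empty (e.g. at the very start).
        if pred(item) and chunk:
            yield chunk
            chunk = []
        chunk.append(item)
    if chunk:
        yield chunk

parts = ["".join(c) for c in split_before_sketch("TheLongAndWindingRoad", str.isupper)]
print(parts)  # ['The', 'Long', 'And', 'Winding', 'Road']
```

Because it works on any iterable and any predicate, the same helper covers the more general question of splitting before a given set of characters, for instance with pred=lambda c: c in "-_".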