Reading a text file and splitting it into single words in Python

I have a text file made up of numbers and words, for example: 09807754 18 n 03 aristocrat 0 blue_blood 0 patrician. I want to split it so that each word or number appears on a new line.

A whitespace separator would be ideal, because I want the words joined by dashes to stay connected.

This is what I have so far:

    f = open('words.txt', 'r')
    for word in f:
        print(word)

I'm not sure where to go from here. This is the output I would like:

    09807754
    18
    n
    3
    aristocrat
    ...

If there are no quotes around your data:

    with open('words.txt', 'r') as f:
        for line in f:
            for word in line.split():
                print(word)
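As a quick check of what `split()` does when called with no arguments, here is a minimal sketch that uses the sample line from the question as an in-memory string instead of a file. The variable names are only illustrative.

```python
# Sample line taken from the question; no file is needed for this check.
line = "09807754 18 n 03 aristocrat 0 blue_blood 0 patrician\n"

# With no arguments, split() breaks on any run of whitespace
# and discards the trailing newline.
words = line.split()

for word in words:
    print(word)
```

Because only whitespace acts as a separator, a token such as blue_blood comes through in one piece.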

If you want a nested list of the words on each line of the file:

    with open("words.txt") as f:
        # one inner list of words per line
        words_by_line = [line.split() for line in f]

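Or, if you want to flatten that into a single list of all the words in the file, you could do this:

    with open("words.txt") as f:
        # one flat list of every word in the file
        words = [word for line in f for word in line.split()]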

If you want a regular-expression solution:

    import re

    with open("words.txt") as f:
        for line in f:
            for word in re.findall(r'\w+', line):
                # word by word
                print(word)
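One thing worth checking with the regex approach is how the pattern treats dashes, since the question asks for dashed words to stay connected. The sketch below compares `\w+` with `\S+` on a sample string. The token `well-born` is a made-up example, not from the original file.

```python
import re

# "well-born" is a hypothetical hyphenated token added for illustration.
line = "09807754 18 n 03 aristocrat 0 blue_blood 0 well-born"

# \w+ matches letters, digits and underscores, so a hyphen ends a match.
word_chars = re.findall(r'\w+', line)

# \S+ matches any run of non-whitespace, so hyphenated words stay whole.
non_space = re.findall(r'\S+', line)

print(word_chars[-2:])
print(non_space[-1])
```

If hyphenated words need to survive intact, `\S+` (or plain `split()`) is probably the closer fit. `\w+` is enough when the tokens only contain underscores.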

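Or, if you would like that to be a line-by-line generator with a regular expression:

    import re

    with open("words.txt") as f:
        words = (word for line in f for word in re.findall(r'\w+', line))
        for word in words:  # consume the generator while the file is still open
            print(word)

Another option is to read the whole file at once and split it on whitespace:

    with open('words.txt') as f:
        for word in f.read().split():
            print(word)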

As a supplement: if you are reading a very large file and don't want to load all of it into memory at once, you could read it through a buffer and return each word with yield:

    def read_words(inputfile):
        with open(inputfile, 'r') as f:
            while True:
                buf = f.read(10240)
                if not buf:
                    break

                # make sure we end on a space (word boundary)
                while not str.isspace(buf[-1]):
                    ch = f.read(1)
                    if not ch:
                        break
                    buf += ch

                # an empty file simply yields nothing
                for word in buf.split():
                    yield word

    if __name__ == "__main__":
        for word in read_words('./very_large_file.txt'):
            print(word)  # replace with your own processing
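If the file has ordinary line breaks, a simpler route to the same goal may be to rely on the file object itself, which reads one line at a time when iterated. This is a different technique from the manual buffer above, sketched under the assumption that no single line is too large for memory. `iter_words` is a hypothetical helper name, and the temporary file exists only for the demonstration.

```python
import os
import tempfile

def iter_words(path):
    """Yield whitespace-separated words one at a time (hypothetical helper)."""
    with open(path) as f:
        for line in f:  # the file object hands back one line per iteration
            yield from line.split()

# Small demonstration with a throwaway file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("09807754 18 n\n03 aristocrat\n")
    demo_path = tmp.name

demo_words = list(iter_words(demo_path))
os.remove(demo_path)
print(demo_words)
```

The manual buffer is still the safer choice when the file might consist of one enormous line.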

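Here is my totally functional approach, which avoids having to read and split lines. It uses the itertools module:

Note for Python 3: replace itertools.imap with map.

    import itertools

    def readwords(mfile):
        # read one character at a time until read() returns an empty string,
        # then group consecutive characters by whether they are whitespace
        byte_stream = itertools.groupby(
            itertools.takewhile(lambda c: bool(c),
                itertools.imap(mfile.read,
                    itertools.repeat(1))), str.isspace)

        # keep only the non-whitespace groups, joined back into words
        return ("".join(group) for pred, group in byte_stream if not pred)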

Sample usage:

    >>> import sys
    >>> for w in readwords(sys.stdin):
    ...     print (w)
    ...
    I really love this new method of reading words in python
    I
    really
    love
    this
    new
    method
    of
    reading
    words
    in
    python
    It's soo very Functional!
    It's
    soo
    very
    Functional!
    >>>

I guess in your case, this would be the way to use the function:

    with open('words.txt', 'r') as f:
        for word in readwords(f):
            print(word)
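For Python 3, the note above about replacing itertools.imap with map can be applied as in the following sketch. It is an adaptation, not the answerer's own code, and it uses an in-memory `io.StringIO` as a stand-in for the file so that it runs without words.txt.

```python
import io
import itertools

def readwords(mfile):
    # Read one character at a time; read(1) returns '' at end of file,
    # which is falsy and therefore stops takewhile.
    chars = itertools.takewhile(bool, map(mfile.read, itertools.repeat(1)))
    # Group consecutive characters by whether they are whitespace.
    groups = itertools.groupby(chars, str.isspace)
    # Keep only the non-whitespace groups, joined back into words.
    return ("".join(group) for is_space, group in groups if not is_space)

# Stand-in for an open text file, using part of the question's sample data.
sample = io.StringIO("09807754 18 n 03\naristocrat 0 blue_blood")
result = list(readwords(sample))
print(result)
```

Reading a single character per call keeps memory use tiny, although it is likely to be slower than splitting whole lines.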