Lazy method for reading a big file in Python?

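I have a very big file (4GB), and when I try to read it my computer hangs. So I want to read it piece by piece, store each processed piece in another file, and then read the next piece.

Is there any way to yield these pieces?

I would love to have a lazy method.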

To write a lazy function, just use yield:

    def read_in_chunks(file_object, chunk_size=1024):
        """Lazy function (generator) to read a file piece by piece.
        Default chunk size: 1k."""
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

    with open('really_big_file.dat') as f:
        for piece in read_in_chunks(f):
            process_data(piece)
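Since the question asks for each processed piece to be stored in another file, here is a minimal sketch of the whole pipeline built on the same generator. The names `transform` and `process_file` are hypothetical, and the upper-casing step is only a stand-in for real processing.

```python
import os
import tempfile


def read_in_chunks(file_object, chunk_size=1024):
    """Yield successive chunks until read() returns an empty result."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


def transform(piece):
    # Hypothetical stand-in for the real per-chunk processing
    return piece.upper()


def process_file(src_path, dst_path, chunk_size=1024):
    # Only one chunk is held in memory at any time
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for piece in read_in_chunks(src, chunk_size):
            dst.write(transform(piece))


# Demo on a tiny temporary file standing in for the 4GB one
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "in.dat")
    dst_path = os.path.join(tmp_dir, "out.dat")
    with open(src_path, "wb") as f:
        f.write(b"hello big file")
    process_file(src_path, dst_path, chunk_size=4)
    with open(dst_path, "rb") as f:
        result = f.read()

print(result)  # b'HELLO BIG FILE'
```

Note that a chunk boundary can fall anywhere, so this only suits processing that does not care where a chunk starts or ends.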

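Another option is to use iter and a helper function:

    f = open('really_big_file.dat')

    def read1k():
        return f.read(1024)

    # The sentinel '' matches a file opened in text mode;
    # use b'' if the file is opened in binary mode.
    for piece in iter(read1k, ''):
        process_data(piece)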
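In Python 3, a file opened in binary mode returns `b''` at end of file rather than `''`, so the sentinel passed to `iter` has to match the mode. The sketch below shows the same two-argument `iter` idea with `functools.partial` in place of a named helper; the function name `iter_chunks` is made up for illustration.

```python
import io
from functools import partial


def iter_chunks(file_object, chunk_size=1024):
    """Return an iterator over chunks of a file opened in binary mode."""
    # iter(callable, sentinel) keeps calling read(chunk_size)
    # and stops as soon as it returns the sentinel b""
    return iter(partial(file_object.read, chunk_size), b"")


# Demo on an in-memory stream standing in for a big file
stream = io.BytesIO(b"abcdefghij")
chunks = list(iter_chunks(stream, chunk_size=4))
print(chunks)  # [b'abcd', b'efgh', b'ij']
```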

If the file is line-based, the file object is already a lazy generator of lines:

    for line in open('really_big_file.dat'):
        process_data(line)

If your computer, OS and Python are 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation (shown in its Python 3 form, where the map holds bytes):

    import mmap

    with open("hello.txt", "r+b") as f:
        # memory-map the file, size 0 means whole file
        mm = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        print(mm.readline())  # prints b"Hello Python!\n"
        # read content via slice notation
        print(mm[:5])  # prints b"Hello"
        # update content using slice notation;
        # note that new content must have same size
        mm[6:] = b" world!\n"
        # ... and read again using standard file methods
        mm.seek(0)
        print(mm.readline())  # prints b"Hello  world!\n"
        # close the map
        mm.close()

If your computer, OS or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
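For read-only scanning, a mapping can also be opened with `access=mmap.ACCESS_READ` and searched or iterated without ever writing to the file. The following is a small sketch under that assumption; the temporary file and its contents are invented for the demo.

```python
import mmap
import os
import tempfile

# Create a small temporary file to stand in for a large one
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"alpha\nbeta\ngamma\n")
    path = tmp.name

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        offset = mm.find(b"beta")       # search without reading the file into a string
        word = mm[offset:offset + 4]    # slicing returns a bytes object
        # find() and slicing do not move the position, so this starts at 0
        lines = list(iter(mm.readline, b""))

os.remove(path)

print(offset, word, lines)
```

The operating system pages data in on demand, so only the regions actually touched need to be resident in memory.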

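file.readlines() accepts an optional size hint. It stops reading once the total size of the lines read so far exceeds the hint, so the argument bounds roughly how much data each call returns rather than an exact number of lines.

    bigfile = open('bigfilename', 'r')
    tmp_lines = bigfile.readlines(BUF_SIZE)
    while tmp_lines:
        process([line for line in tmp_lines])
        tmp_lines = bigfile.readlines(BUF_SIZE)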
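The same loop can be wrapped in a generator so the caller just iterates over batches of whole lines. This is a sketch only; the name `read_line_batches` and the default hint of 65536 are assumptions, not part of the original answer.

```python
import io


def read_line_batches(file_object, size_hint=65536):
    """Yield lists of complete lines, each list roughly size_hint in size."""
    while True:
        batch = file_object.readlines(size_hint)
        if not batch:  # an empty list means end of file
            break
        yield batch


# Demo on an in-memory stream with a deliberately tiny hint
stream = io.StringIO("a\nbb\nccc\ndddd\n")
batches = list(read_line_batches(stream, size_hint=4))

# Every line appears exactly once, in order, across the batches
all_lines = [line for batch in batches for line in batch]
print(all_lines)  # ['a\n', 'bb\n', 'ccc\n', 'dddd\n']
```

Because readlines() never splits a line, a single batch can overshoot the hint by up to one line.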

Also take a look at this post on Neopythonic: "Sorting a million 32-bit integers in 2MB of RAM using Python"

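There are already many good answers, but I recently ran into a similar problem and the solution I needed is not listed here, so I figured I could add to this thread.

80% of the time, I need to read files line by line. Then, as suggested in this answer, you want to use the file object itself as a lazy generator:

    with open('big.csv') as f:
        for line in f:
            process(line)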

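However, I recently ran into a very big (almost) single-line csv, where the row separator was in fact not '\n' but '|'.

Reading line by line was not an option, but I still needed to process it row by row.
Converting '|' to '\n' before processing was also out of the question, because some of the fields of this csv contained '\n' (free-text user input).
Using the csv library was also ruled out because, at least in early versions of the lib, it is hardcoded to read the input line by line.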

I came up with the following snippet:

    def rows(f, chunksize=1024, sep='|'):
        """
        Read a file where the row separator is '|' lazily.

        Usage:

        >>> with open('big.csv') as f:
        >>>     for r in rows(f):
        >>>         process(r)
        """
        incomplete_row = None
        while True:
            chunk = f.read(chunksize)
            if not chunk:  # End of file
                if incomplete_row is not None:
                    yield incomplete_row
                break
            # Split the chunk as long as possible
            while True:
                i = chunk.find(sep)
                if i == -1:
                    break
                # If there is an incomplete row waiting to be yielded,
                # prepend it and set it back to None
                if incomplete_row is not None:
                    yield incomplete_row + chunk[:i]
                    incomplete_row = None
                else:
                    yield chunk[:i]
                chunk = chunk[i+1:]
            # If the chunk contained no separator, it needs to be appended to
            # the current incomplete row.
            if incomplete_row is not None:
                incomplete_row += chunk
            else:
                incomplete_row = chunk

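I have tested it successfully on large files and with different chunk sizes (I even tried a chunk size of 1 byte, just to make sure the algorithm does not depend on the size).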
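For comparison, here is a more compact sketch of the same idea, not the author's code: keep a buffer, split it on the separator, yield every complete piece, and hold back the last piece until more data arrives. The name `split_lazily` is hypothetical. Unlike the snippet above, it drops a trailing empty record when the input ends with the separator.

```python
import io


def split_lazily(file_object, sep="|", chunk_size=1024):
    """Yield sep-delimited records from a text file, one chunk at a time."""
    buffer = ""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:  # end of file
            break
        buffer += chunk
        # All pieces before the last separator are complete records;
        # the final piece may still be continued by the next chunk.
        *complete, buffer = buffer.split(sep)
        yield from complete
    if buffer:
        yield buffer  # last record, which had no trailing separator


# Fields may contain '\n', so line-based reading would break this data
data = "a,1|b,\n2|c,3"
records = list(split_lazily(io.StringIO(data), chunk_size=1))
print(records)  # ['a,1', 'b,\n2', 'c,3']
```

Because the unfinished tail stays in the buffer, this version also copes with a multi-character separator that happens to straddle a chunk boundary.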

    f = ...  # file-like object, i.e. supporting read(size) function and
             # returning empty string '' when there is nothing to read

    def chunked(file, chunk_size):
        return iter(lambda: file.read(chunk_size), '')

    for data in chunked(f, 65536):
        process_data(data)  # process the data

UPDATE: The approach is best explained in https://stackoverflow.com/a/4566523/38592

I think we can write it like this:

    def read_file(path, block_size=1024):
        with open(path, 'rb') as f:
            while True:
                piece = f.read(block_size)
                if piece:
                    yield piece
                else:
                    return

    for piece in read_file(path):
        process_piece(piece)

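I'm not allowed to comment because of my low reputation, but SilentGhost's solution should be much easier with file.readlines([sizehint])

Python file methods

Edit: SilentGhost is right, but this should be better than:

    s = ""
    for i in range(100):
        s += next(file)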

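I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) required is known:

    def get_line():
        with open('4gb_file') as file:
            for i in file:
                yield i

    lines_required = 100
    gen = get_line()
    chunk = [i for i, j in zip(gen, range(lines_required))]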

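Update: Thanks nosklo. Here's what I meant. It almost works, except that it loses a line "between" chunks.

    chunk = [next(gen) for i in range(lines_required)]

This does the trick without losing any lines, but it doesn't look very nice.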
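The line goes missing in the zip version because zip pulls an item from gen first and only then discovers that range is exhausted, so that extra item is discarded. One alternative, sketched here with the hypothetical name `line_batches`, is `itertools.islice`, which takes at most N items and simply returns fewer at the end of the file instead of raising StopIteration.

```python
import io
from itertools import islice


def line_batches(file_object, lines_required):
    """Yield lists of up to lines_required lines until the file is exhausted."""
    while True:
        # islice consumes exactly lines_required items (or fewer at EOF)
        batch = list(islice(file_object, lines_required))
        if not batch:
            break
        yield batch


# Demo on an in-memory stream standing in for the 4GB file
stream = io.StringIO("1\n2\n3\n4\n5\n")
batches = list(line_batches(stream, 2))
print(batches)  # [['1\n', '2\n'], ['3\n', '4\n'], ['5\n']]
```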

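To process line by line, this is an elegant solution:

    def stream_lines(file_name):
        file = open(file_name)
        while True:
            line = file.readline()
            if not line:
                file.close()
                break
            yield line

Blank lines are not a problem here: readline() returns '\n' for a blank line and an empty string only at end of file.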

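You can use the following code.

    file_obj = open('big_file')

open() returns a file object.

Then use os.stat to get the size:

    import os

    file_size = os.stat('big_file').st_size

    # Round up so the final partial chunk is not skipped
    for i in range((file_size + 1023) // 1024):
        print(file_obj.read(1024))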
