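How can I lazily read multiple JSON objects from a file/stream in Python?

I'd like to read multiple JSON objects from a file/stream in Python, one at a time. Unfortunately, json.load() just .read()s until end-of-file; there doesn't seem to be any way to use it to read a single object or to lazily iterate over the objects.

Is there any way to do this? Using the standard library would be ideal, but if there's a third-party library I'd use that instead.

At the moment I'm putting each object on a separate line and using json.loads(f.readline()), but I would really prefer not to need to do this.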

Example use

example.py

    import my_json as json
    import sys

    for o in json.iterload(sys.stdin):
        print("Working on a", type(o))

in.txt

    {"foo": ["bar", "baz"]} 1 2 [] 4 5 6

Example session

    $ python3.2 example.py < in.txt
    Working on a dict
    Working on a int
    Working on a int
    Working on a list
    Working on a int
    Working on a int
    Working on a int

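Here's a much, much simpler solution. The secret is to try, fail, and use the information in the exception to parse correctly. The only limitation is that the file must be seekable.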

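    def stream_read_json(fn):
        import json
        start_pos = 0
        with open(fn, 'r') as f:
            while True:
                try:
                    # Succeeds only when exactly one object is left in the file.
                    obj = json.load(f)
                    yield obj
                    return
                except json.JSONDecodeError as e:
                    # e.pos is where the extra data begins, relative to start_pos,
                    # so re-read just the first object and parse that.
                    # Seeking by character count assumes a single-byte encoding.
                    f.seek(start_pos)
                    json_str = f.read(e.pos)
                    obj = json.loads(json_str)
                    start_pos += e.pos
                    yield obj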

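Edit: just noticed that this will only work for Python >= 3.5. In earlier versions the failure raises a ValueError, and you have to parse the position out of the message string, e.g.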

    def stream_read_json(fn):
        import json
        import re
        start_pos = 0
        with open(fn, 'r') as f:
            while True:
                try:
                    obj = json.load(f)
                    yield obj
                    return
                except ValueError as e:
                    f.seek(start_pos)
                    # Pull the character offset out of the "Extra data" message.
                    end_pos = int(re.match(r'Extra data: line \d+ column \d+ .*\(char (\d+).*\)',
                                           e.args[0]).groups()[0])
                    json_str = f.read(end_pos)
                    obj = json.loads(json_str)
                    start_pos += end_pos
                    yield obj
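
For what it's worth, here is a minimal sketch of the one piece of information this trick depends on, assuming Python 3.5 or later: when json.loads hits trailing data, the JSONDecodeError it raises carries msg "Extra data" and a pos attribute giving the offset where that data starts. The helper name extra_data_offset is made up for illustration.

```python
import json


def extra_data_offset(text):
    """Return the offset where trailing data begins, or None if `text`
    is exactly one JSON document. (Hypothetical helper, for illustration.)"""
    try:
        json.loads(text)
    except json.JSONDecodeError as e:
        if e.msg == "Extra data":
            return e.pos
        raise  # some other syntax error: not our case
    return None


if __name__ == "__main__":
    text = '{"a": 1} 2'
    cut = extra_data_offset(text)
    # Everything before `cut` parses on its own as the first object.
    print(cut, json.loads(text[:cut]))
```

Any other decode error is re-raised here, so genuinely malformed input does not get mistaken for "more objects follow".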

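JSON generally isn't very good for this sort of incremental use. There's no standard way to serialise multiple objects so that they can easily be loaded one at a time, without parsing the whole lot.

The object-per-line solution that you are using is seen elsewhere too. Scrapy calls it 'JSON lines'.

You can do it slightly more Pythonically: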

    for jsonline in f:
        yield json.loads(jsonline)   # or do the processing in this loop

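I think this is the best way: it doesn't rely on any third-party libraries, and it's easy to understand what's going on. I've used it in some of my own code as well.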
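
As a rough sketch of both halves of the one-object-per-line approach (the function names dump_json_lines and iter_json_lines are my own, not from any library): json.dumps without an indent argument never emits a raw newline, so each object is guaranteed to fit on one line, and the reader can go line by line.

```python
import io
import json


def dump_json_lines(objs, fp):
    """Write each object as one compact JSON document per line."""
    for obj in objs:
        fp.write(json.dumps(obj) + "\n")


def iter_json_lines(fp):
    """Lazily yield one decoded object per non-blank line."""
    for line in fp:
        if line.strip():  # tolerate blank lines between records
            yield json.loads(line)


if __name__ == "__main__":
    buf = io.StringIO()
    dump_json_lines([{"foo": ["bar", "baz"]}, 1, 2, [], "multi\nline"], buf)
    buf.seek(0)
    for obj in iter_json_lines(buf):
        print("Working on a", type(obj))
```

Newlines inside strings are escaped as \n by json.dumps, which is why the last item above still round-trips.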

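Of course you can do this. You just have to go to raw_decode directly. This implementation loads the whole file into memory and operates on that string (much as json.load does); if you have large files, you can modify it to read from the file only as necessary without much difficulty.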

    import json
    from json.decoder import WHITESPACE

    def iterload(string_or_fp, cls=json.JSONDecoder, **kwargs):
        # Accept either a file-like object or a string.
        if hasattr(string_or_fp, 'read'):
            string = string_or_fp.read()
        else:
            string = str(string_or_fp)

        decoder = cls(**kwargs)
        # raw_decode does not skip leading whitespace, so do it by hand.
        idx = WHITESPACE.match(string, 0).end()
        while idx < len(string):
            obj, end = decoder.raw_decode(string, idx)
            yield obj
            idx = WHITESPACE.match(string, end).end()

Usage: just as you requested, it's a generator.
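
Here is one way the "read from the file only as necessary" modification might look. This is my own sketch, not the answer's code, and iterload_chunked and chunk_size are hypothetical names. It keeps a buffer, tries raw_decode on it, and reads another chunk whenever decoding fails. The one subtlety is that numbers and literals are not self-delimiting (a buffer ending in 12 might continue as 123), so those are only accepted once something follows them or the file has ended.

```python
import io
import json

_WS = " \t\n\r"  # the four whitespace characters JSON allows


def iterload_chunked(fp, chunk_size=4096):
    """Lazily yield concatenated JSON values from a text stream."""
    decoder = json.JSONDecoder()
    buf = ""
    eof = False
    while True:
        buf = buf.lstrip(_WS)  # raw_decode rejects leading whitespace
        if buf:
            try:
                obj, end = decoder.raw_decode(buf)
            except ValueError:
                if eof:
                    raise  # truncated or malformed input
            else:
                # Objects, arrays and strings end with a closing delimiter;
                # anything else needs a following whitespace or end of file.
                complete = (
                    eof
                    or buf[end - 1] in '}]"'
                    or (end < len(buf) and buf[end] in _WS)
                )
                if complete:
                    yield obj
                    buf = buf[end:]
                    continue
        elif eof:
            return
        chunk = fp.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True


if __name__ == "__main__":
    stream = io.StringIO('{"foo": ["bar", "baz"]} 1 2 [] 4 5 6')
    for o in iterload_chunked(stream, chunk_size=4):
        print("Working on a", type(o))
```

The trade-off is that a large object spanning many chunks gets re-scanned on every attempt, so a bigger chunk_size is usually the better choice.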

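This is actually a pretty nasty problem, because you have to stream in lines, but pattern match braces across multiple lines, and also pattern match JSON. It's a sort of JSON pre-parse followed by a JSON parse. JSON is, in comparison to other formats, easy to parse, so it's not always necessary to reach for a parsing library. Nevertheless, how should we solve these conflicting issues?

Generators to the rescue!

The beauty of generators for a problem like this is that you can stack them on top of each other, gradually abstracting away the difficulty of the problem whilst maintaining laziness. I also considered using the mechanism for passing values back into a generator (send()), but fortunately found I didn't need it.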
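
To make the stacking idea concrete, here is a toy sketch (all names are made up, and it only handles whitespace-separated scalars, not the braces problem): each stage is a generator consuming the one below it, and nothing is pulled from the source until the top stage is asked for a value.

```python
import json


def tracked(lines, log):
    """Bottom stage: pass lines through, recording each one as it is pulled."""
    for line in lines:
        log.append(line)
        yield line


def tokens(lines):
    """Middle stage: split each line into whitespace-separated tokens."""
    for line in lines:
        for tok in line.split():
            yield tok


def scalars(toks):
    """Top stage: decode each token as a JSON scalar."""
    for tok in toks:
        yield json.loads(tok)


if __name__ == "__main__":
    pulled = []
    it = scalars(tokens(tracked(["1 2\n", "3\n"], pulled)))
    first = next(it)
    # Only the first line has been read at this point.
    print(first, pulled)
```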

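To solve the first of the problems you need some sort of streamingfinditer, a streaming version of re.finditer. My attempt at this below pulls in lines as needed (uncomment the debug statement to see) whilst still returning matches. I then modified it slightly to yield non-matched lines as well as matches (marked as 0 or 1 in the first part of the yielded tuple).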

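    import re

    def streamingfinditer(pat, stream):
        for s in stream:
            # print("Read next line: " + s)
            while 1:
                m = re.search(pat, s)
                if not m:
                    # no (further) match on this line: hand back the remainder
                    yield (0, s)
                    break
                yield (1, m.group())
                # continue with whatever follows the match
                s = re.split(pat, s, maxsplit=1)[1]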

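With that, it's possible to match up to each brace, keep track each time of whether the braces are balanced, and then return either simple or compound objects as appropriate.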

    braces='{}[]'
    whitespaceesc=' \t'
    bracesesc='\\'+'\\'.join(braces)
    balancemap=dict(zip(braces,[1,-1,1,-1]))
    bracespat='['+bracesesc+']'
    nobracespat='[^'+bracesesc+']*'
    untilbracespat=nobracespat+bracespat

    def simpleorcompoundobjects(stream):
        obj = ""
        unbalanced = 0
        for (c,m) in streamingfinditer(re.compile(untilbracespat),stream):
            if (c == 0): # remainder of line returned, nothing interesting
                if (unbalanced == 0):
                    yield (0,m)
                else:
                    obj += m
            if (c == 1): # match returned
                if (unbalanced == 0):
                    yield (0,m[:-1])
                    obj += m[-1]
                else:
                    obj += m
                unbalanced += balancemap[m[-1]]
                if (unbalanced == 0):
                    yield (1,obj)
                    obj=""

This returns tuples as follows:

    (0,"String of simple non-braced objects easy to parse")
    (1,"{ 'Compound' : 'objects' }")

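Basically that's the nasty part done. We now just have to do the final level of parsing as we see fit. For example, we can use Jeremy Roman's iterload function (thanks!) to do the parsing for a single line: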

    def streamingiterload(stream):
        for c,o in simpleorcompoundobjects(stream):
            for x in iterload(o):
                yield x

Test it:

    of = open("test.json","w")
    of.write("""[ "hello" ] { "goodbye" : 1 } 1 2 {
    } 2
    9 78
     4 5 { "animals" : [ "dog" , "lots of mice" ,
     "cat" ] }
    """)
    of.close()
    # open & stream the json
    f = open("test.json","r")
    for o in streamingiterload(f.readlines()):
        print(o)
    f.close()

I get these results (and if you turn on that debug line, you'll see it pulls in the lines as needed):

    [u'hello']
    {u'goodbye': 1}
    1
    2
    {}
    2
    9
    78
    4
    5
    {u'animals': [u'dog', u'lots of mice', u'cat']}

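This won't work for all situations. Because of how the json library is implemented, it is impossible to get this entirely right without reimplementing the parser yourself.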
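
One concrete case where plain brace counting goes wrong, as far as I can tell, is a brace or bracket inside a string literal. The sketch below (function names are made up for illustration) compares a naive counter with one that skips over string contents and backslash escapes.

```python
def naive_depth(text):
    """Count braces/brackets without looking at context."""
    depth = 0
    for ch in text:
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
    return depth


def string_aware_depth(text):
    """Same count, but ignore anything inside a JSON string literal."""
    depth = 0
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is inert
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False    # closing quote
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
    return depth


if __name__ == "__main__":
    sample = '{"a": "}"}'  # one complete object whose value contains a brace
    print(naive_depth(sample), string_aware_depth(sample))
```

A depth of zero at the end means the text is balanced; the naive counter ends up below zero on the sample because it counts the brace inside the string.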

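Maybe a bit late, but I had this exact problem (well, more or less). My standard solution for these problems is usually to just do a regex split on some well-known root object, but in my case that was impossible. The only feasible way to do this generically is to implement a proper tokenizer.

After not finding a generic and reasonably well-performing solution, I ended up doing this myself and wrote the splitstream module. It is a pre-tokenizer that understands JSON and XML and splits a continuous stream into multiple chunks for parsing (it leaves the actual parsing up to you, though). To get some kind of performance out of it, it is written as a C module.

Example: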

    import json
    from splitstream import splitfile

    def iterload(stream):  # e.g. iterload(sys.stdin)
        for jsonstr in splitfile(stream, format="json"):
            yield json.loads(jsonstr)

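I'd like to provide a solution. The key idea is to "try" to decode: if it fails, feed it more input; otherwise use the offset information to prepare the next decode.

However, the current json module can't tolerate whitespace at the head of the string to be decoded, so I have to strip it off.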

    import sys
    import json

    def iterload(file):
        buffer = ""
        dec = json.JSONDecoder()
        for line in file:
            buffer = buffer.strip(" \n\r\t") + line.strip(" \n\r\t")
            while(True):
                try:
                    r = dec.raw_decode(buffer)
                except ValueError:
                    # incomplete so far: wait for the next line
                    break
                yield r[0]
                buffer = buffer[r[1]:].strip(" \n\r\t")


    for o in iterload(sys.stdin):
        print("Working on a", type(o), o)

I have tested it on several txt files, and it works fine.

(in1.txt)

    {"foo": ["bar", "baz"] } 1 2 [ ] 4 {"foo1": ["bar1", {"foo2":{"A":1, "B":3}, "DDD":4}] } 5 6

(in2.txt)

    {"foo" : ["bar", "baz"] } 1 2 [ ] 4 5 6

(in.txt, your original)

    {"foo": ["bar", "baz"]} 1 2 [] 4 5 6

(output for Benedict's test case)

    python test.py < in.txt
    ('Working on a', <type 'list'>, [u'hello'])
    ('Working on a', <type 'dict'>, {u'goodbye': 1})
    ('Working on a', <type 'int'>, 1)
    ('Working on a', <type 'int'>, 2)
    ('Working on a', <type 'dict'>, {})
    ('Working on a', <type 'int'>, 2)
    ('Working on a', <type 'int'>, 9)
    ('Working on a', <type 'int'>, 78)
    ('Working on a', <type 'int'>, 4)
    ('Working on a', <type 'int'>, 5)
    ('Working on a', <type 'dict'>, {u'animals': [u'dog', u'lots of mice', u'cat']})

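You can't do this with the standard library. I looked through the source code of the json module, and it can't be used this way without reimplementing most of it.

I used @wuilang's elegant solution. The simple approach (read a byte, try to decode, read a byte, try to decode, ...) worked, but unfortunately it was very slow.

In my case, I was trying to read "pretty-printed" JSON objects of the same object type from a file. This allowed me to optimize the approach: I could read the file line by line, only decoding when I found a line that contained exactly "}":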

    import json

    def iterload(stream):
        buf = ""
        dec = json.JSONDecoder()
        for line in stream:
            line = line.rstrip()
            buf = buf + line
            if line == "}":
                # raw_decode returns (object, end_index); keep the object
                yield dec.raw_decode(buf)[0]
                buf = ""

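If you happen to be working with one-per-line compact JSON that escapes newlines in string literals, then you can safely simplify this approach even more: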

    def iterload(stream):
        dec = json.JSONDecoder()
        for line in stream:
            # raw_decode returns (object, end_index); keep the object
            yield dec.raw_decode(line)[0]

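Obviously, these simple approaches only work for very specific kinds of JSON. However, if those assumptions hold, these solutions work correctly and quickly.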
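
To spell out the assumption behind the closing-brace trick, here is a small round-trip sketch (dump_pretty and iter_pretty_objects are hypothetical names). It assumes the file was written with json.dumps(obj, indent=...) and that every top-level value is a non-empty object: with an indent, nested closing braces are indented, so only the top-level one sits alone at column 0.

```python
import io
import json


def dump_pretty(objs, fp, indent=2):
    """Write pretty-printed objects back to back, one after another."""
    for obj in objs:
        fp.write(json.dumps(obj, indent=indent) + "\n")


def iter_pretty_objects(stream):
    """Yield one object each time a line consisting solely of '}' is seen."""
    lines = []
    for line in stream:
        lines.append(line)
        if line.rstrip() == "}":
            yield json.loads("".join(lines))
            lines = []


if __name__ == "__main__":
    buf = io.StringIO()
    dump_pretty([{"a": {"b": 1}}, {"c": [1, {"d": 2}]}], buf)
    buf.seek(0)
    for obj in iter_pretty_objects(buf):
        print(obj)
```

An empty top-level object is written as {} on a single line, so it would never trigger the check; that is one of the cases this shortcut does not cover.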

Here's mine:

    import simplejson as json
    from simplejson import JSONDecodeError

    class StreamJsonListLoader():
        """
        When you have a big JSON file containing a list, such as

        [{
            ...
        },
        {
            ...
        },
        {
            ...
        },
        ...
        ]

        And it's too big to be practically loaded into memory and parsed by json.load,
        This class comes to the rescue. It lets you lazy-load the large json list.
        """

        def __init__(self, filename_or_stream):
            if type(filename_or_stream) == str:
                self.stream = open(filename_or_stream)
            else:
                self.stream = filename_or_stream

            if not self.stream.read(1) == '[':
                raise NotImplementedError('Only JSON-streams of lists (that start with a [) are supported.')

        def __iter__(self):
            return self

        def next(self):
            read_buffer = self.stream.read(1)
            while True:
                try:
                    json_obj = json.loads(read_buffer)

                    if not self.stream.read(1) in [',',']']:
                        raise Exception('JSON seems to be malformed: object is not followed by comma (,) or end of list (]).')
                    return json_obj
                except JSONDecodeError:
                    # Not a complete object yet: read up to the next '}' and retry.
                    next_char = self.stream.read(1)
                    read_buffer += next_char
                    while next_char != '}':
                        next_char = self.stream.read(1)
                        if next_char == '':
                            raise StopIteration
                        read_buffer += next_char

        __next__ = next  # Python 3 iterator protocol
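
For comparison, here is a rough standard-library sketch of the same "lazily walk one big top-level list" idea. It is not a port of the class above; iter_list_items and chunk_size are names I made up. It reads in chunks, uses raw_decode on the buffer, and only accepts an element once the comma or closing bracket after it has been seen, so elements of any type (not just objects) work.

```python
import io
import json

_WS = " \t\n\r"


def iter_list_items(fp, chunk_size=65536):
    """Lazily yield the elements of one top-level JSON list from a text stream."""
    decoder = json.JSONDecoder()
    buf = ""
    started = False
    while True:
        buf = buf.lstrip(_WS)
        if not started and buf:
            if buf[0] != "[":
                raise ValueError("expected a top-level JSON list")
            started, buf = True, buf[1:]
            continue
        if started and buf[:1] == "]":
            return                      # end of the list
        if started and buf[:1] == ",":
            buf = buf[1:]               # separator (this sketch is lenient here)
            continue
        if started and buf:
            try:
                obj, end = decoder.raw_decode(buf)
            except ValueError:
                pass                    # probably incomplete: read more below
            else:
                rest = buf[end:].lstrip(_WS)
                if rest[:1] in (",", "]"):
                    yield obj
                    buf = rest
                    continue
        chunk = fp.read(chunk_size)
        if not chunk:
            raise ValueError("unexpected end of input inside JSON list")
        buf += chunk


if __name__ == "__main__":
    for item in iter_list_items(io.StringIO('[{"a": 1}, {"b": [2, 3]}]'), chunk_size=5):
        print(item)
```

Only the current element (plus at most one chunk) is held in memory, but note that this sketch does not validate comma placement strictly.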
