How to prefetch data using a custom Python function in TensorFlow

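I am trying to prefetch training data to hide I/O latency. I would like to write custom Python code that loads data from disk and preprocesses it (for example, by adding a context window). In other words, one thread does the data preprocessing and another does the training. Is this possible in TensorFlow?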

Update: I have a working example based on @mrry's example.

    import threading

    import numpy as np
    import tensorflow as tf

    BATCH_SIZE = 5
    TRAINING_ITERS = 4100

    feature_input = tf.placeholder(tf.float32, shape=[128])
    label_input = tf.placeholder(tf.float32, shape=[128])

    q = tf.FIFOQueue(200, [tf.float32, tf.float32], shapes=[[128], [128]])
    enqueue_op = q.enqueue([label_input, feature_input])

    label_batch, feature_batch = q.dequeue_many(BATCH_SIZE)
    c = tf.reshape(feature_batch, [BATCH_SIZE, 128]) + tf.reshape(label_batch, [BATCH_SIZE, 128])

    sess = tf.Session()

    def load_and_enqueue(sess, enqueue_op, coord):
        # Open in binary mode: the files hold raw float32 values.
        with open('dummy_data/features.bin', 'rb') as feature_file, \
                open('dummy_data/labels.bin', 'rb') as label_file:
            while not coord.should_stop():
                feature_array = np.fromfile(feature_file, np.float32, 128)
                if feature_array.shape[0] == 0:
                    print('reach end of file, reset using seek(0,0)')
                    feature_file.seek(0, 0)
                    label_file.seek(0, 0)
                    continue
                label_value = np.fromfile(label_file, np.float32, 128)

                sess.run(enqueue_op, feed_dict={feature_input: feature_array,
                                                label_input: label_value})

    coord = tf.train.Coordinator()

    # Enqueue data on a background thread while the main thread consumes it.
    t = threading.Thread(target=load_and_enqueue, args=(sess, enqueue_op, coord))
    t.start()

    for i in range(TRAINING_ITERS):
        result = sess.run(c)
        print('train_iter=' + str(i))
        print(result)

    coord.request_stop()
    coord.join([t])
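The example above expects two files of raw float32 values with 128 floats per record. As a rough sketch of how such test files could be produced and read back with only the standard library, the following assumes native byte order and a 4-byte C float (true on common platforms). The helper names `write_dummy_records` and `read_record`, the record count, and the fill values are hypothetical and chosen only for illustration.

```python
import os
import tempfile
from array import array

RECORD_LEN = 128  # floats per record, matching shape=[128] in the example above


def write_dummy_records(path, num_records, record_len=RECORD_LEN):
    """Write num_records records of record_len native float32 values to path."""
    with open(path, 'wb') as f:
        for r in range(num_records):
            # Record r is filled with the value r, which makes read-back checks easy.
            array('f', [float(r)] * record_len).tofile(f)


def read_record(f, record_len=RECORD_LEN):
    """Read one record from an open binary file; return None at end of file."""
    values = array('f')
    try:
        values.fromfile(f, record_len)
    except EOFError:
        return None
    return values.tolist()


# Demonstrate in a temporary directory so nothing is written to the working tree.
demo_dir = tempfile.mkdtemp()
features_path = os.path.join(demo_dir, 'features.bin')
write_dummy_records(features_path, num_records=10)

with open(features_path, 'rb') as f:
    first = read_record(f)
    second = read_record(f)
```

To feed the example above, the same helper could be pointed at `dummy_data/features.bin` and `dummy_data/labels.bin` once that directory exists.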

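This is a common use case, and most implementations use TensorFlow's queues to decouple the preprocessing code from the training code. There is a tutorial on how to use queues, but the main steps are as follows: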

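1. Define a queue, q, that will buffer the preprocessed data. TensorFlow supports the simple tf.FIFOQueue, which produces elements in the order they were enqueued, and the more advanced tf.RandomShuffleQueue, which produces elements in a random order. A queue element is a tuple of one or more tensors (which can have different types and shapes). All queues support single-element (enqueue, dequeue) and batch (enqueue_many, dequeue_many) operations, but to use the batch operations you must specify the shape of each tensor in a queue element when constructing the queue.
2. Build a subgraph that enqueues preprocessed elements into the queue. One way to do this is to define some tf.placeholder() ops for tensors corresponding to a single input example, then pass them to q.enqueue(). (If your preprocessing produces a batch at once, you should use q.enqueue_many() instead.) You can also include TensorFlow ops in this subgraph.
3. Build a subgraph that performs training. This looks like a regular TensorFlow graph, but gets its input by calling q.dequeue_many(BATCH_SIZE).
4. Start your session.
5. Create one or more threads that execute your preprocessing logic and then run the enqueue op, feeding in the preprocessed data. You may find the tf.train.Coordinator and tf.train.QueueRunner utility classes useful for this.
6. Run your training graph (optimizer, etc.) as normal.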
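The steps above follow the classic bounded producer/consumer pattern. As a framework-free illustration of the same idea (this is not TensorFlow code), here is a minimal sketch using Python's standard `queue.Queue` and `threading`. The `preprocess`, `producer`, `dequeue_many`, and `run_training` functions, the queue capacity, and the batch sizes are assumptions made up for this sketch.

```python
import queue
import threading

BATCH_SIZE = 4
NUM_BATCHES = 3


def preprocess(i):
    # Hypothetical stand-in for loading and preprocessing example number i.
    return (float(i), i % 2)


def producer(q, stop_event):
    """Fill the bounded queue until asked to stop."""
    i = 0
    while not stop_event.is_set():
        try:
            # The timeout lets the thread notice a stop request
            # even when the queue is full.
            q.put(preprocess(i), timeout=0.05)
        except queue.Full:
            continue  # retry the same example
        i += 1


def dequeue_many(q, n):
    """Block until n elements are available and return them as one batch."""
    return [q.get() for _ in range(n)]


def run_training(num_batches=NUM_BATCHES, batch_size=BATCH_SIZE):
    q = queue.Queue(maxsize=8)      # bounded buffer, like the queue capacity
    stop_event = threading.Event()  # plays the role of a coordinator
    t = threading.Thread(target=producer, args=(q, stop_event), daemon=True)
    t.start()

    batches = []
    for _ in range(num_batches):
        # A real training step would consume the batch here.
        batches.append(dequeue_many(q, batch_size))

    stop_event.set()
    t.join()
    return batches


batches = run_training()
```

Because the queue is bounded, the producer stays at most a few elements ahead of the consumer, which limits memory use while still overlapping loading with training.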

EDIT: Here is a simple load_and_enqueue() function and code snippet to get you started:

    import threading

    import numpy
    import tensorflow as tf

    # Features are length-100 vectors of floats
    feature_input = tf.placeholder(tf.float32, shape=[100])
    # Labels are scalar integers.
    label_input = tf.placeholder(tf.int32, shape=[])

    # Alternatively, could do:
    # feature_batch_input = tf.placeholder(tf.float32, shape=[None, 100])
    # label_batch_input = tf.placeholder(tf.int32, shape=[None])

    q = tf.FIFOQueue(100, [tf.float32, tf.int32], shapes=[[100], []])
    enqueue_op = q.enqueue([feature_input, label_input])

    # For batch input, do:
    # enqueue_op = q.enqueue_many([feature_batch_input, label_batch_input])

    feature_batch, label_batch = q.dequeue_many(BATCH_SIZE)
    # Build rest of model taking label_batch, feature_batch as input.
    # [...]
    train_op = ...

    sess = tf.Session()

    def load_and_enqueue():
        with open(...) as feature_file, open(...) as label_file:
            while True:
                feature_array = numpy.fromfile(feature_file, numpy.float32, 100)
                # An empty array means end of file. Test the size explicitly,
                # because the truth value of a multi-element array is ambiguous.
                if feature_array.size == 0:
                    return
                # Read the label from the label file, not the feature file.
                label_value = numpy.fromfile(label_file, numpy.int32, 1)[0]

                sess.run(enqueue_op, feed_dict={feature_input: feature_array,
                                                label_input: label_value})

    # Start a thread to enqueue data asynchronously, and hide I/O latency.
    t = threading.Thread(target=load_and_enqueue)
    t.start()

    for _ in range(TRAINING_EPOCHS):
        sess.run(train_op)

In other words, one thread does the data preprocessing and another does the training. Is this possible in TensorFlow?

Yes, it is. mrry's solution works, but a simpler one exists.

Fetching data

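tf.py_func wraps a Python function and uses it as a TensorFlow operator, so we can load the data on every sess.run() call. The problem with this approach is that the data is loaded during sess.run(), on the main thread.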

A minimal example:

    def get_numpy_tensor():
        return np.array([[1, 2], [3, 4]], dtype=np.float32)

    tensorflow_tensor = tf.py_func(get_numpy_tensor, [], tf.float32)

A more complex example:

    def get_numpy_tensors():
        # Load data from the disk into numpy arrays.
        input = np.array([[1, 2], [3, 4]], dtype=np.float32)
        target = np.int32(1)
        return input, target

    tensorflow_input, tensorflow_target = tf.py_func(get_numpy_tensors, [], [tf.float32, tf.int32])

    tensorflow_input, tensorflow_target = 2*tensorflow_input, 2*tensorflow_target

    sess = tf.InteractiveSession()
    numpy_input, numpy_target = sess.run([tensorflow_input, tensorflow_target])
    assert np.all(numpy_input == np.array([[2, 4], [6, 8]])) and numpy_target == 2

Prefetching data in another thread

To queue our data in another thread (so that sess.run() does not have to wait for the data), we can apply tf.train.batch() to the operators produced by tf.py_func().

A minimal example:

    tensor_shape = get_numpy_tensor().shape
    tensorflow_tensors = tf.train.batch([tensorflow_tensor], batch_size=32, shapes=[tensor_shape])
    # Run `tf.train.start_queue_runners()` once session is created.

If tensorflow_tensor already has its shape specified, we can omit the shapes argument:

    tensor_shape = get_numpy_tensor().shape
    tensorflow_tensor.set_shape(tensor_shape)
    tensorflow_tensors = tf.train.batch([tensorflow_tensor], batch_size=32)
    # Run `tf.train.start_queue_runners()` once session is created.

A more complex example:

    input_shape, target_shape = (2, 2), ()

    def get_numpy_tensors():
        input = np.random.rand(*input_shape).astype(np.float32)
        target = np.random.randint(10, dtype=np.int32)
        print('f', end='')
        return input, target

    tensorflow_input, tensorflow_target = tf.py_func(get_numpy_tensors, [], [tf.float32, tf.int32])

    batch_size = 2
    tensorflow_inputs, tensorflow_targets = tf.train.batch(
        [tensorflow_input, tensorflow_target], batch_size,
        shapes=[input_shape, target_shape], capacity=2)
    # Internal queue will contain at most `capacity=2` times `batch_size=2`
    # elements `[tensorflow_input, tensorflow_target]`.

    tensorflow_inputs, tensorflow_targets = 2*tensorflow_inputs, 2*tensorflow_targets

    sess = tf.InteractiveSession()
    # Internally, `tf.train.batch` uses a QueueRunner, so we need to ask tf to start it.
    tf.train.start_queue_runners()

    for _ in range(10):
        numpy_inputs, numpy_targets = sess.run([tensorflow_inputs, tensorflow_targets])
        assert numpy_inputs.shape == (batch_size, *input_shape) and numpy_targets.shape == (batch_size, *target_shape)
        print('r', end='')

    # Prints something like `fffffrrffrfrffrffrffrffrffrffrf`
    # (the exact interleaving depends on thread scheduling).

If get_numpy_tensor() returns a whole batch of examples rather than a single one, then tf.train.batch(..., enqueue_many=True) will help.
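To make the difference between enqueueing one element and enqueueing many concrete, here is a plain-Python model of the semantics (not TensorFlow code). In TensorFlow the split happens along the first dimension of each tensor; in this sketch a batch is simply a list of examples. The `enqueue`, `enqueue_many`, `dequeue_many`, and `load_batch` helpers are hypothetical names used only for illustration.

```python
from collections import deque


def enqueue(buffer, element):
    # The argument is stored as a single queue element.
    buffer.append(element)


def enqueue_many(buffer, batch):
    # The argument is split, and each example becomes its own queue element.
    buffer.extend(batch)


def dequeue_many(buffer, n):
    # Regroup n individual elements into one training batch.
    return [buffer.popleft() for _ in range(n)]


def load_batch(start, size=3):
    # Hypothetical loader that returns `size` (feature, label) examples at once.
    return [(float(i), i % 10) for i in range(start, start + size)]


# With enqueue_many, two loaded batches of 3 become 6 separate elements,
# which can then be regrouped into training batches of a different size.
buffer = deque()
enqueue_many(buffer, load_batch(0))
enqueue_many(buffer, load_batch(3))
training_batch = dequeue_many(buffer, 4)

# With plain enqueue, the whole loaded batch counts as a single element.
wrong = deque()
enqueue(wrong, load_batch(0))
```

This is why the loader's batch size and the training batch size do not have to match when the data is enqueued example by example.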
