在C ++中创build稀疏数组的最佳方式是什么？

我正在研究一个需要处理巨大matrix的项目，特别是用于copula计算的金字塔加法。

简而言之，我需要在matrix（multidimensional array）中的零海中追踪相对较less的值（通常值为1，在极less数情况下超过1）。

稀疏数组允许用户存储less量值，并假定所有未定义的logging为预设值。由于在物理上不可能将所有值存储在内存中，因此我只需要存储less量的非零元素。这可能是数百万条目。

速度是一个重要的优先事项，我也想在运行时dynamicselect类中的variables数量。

我目前正在使用二叉search树（b-tree）来存储条目的系统上工作。有谁知道更好的系统？

对于C ++来说，地图运作良好。数百万个物体不会成为问题。在我的电脑上，有一千万件物品花费了大约4.4秒，大约57兆。

我的testing应用程序如下所示：

#include <stdio.h> #include <stdlib.h> #include <map> class triple { public: int x; int y; int z; bool operator<(const triple &other) const { if (x < other.x) return true; if (other.x < x) return false; if (y < other.y) return true; if (other.y < y) return false; return z < other.z; } }; int main(int, char**) { std::map<triple,int> data; triple point; int i; for (i = 0; i < 10000000; ++i) { point.x = rand(); point.y = rand(); point.z = rand(); //printf("%d %d %d %d\n", i, point.x, point.y, point.z); data[point] = i; } return 0; }

现在要dynamicselectvariables的数量，最简单的解决方法是将索引表示为一个string ，然后使用string作为地图的关键字。例如，位于[23] [55]的项目可以通过“23,55”string表示。我们也可以扩展这个解决scheme的更高维度; 如三维任意索引将看起来像“34,45,56”。这种技术的简单实现如下：

 std::map data<string,int> data; char ix[100]; sprintf(ix, "%d,%d", x, y); // 2 vars data[ix] = i; sprintf(ix, "%d,%d,%d", x, y, z); // 3 vars data[ix] = i;

就像一个build议：使用string作为索引的方法实际上是非常缓慢的。一个更高效但是等价的解决scheme是使用向量/数组。绝对不需要在string中编写索引。

 typedef vector<size_t> index_t; struct index_cmp_t : binary_function<index_t, index_t, bool> { bool operator ()(index_t const& a, index_t const& b) const { for (index_t::size_type i = 0; i < a.size(); ++i) if (a[i] != b[i]) return a[i] < b[i]; return false; } }; map<index_t, int, index_cmp_t> data; index_t i(dims); i[0] = 1; i[1] = 2; // … etc. data[i] = 42;

但是，使用map实际上并不是非常有效，因为在平衡二叉search树方面的实现。在这种情况下，更好的执行数据结构将是（随机）散列表。

Boost有一个名为uBLAS的BLAS模板实现，它包含一个稀疏matrix。

http://www.boost.org/doc/libs/1_36_0/libs/numeric/ublas/doc/index.htm

索引比较中的小细节。你需要做一个词典比较，否则：

 a= (1, 2, 1); b= (2, 1, 2); (a<b) == (b<a) is true, but b!=a

编辑：所以比较应该可能是：

 return lhs.x<rhs.x ? true : lhs.x==rhs.x ? lhs.y<rhs.y ? true : lhs.y==rhs.y ? lhs.z<rhs.z : false : false

哈希表有一个快速插入和查找。你可以编写一个简单的哈希函数，因为你知道你只会处理整数对作为键。

Eigen是一个C ++线性代数库，有一个稀疏matrix的实现。它甚至支持对稀疏matrix进行优化的matrix运算和求解器（LU分解等）。

实现稀疏matrix的最好方法是不要实现它们 – 至less不是你自己的。我会build议BLAS（我认为它是LAPACK的一部分）可以处理真正巨大的matrix。

完整的解决scheme列表可以在维基百科中find。为方便起见，我引用了以下相关章节。

https://en.wikipedia.org/wiki/Sparse_matrix#Dictionary_of_keys_.28DOK.29

钥匙字典（DOK）

DOK由一个字典组成，该字典映射（行，列）对 – 元素的值。从字典中缺less的元素被认为是零。这种格式适合按照随机顺序递增地构造一个稀疏matrix，但是在字典顺序上对非零值进行迭代的效果很差。一个通常以这种格式构造一个matrix，然后转换为另一个更有效的格式进行处理[1]。

名单清单（LIL）

LIL每行存储一个列表，每个条目包含列索引和值。通常，这些条目保持按列索引sorting，以便更快地查找。这是增量matrix构造的另一种格式[2]。

坐标清单（COO）

COO存储（行，列，值）元组列表。理想情况下，对条目进行sorting（按行索引，然后列索引），以提高随机访问时间。这是增量matrix构造的另一种格式[3]。

压缩稀疏行（CSR，CRS或耶鲁格式）

压缩稀疏行（CSR）或压缩行存储（CRS）格式表示由三个（一维）数组组成的matrixM，其分别包含非零值，行的范围和列索引。它与COO类似，但压缩了行索引，因此名称。这种格式允许快速行访问和matrix向量乘法（Mx）。

因为只有[a] [b] [c] … [w] [x] [y] [z]的值是重要的，所以我们只存储指针本身，而不是存储在任何地方的值1相同+没有办法散列它。注意维度的诅咒是存在的，build议去一些既定的工具NIST或Boost，至less读取来源，以规避不必要的失误。

如果工作需要捕捉未知数据集的时间依赖性分布和参数趋势，那么具有单值根的Map或B-Tree可能是不实际的。我们只能自己存储指针，如果在运行时对于所有的1个值，sorting（sensation for presentation）都可以从属于时间域的减less。由于除了一个以外的非零值是很less的，因此显而易见的候选对象是任何可以容易理解的数据结构。如果数据集真的是庞大的宇宙大小，我build议某种滑动窗口，自己pipe理文件/磁盘/永久性Io，将数据的一部分移动到需要的范围。（编写可以理解的代码）如果您承诺为工作组提供实际的解决scheme，那么不这样做会让您处于消费级操作系统的摆布之中，这些操作系统的唯一目标就是让您的午餐远离您。

这是一个相对简单的实现，应该提供一个合理的快速查找（使用散列表），以及快速迭代行/列中的非零元素。

 // Copyright 2014 Leo Osvald // // Licensed under the Apache License, Version 2.0 (the "License"); // you may not use this file except in compliance with the License. // You may obtain a copy of the License at // // http://www.apache.org/licenses/LICENSE-2.0 // // Unless required by applicable law or agreed to in writing, software // distributed under the License is distributed on an "AS IS" BASIS, // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. // See the License for the specific language governing permissions and // limitations under the License. #ifndef UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_ #define UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_ #include <algorithm> #include <limits> #include <map> #include <type_traits> #include <unordered_map> #include <utility> #include <vector> // A simple time-efficient implementation of an immutable sparse matrix // Provides efficient iteration of non-zero elements by rows/cols, // eg to iterate over a range [row_from, row_to) x [col_from, col_to): // for (int row = row_from; row < row_to; ++row) { // for (auto col_range = sm.nonzero_col_range(row, col_from, col_to); // col_range.first != col_range.second; ++col_range.first) { // int col = *col_range.first; // // use sm(row, col) // ... // } template<typename T = double, class Coord = int> class SparseMatrix { struct PointHasher; typedef std::map< Coord, std::vector<Coord> > NonZeroList; typedef std::pair<Coord, Coord> Point; public: typedef T ValueType; typedef Coord CoordType; typedef typename NonZeroList::mapped_type::const_iterator CoordIter; typedef std::pair<CoordIter, CoordIter> CoordIterRange; SparseMatrix() = default; // Reads a matrix stored in MatrixMarket-like format, ie: // <num_rows> <num_cols> <num_entries> // <row_1> <col_1> <val_1> // ... // Note: the header (lines starting with '%' are ignored). template<class InputStream, size_t max_line_length = 1024> void Init(InputStream& is) { rows_.clear(), cols_.clear(); values_.clear(); // skip the header (lines beginning with '%', if any) decltype(is.tellg()) offset = 0; for (char buf[max_line_length + 1]; is.getline(buf, sizeof(buf)) && buf[0] == '%'; ) offset = is.tellg(); is.seekg(offset); size_t n; is >> row_count_ >> col_count_ >> n; values_.reserve(n); while (n--) { Coord row, col; typename std::remove_cv<T>::type val; is >> row >> col >> val; values_[Point(--row, --col)] = val; rows_[col].push_back(row); cols_[row].push_back(col); } SortAndShrink(rows_); SortAndShrink(cols_); } const T& operator()(const Coord& row, const Coord& col) const { static const T kZero = T(); auto it = values_.find(Point(row, col)); if (it != values_.end()) return it->second; return kZero; } CoordIterRange nonzero_col_range(Coord row, Coord col_from, Coord col_to) const { CoordIterRange r; GetRange(cols_, row, col_from, col_to, &r); return r; } CoordIterRange nonzero_row_range(Coord col, Coord row_from, Coord row_to) const { CoordIterRange r; GetRange(rows_, col, row_from, row_to, &r); return r; } Coord row_count() const { return row_count_; } Coord col_count() const { return col_count_; } size_t nonzero_count() const { return values_.size(); } size_t element_count() const { return size_t(row_count_) * col_count_; } private: typedef std::unordered_map<Point, typename std::remove_cv<T>::type, PointHasher> ValueMap; struct PointHasher { size_t operator()(const Point& p) const { return p.first << (std::numeric_limits<Coord>::digits >> 1) ^ p.second; } }; static void SortAndShrink(NonZeroList& list) { for (auto& it : list) { auto& indices = it.second; indices.shrink_to_fit(); std::sort(indices.begin(), indices.end()); } // insert a sentinel vector to handle the case of all zeroes if (list.empty()) list.emplace(Coord(), std::vector<Coord>(Coord())); } static void GetRange(const NonZeroList& list, Coord i, Coord from, Coord to, CoordIterRange* r) { auto lr = list.equal_range(i); if (lr.first == lr.second) { r->first = r->second = list.begin()->second.end(); return; } auto begin = lr.first->second.begin(), end = lr.first->second.end(); r->first = lower_bound(begin, end, from); r->second = lower_bound(r->first, end, to); } ValueMap values_; NonZeroList rows_, cols_; Coord row_count_, col_count_; }; #endif /* UTIL_IMMUTABLE_SPARSE_MATRIX_HPP_ */

为了简单起见，它是immutable ，但你可以使它变得可变; 如果你想要一个合理的有效的“插入”（把零改成一个非零），一定要把std::vector改成std::vector std::set 。

我会build议做一些事情：

 typedef std::tuple<int, int, int> coord_t; typedef boost::hash<coord_t> coord_hash_t; typedef std::unordered_map<coord_hash_t, int, c_hash_t> sparse_array_t; sparse_array_t the_data; the_data[ { x, y, z } ] = 1; /* list-initialization is cool */ for( const auto& element : the_data ) { int xx, yy, zz, val; std::tie( std::tie( xx, yy, zz ), val ) = element; /* ... */ }

为了保持数据稀疏，您可能需要编写unorderd_map的子类，其迭代器会自动跳过（并擦除）任何值为0的项。

在C ++中创build稀疏数组的最佳方式是什么？

最佳做法：腌制和胡椒粉密码？

bcrypt如何能够内build盐？

如何获得C ++文件的MD5哈希值？

散列字典？

什么是好的散列函数？

如何在没有Iterator的情况下迭代Set / HashSet？

等效的.try（）散列，以避免“未定义的方法”错误的零？

如何生成一个独特的散列码在Android的stringinput…？

是时候（）好盐吗？

C最小的散列函数？