What is the fastest way to transpose a matrix in C++?

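I have a matrix (relatively big) that I need to transpose. For example, assume that my matrix is

    a b c d e f
    g h i j k l
    m n o p q r

I want the result to be as follows:

    a g m
    b h n
    c i o
    d j p
    e k q
    f l r

What is the fastest way to do this?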

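This is a good question. There are many reasons you would want to actually transpose the matrix in memory rather than just swap coordinates, e.g. in matrix multiplication and Gaussian smearing.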

First let me list one of the functions I use for the transpose (EDIT: please see the end of my answer, where I found a much faster solution)

    // src is N x M (row-major), dst is M x N (row-major)
    void transpose(float *src, float *dst, const int N, const int M) {
        #pragma omp parallel for
        for(int n = 0; n<N*M; n++) {
            int i = n/N;  // row in dst
            int j = n%N;  // column in dst
            dst[n] = src[M*j + i];
        }
    }

Now let's see why the transpose is useful. Consider the matrix multiplication C = A*B. We could do it this way.

    for(int i=0; i<N; i++) {
        for(int j=0; j<K; j++) {
            float tmp = 0;
            for(int l=0; l<M; l++) {
                tmp += A[M*i+l]*B[K*l+j];
            }
            C[K*i + j] = tmp;
        }
    }

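That way, however, is going to have a lot of cache misses, because the inner loop walks down a column of B. A much faster solution is to take the transpose of B first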

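    transpose(B);  // pseudocode: B (M x K) becomes K x M
    for(int i=0; i<N; i++) {
        for(int j=0; j<K; j++) {
            float tmp = 0;
            for(int l=0; l<M; l++) {
                tmp += A[M*i+l]*B[M*j+l];  // row j of the transposed B has stride M
            }
            C[K*i + j] = tmp;
        }
    }
    transpose(B);  // restore the original layout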

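Matrix multiplication is O(n^3) and the transpose is O(n^2), so taking the transpose should have a negligible effect on the computation time (for large n). In matrix multiplication, loop tiling is even more effective than taking the transpose, but it is much more complicated.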
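
To make the comparison concrete, here is a minimal sketch that computes the product both ways so the results can be checked against each other. The names `matmul_naive` and `matmul_transposed_b` are hypothetical, and as an assumption of this sketch the transposed copy of B goes into a scratch buffer instead of being transposed in place.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version: C (N x K) = A (N x M) * B (M x K), all row-major.
std::vector<float> matmul_naive(const std::vector<float>& A,
                                const std::vector<float>& B,
                                int N, int M, int K) {
    assert(A.size() == static_cast<std::size_t>(N) * M);
    assert(B.size() == static_cast<std::size_t>(M) * K);
    std::vector<float> C(static_cast<std::size_t>(N) * K, 0.0f);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < K; j++) {
            float tmp = 0;
            for (int l = 0; l < M; l++) {
                tmp += A[M*i + l] * B[K*l + j];  // strided walk down a column of B
            }
            C[K*i + j] = tmp;
        }
    }
    return C;
}

// Same product, but B is first copied into Bt (K x M) so that the
// inner loop reads both operands contiguously.
std::vector<float> matmul_transposed_b(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       int N, int M, int K) {
    assert(A.size() == static_cast<std::size_t>(N) * M);
    assert(B.size() == static_cast<std::size_t>(M) * K);
    std::vector<float> Bt(B.size());
    for (int l = 0; l < M; l++) {
        for (int j = 0; j < K; j++) {
            Bt[M*j + l] = B[K*l + j];
        }
    }
    std::vector<float> C(static_cast<std::size_t>(N) * K, 0.0f);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < K; j++) {
            float tmp = 0;
            for (int l = 0; l < M; l++) {
                tmp += A[M*i + l] * Bt[M*j + l];  // both rows are contiguous
            }
            C[K*i + j] = tmp;
        }
    }
    return C;
}
```

Both functions accumulate in the same order, so they should produce identical results; only the memory access pattern differs.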

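I wish I knew a faster way to do the transpose (Edit: I found a faster solution, see the end of my answer). When Haswell/AVX2 comes out in a few weeks it will have a gather function. I don't know if that will be helpful in this case, but I could imagine gathering a column and writing out a row. Maybe it will make the transpose unnecessary.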

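For Gaussian smearing, what you do is smear horizontally and then smear vertically. But smearing vertically has the cache problem, so what you do is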

    Smear image horizontally
    transpose output
    Smear output horizontally
    transpose output

Here is a paper by Intel explaining that: http://software.intel.com/zh-cn/articles/iir-gaussian-blur-filter-implementation-using-intel-advanced-vector-extensions
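
The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the 3-tap kernel (0.25, 0.5, 0.25) with clamp-to-edge borders stands in for a real Gaussian, and the names `smear_rows`, `transpose_copy` and `smear_2d` are hypothetical, not taken from the Intel article.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Smear each row with an illustrative 3-tap kernel (assumption, not a true Gaussian).
std::vector<float> smear_rows(const std::vector<float>& img, int rows, int cols) {
    assert(img.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<float> out(img.size());
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            const int cl = std::max(c - 1, 0);         // clamp at the left edge
            const int cr = std::min(c + 1, cols - 1);  // clamp at the right edge
            out[r*cols + c] = 0.25f * img[r*cols + cl]
                            + 0.5f  * img[r*cols + c]
                            + 0.25f * img[r*cols + cr];
        }
    }
    return out;
}

// Out-of-place transpose of a rows x cols matrix into a cols x rows matrix.
std::vector<float> transpose_copy(const std::vector<float>& src, int rows, int cols) {
    assert(src.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<float> dst(src.size());
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            dst[c*rows + r] = src[r*cols + c];
    return dst;
}

// Horizontal smear, transpose, horizontal smear again, transpose back.
std::vector<float> smear_2d(const std::vector<float>& img, int rows, int cols) {
    std::vector<float> h  = smear_rows(img, rows, cols);    // rows x cols
    std::vector<float> ht = transpose_copy(h, rows, cols);  // cols x rows
    std::vector<float> v  = smear_rows(ht, cols, rows);     // smears along the original columns
    return transpose_copy(v, cols, rows);                   // back to rows x cols
}
```

The second call to `smear_rows` still reads memory row by row, which is the point of transposing in between.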

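Lastly, what I actually do in matrix multiplication (and in Gaussian smearing) is not take exactly the transpose, but take the transpose in widths of a certain vector size (e.g. 4 or 8 for SSE/AVX). Here is the function I use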

    void reorder_matrix(const float* A, float* B, const int N, const int M, const int vec_size) {
        #pragma omp parallel for
        for(int n=0; n<M*N; n++) {
            int k = vec_size*(n/N/vec_size);
            int i = (n/vec_size)%N;
            int j = n%vec_size;
            B[n] = A[M*i + k + j];
        }
    }
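
To show the layout this produces, here is a sketch that writes the same reordering as three nested loops. The name `reorder_panels` is hypothetical, and the sketch assumes M is a multiple of `vec_size`. The N x M input is cut into vertical panels that are `vec_size` columns wide, and each panel is stored as N contiguous groups of `vec_size` floats.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reorder a row-major N x M matrix into panels of vec_size columns.
// Output position n = p*(N*vec_size) + i*vec_size + j holds A[i][p*vec_size + j],
// which matches the index arithmetic of the single-loop version.
std::vector<float> reorder_panels(const std::vector<float>& A, int N, int M, int vec_size) {
    assert(A.size() == static_cast<std::size_t>(N) * M);
    assert(vec_size > 0 && M % vec_size == 0);  // assumption of this sketch
    std::vector<float> B(A.size());
    std::size_t out = 0;
    for (int p = 0; p < M / vec_size; p++) {      // panel index
        for (int i = 0; i < N; i++) {             // row inside the panel
            for (int j = 0; j < vec_size; j++) {  // lane inside the vector
                B[out++] = A[M*i + p*vec_size + j];
            }
        }
    }
    return B;
}
```

For a 2x8 matrix holding 0..15 and `vec_size = 4`, the output order is 0 1 2 3, 8 9 10 11, 4 5 6 7, 12 13 14 15: the first panel (columns 0 to 3) of both rows, then the second panel.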

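Edit:

I tried several functions to find the fastest transpose for large matrices. In the end the fastest result was loop blocking with block_size=16 (Edit: I found a faster solution using SSE and loop blocking, see below). This code works for any NxM matrix (i.e. the matrix does not have to be square).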

    inline void transpose_scalar_block(float *A, float *B, const int lda, const int ldb, const int block_size) {
        #pragma omp parallel for
        for(int i=0; i<block_size; i++) {
            for(int j=0; j<block_size; j++) {
                B[j*ldb + i] = A[i*lda +j];
            }
        }
    }

    inline void transpose_block(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
        #pragma omp parallel for
        for(int i=0; i<n; i+=block_size) {
            for(int j=0; j<m; j+=block_size) {
                transpose_scalar_block(&A[i*lda +j], &B[j*ldb + i], lda, ldb, block_size);
            }
        }
    }

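The values lda and ldb are the widths (row strides) of the matrices. These need to be multiples of the block size. To find the values and allocate the memory for e.g. a 3000x1001 matrix I do something like this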

    #define ROUND_UP(x, s) (((x)+((s)-1)) & -(s))
    const int n = 3000;
    const int m = 1001;
    int lda = ROUND_UP(m, 16);
    int ldb = ROUND_UP(n, 16);

    float *A = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);
    float *B = (float*)_mm_malloc(sizeof(float)*lda*ldb, 64);

For 3000x1001 this returns ldb = 3008 and lda = 1008
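
As a check of that arithmetic, here is a small sketch with a `constexpr` equivalent of the macro and a blocked transpose that clamps the last block in each direction, so it only touches the real n x m elements even when the buffers are padded. The names `round_up` and `transpose_blocked` are hypothetical. Note that the `& -(s)` trick only rounds correctly when `s` is a power of two.

```cpp
#include <algorithm>
#include <cassert>

// Round x up to the next multiple of s. Like the ROUND_UP macro, this bit
// trick is only valid when s is a power of two.
constexpr int round_up(int x, int s) { return (x + (s - 1)) & -s; }

static_assert(round_up(1001, 16) == 1008, "lda for a 3000x1001 matrix");
static_assert(round_up(3000, 16) == 3008, "ldb for a 3000x1001 matrix");

// Blocked transpose of an n x m matrix src (row stride lda) into the
// m x n matrix dst (row stride ldb). Edge blocks are clamped to n and m.
void transpose_blocked(const float* src, float* dst,
                       int n, int m, int lda, int ldb, int block_size) {
    assert(lda >= m && ldb >= n && block_size > 0);
    for (int i = 0; i < n; i += block_size) {
        for (int j = 0; j < m; j += block_size) {
            const int i_end = std::min(i + block_size, n);
            const int j_end = std::min(j + block_size, m);
            for (int i2 = i; i2 < i_end; i2++) {
                for (int j2 = j; j2 < j_end; j2++) {
                    dst[j2*ldb + i2] = src[i2*lda + j2];
                }
            }
        }
    }
}
```

Clamping costs a couple of comparisons per block; the padded, unclamped version trades that for a little extra memory.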

Edit:

I found an even faster solution using SSE intrinsics:

    #include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _MM_TRANSPOSE4_PS

    // _mm_load_ps/_mm_store_ps require 16-byte aligned addresses, so A and B
    // must be aligned and lda/ldb must be multiples of 4 (as in the padded
    // allocation above).
    inline void transpose4x4_SSE(float *A, float *B, const int lda, const int ldb) {
        __m128 row1 = _mm_load_ps(&A[0*lda]);
        __m128 row2 = _mm_load_ps(&A[1*lda]);
        __m128 row3 = _mm_load_ps(&A[2*lda]);
        __m128 row4 = _mm_load_ps(&A[3*lda]);
        _MM_TRANSPOSE4_PS(row1, row2, row3, row4);
        _mm_store_ps(&B[0*ldb], row1);
        _mm_store_ps(&B[1*ldb], row2);
        _mm_store_ps(&B[2*ldb], row3);
        _mm_store_ps(&B[3*ldb], row4);
    }

    inline void transpose_block_SSE4x4(float *A, float *B, const int n, const int m, const int lda, const int ldb, const int block_size) {
        #pragma omp parallel for
        for(int i=0; i<n; i+=block_size) {
            for(int j=0; j<m; j+=block_size) {
                int max_i2 = i+block_size < n ? i + block_size : n;
                int max_j2 = j+block_size < m ? j + block_size : m;
                for(int i2=i; i2<max_i2; i2+=4) {
                    for(int j2=j; j2<max_j2; j2+=4) {
                        transpose4x4_SSE(&A[i2*lda +j2], &B[j2*ldb + i2], lda, ldb);
                    }
                }
            }
        }
    }

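This is going to depend on your application, but in general the fastest way to transpose a matrix is to invert your coordinates when you do a lookup; then you do not have to actually move any data.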

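Some details about transposing 4x4 square float matrices (I will discuss 32-bit integers later) on x86 hardware. It's helpful to start here in order to transpose larger square matrices such as 8x8 or 16x16.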

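_MM_TRANSPOSE4_PS(r0, r1, r2, r3) is implemented differently by different compilers. GCC and ICC (I have not checked Clang) use unpcklps, unpckhps, unpcklpd, unpckhpd, whereas MSVC uses only shufps. We can actually combine these two approaches like this.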

    t0 = _mm_unpacklo_ps(r0, r1);
    t1 = _mm_unpackhi_ps(r0, r1);
    t2 = _mm_unpacklo_ps(r2, r3);
    t3 = _mm_unpackhi_ps(r2, r3);

    r0 = _mm_shuffle_ps(t0,t2, 0x44);
    r1 = _mm_shuffle_ps(t0,t2, 0xEE);
    r2 = _mm_shuffle_ps(t1,t3, 0x44);
    r3 = _mm_shuffle_ps(t1,t3, 0xEE);

One interesting observation is that two shuffles can be converted to one shuffle and two blends (SSE4.1) like this.

    t0 = _mm_unpacklo_ps(r0, r1);
    t1 = _mm_unpackhi_ps(r0, r1);
    t2 = _mm_unpacklo_ps(r2, r3);
    t3 = _mm_unpackhi_ps(r2, r3);

    v  = _mm_shuffle_ps(t0,t2, 0x4E);
    r0 = _mm_blend_ps(t0,v, 0xC);
    r1 = _mm_blend_ps(t2,v, 0x3);
    v  = _mm_shuffle_ps(t1,t3, 0x4E);
    r2 = _mm_blend_ps(t1,v, 0xC);
    r3 = _mm_blend_ps(t3,v, 0x3);

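This effectively converts 4 shuffles into 2 shuffles and 4 blends. It uses 2 more instructions than the implementations of GCC, ICC, and MSVC. The advantage is that it reduces port pressure, which may be a benefit in some circumstances. Currently all the shuffles and unpacks can go to only one particular port, whereas the blends can go to either of two different ports.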
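
To follow the lane movement without SSE hardware, here is a portable scalar model of the two sequences. It is a sketch only: `Vec4`, `Mat4`, `unpacklo`, `unpackhi`, `shuffle`, `blend` and the two `transpose4x4_*` functions are hypothetical helpers that mimic the lane semantics of `_mm_unpacklo_ps`, `_mm_unpackhi_ps`, `_mm_shuffle_ps` and `_mm_blend_ps`. They are not the intrinsics themselves and say nothing about performance.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;

// Interleave the low (or high) halves of a and b.
Vec4 unpacklo(const Vec4& a, const Vec4& b) { return Vec4{{a[0], b[0], a[1], b[1]}}; }
Vec4 unpackhi(const Vec4& a, const Vec4& b) { return Vec4{{a[2], b[2], a[3], b[3]}}; }

// Lanes 0,1 are picked from a and lanes 2,3 from b, two selector bits per lane.
Vec4 shuffle(const Vec4& a, const Vec4& b, int imm) {
    assert(imm >= 0 && imm <= 0xFF);
    return Vec4{{a[imm & 3], a[(imm >> 2) & 3], b[(imm >> 4) & 3], b[(imm >> 6) & 3]}};
}

// Lane i comes from b when bit i of mask is set, otherwise from a.
Vec4 blend(const Vec4& a, const Vec4& b, int mask) {
    Vec4 r;
    for (int i = 0; i < 4; i++) r[i] = ((mask >> i) & 1) ? b[i] : a[i];
    return r;
}

// 4 unpacks + 4 shuffles.
void transpose4x4_shuffle(Mat4& r) {
    const Vec4 t0 = unpacklo(r[0], r[1]);
    const Vec4 t1 = unpackhi(r[0], r[1]);
    const Vec4 t2 = unpacklo(r[2], r[3]);
    const Vec4 t3 = unpackhi(r[2], r[3]);
    r[0] = shuffle(t0, t2, 0x44);
    r[1] = shuffle(t0, t2, 0xEE);
    r[2] = shuffle(t1, t3, 0x44);
    r[3] = shuffle(t1, t3, 0xEE);
}

// 4 unpacks + 2 shuffles + 4 blends.
void transpose4x4_blend(Mat4& r) {
    const Vec4 t0 = unpacklo(r[0], r[1]);
    const Vec4 t1 = unpackhi(r[0], r[1]);
    const Vec4 t2 = unpacklo(r[2], r[3]);
    const Vec4 t3 = unpackhi(r[2], r[3]);
    Vec4 v = shuffle(t0, t2, 0x4E);
    r[0] = blend(t0, v, 0xC);
    r[1] = blend(t2, v, 0x3);
    v = shuffle(t1, t3, 0x4E);
    r[2] = blend(t1, v, 0xC);
    r[3] = blend(t3, v, 0x3);
}
```

With rows a, b, c, d, the first unpack gives t0 = (a0, b0, a1, b1), and the 0x4E shuffle of t0 and t2 gives (a1, b1, c0, d0); each blend then takes two lanes from that vector and two from t0 or t2.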

I tried using 8 shuffles like MSVC and converting that into 4 shuffles + 8 blends, but it did not work. I still had to use 4 unpacks.

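I used this same technique for an 8x8 float transpose (see towards the end of that answer): https://stackoverflow.com/a/25627536/2542702. In that answer I still had to use 8 unpacks, but I managed to convert the 8 shuffles into 4 shuffles and 8 blends.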

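For 32-bit integers there is nothing like shufps (except for the 128-bit shuffles of AVX512), so it can only be implemented with unpacks, which I don't think can be converted to blends (efficiently). With AVX512, vshufi32x4 acts effectively like shufps, except on 128-bit lanes of 4 integers instead of 32-bit floats, so this same technique might be possible with vshufi32x4 in some cases. With Knights Landing, shuffles are four times slower (in throughput) than blends.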

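    #include <vector>

    // a has `width` rows and `height` columns; b must already be sized
    // as `height` rows by `width` columns. b is taken by reference so the
    // result is visible to the caller.
    template <class T>
    void transpose(const std::vector< std::vector<T> > &a,
                   std::vector< std::vector<T> > &b,
                   int width, int height)
    {
        for (int i = 0; i < width; i++)
        {
            for (int j = 0; j < height; j++)
            {
                b[j][i] = a[i][j];
            }
        }
    }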

Consider each row as a column and each column as a row: use j,i instead of i,j.

Demo: http://ideone.com/lvsxKZ

    #include <iostream>
    using namespace std;

    int main ()
    {
        char A [3][3] =
        {
            { 'a', 'b', 'c' },
            { 'd', 'e', 'f' },
            { 'g', 'h', 'i' }
        };

        cout << "A = " << endl << endl;

        // print matrix A
        for (int i=0; i<3; i++)
        {
            for (int j=0; j<3; j++) cout << A[i][j];
            cout << endl;
        }

        cout << endl << "A transpose = " << endl << endl;

        // print A transpose
        for (int i=0; i<3; i++)
        {
            for (int j=0; j<3; j++) cout << A[j][i];
            cout << endl;
        }

        return 0;
    }

Transposing without any overhead (class not complete):

    #include <utility>  // std::swap

    class Matrix {
        double *data;  // suppose this will point to data
        double _get1(int i, int j) { return data[i*M+j]; }  // used to access normally
        double _get2(int i, int j) { return data[j*N+i]; }  // used when transposed

    public:
        int M, N;  // dimensions
        double (Matrix::*get_p)(int, int);  // pointer to the member function used to access elements

        Matrix(int m, int n) : data(nullptr), M(m), N(n) {
            // allocate data
            get_p = &Matrix::_get1;  // initialised with normal access
        }

        double get(int i, int j) {
            // there should be a way to directly use get_p to call. but i think even this
            // doesnt incur overhead because it is inline and the compiler should be intelligent
            // enough to remove the extra call
            return (this->*get_p)(i, j);
        }

        void transpose() {  // twice transpose gives the original
            if (get_p == &Matrix::_get1) get_p = &Matrix::_get2;
            else get_p = &Matrix::_get1;
            std::swap(M, N);
        }
    };

It can be used like this:

    Matrix M(100,200);
    double x=M.get(17,45);
    M.transpose();
    x=M.get(17,45); // = original M(45,17)

Of course, I didn't bother with memory management here, which is crucial but a different topic.

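I think the fastest way should not take more than O(n^2), and for a square matrix it can also be done with just O(1) extra space:
the way to do that is to swap in pairs, because when you transpose a matrix what you do is M[i][j] = M[j][i]. So store M[i][j] in temp, then set M[i][j] = M[j][i], and as the last step set M[j][i] = temp. This can be done in one pass over the elements above the diagonal, so it should take O(n^2).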
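
A minimal sketch of that pairwise swap, assuming a square n x n matrix stored row-major in a `std::vector<float>` (the name `transpose_in_place` is hypothetical). Only the elements above the diagonal are visited, so each pair is swapped exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// In-place transpose of a square n x n row-major matrix using O(1) extra space.
void transpose_in_place(std::vector<float>& M, std::size_t n) {
    assert(M.size() == n * n);
    for (std::size_t i = 0; i < n; i++) {
        for (std::size_t j = i + 1; j < n; j++) {  // strictly above the diagonal
            std::swap(M[i*n + j], M[j*n + i]);
        }
    }
}
```

Starting the inner loop at `j = i + 1` matters: looping over every (i, j) would swap each pair twice and leave the matrix unchanged. This simple form does not carry over to non-square matrices, where an in-place transpose is considerably more involved.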

My answer transposes a 3x3 matrix:

    #include <iostream>
    using namespace std;

    int main()
    {
        int a[3][3];

        cout << "Enter a 3x3 array and we will print its transpose" << endl;

        // read the matrix
        for (int i = 0; i < 3; i++)
        {
            for (int j = 0; j < 3; j++)
            {
                cout << "Enter a[" << i << "][" << j << "]: ";
                cin >> a[i][j];
            }
        }

        cout << "Matrix you entered is :" << endl;
        for (int e = 0; e < 3; e++)
        {
            for (int f = 0; f < 3; f++)
                cout << a[e][f] << "\t";
            cout << endl;
        }

        // print with the indices swapped
        cout << "\nTranspose of the matrix you entered is :" << endl;
        for (int c = 0; c < 3; c++)
        {
            for (int d = 0; d < 3; d++)
                cout << a[d][c] << "\t";
            cout << endl;
        }

        return 0;
    }
