一个整数序列的最佳压缩algorithm

我有一个大范围的大多数连续的整数，例如1-100,110-160等所有整数都是正数。什么是最好的algorithm来压缩这个？

我试过放气algorithm，但是这只给了我50％的压缩。请注意，该algorithm不能是有损的。

所有的数字都是独一无二的，并逐渐增加

另外如果你能指点我这个algorithm的java实现，那将是很棒的。

我们已经写了最近的研究论文，调查这个问题的最佳scheme。请参见：

Daniel Lemire和Leonid Boytsov，通过vector化每秒解码数十亿个整数，软件：实践与经验45（1），2015. http://arxiv.org/abs/1209.2137

Daniel Lemire，Nathan Kurz，Leonid Boytsov，SIMD压缩和sorting整数的交集，软件：实践和经验（出现） http://arxiv.org/abs/1401.6399

它们包括广泛的实验评估。

您可以在线查找C ++ 11中所有技术的完整实现： https ： //github.com/lemire/FastPFor和https://github.com/lemire/SIMDCompressionAndIntersection

还有C库： https ： //github.com/lemire/simdcomp和https://github.com/lemire/MaskedVByte

如果您更喜欢Java，请参阅https://github.com/lemire/JavaFastPFOR

首先，通过获取每个值和前一个值之间的差值（对于第一个值，假设前一个值为零）来预处理值列表。这应该在你的情况下主要是一个序列，可以通过大多数压缩algorithm更容易地压缩。

这就是PNG格式如何改善其压缩（它执行gzip使用的相同压缩algorithm之后的几种差异方法之一）。

那么，我正在为更聪明的方式投票。在这种情况下，你必须存储的是[int：startnumber] [int / byte / whatever：迭代次数]，你将把你的示例数组变成4xInt的值。之后你可以压缩，只要你想:)

虽然您可以devise特定于您的数据stream的自定义algorithm，但使用现成的编码algorithm可能更容易。我在Java中运行了一些可用的压缩algorithmtesting，发现一系列连续整数的压缩率如下：

None 1.0 Deflate 0.50 Filtered 0.34 BZip2 0.11 Lzma 0.06

数字是多less？除了其他的答案之外，你可以考虑base-128变长编码，它允许你在单个字节中存储较小的数字，同时仍然允许更大的数字。 MSB的意思是“有另一个字节” – 这里描述。

结合这一点与其他技术，所以你正在存储“跳过大小”，“采取大小”，“跳过大小”，“采取大小” – 但要注意的是既不“跳过”也不是“采取”将是零，所以我们从每个中减去一个（这可以让您为less数值保存一个额外的字节）

所以：

 1-100, 110-160

是“跳过1”（假设从零开始，因为它使事情变得更容易），“取100”，“跳过9”，“取51”; 从每个中减1，给（作为小数）

 0,99,8,50

编码为（hex）：

 00 63 08 32

如果我们想跳过/取一个更大的数字 – 例如300; 我们减1给299 – 但超过7位; 从小端开始，我们对7位块和MSB进行编码以指示连续：

 299 = 100101100 = (in blocks of 7): 0000010 0101100

所以从小端开始：

 1 0101100 (leading one since continuation) 0 0000010 (leading zero as no more)

赠送：

 AC 02

所以我们可以很容易地对大数字进行编码，但是小数字（这通常用于跳过/取出）占用更less的空间。

你可以尝试通过“放气”来运行，但它可能没有什么帮助。

如果你不想处理所有那些混乱的编码cruff你自己…如果你可以创build值的整数数组（0,99,8,60） – 你可以使用协议缓冲区打包重复uint32 / uint64 – 它会为你做所有的工作; – p

我不“做”Java，但是这是一个完整的C＃实现（借用protobuf-net项目中的一些编码位）：

 using System; using System.Collections.Generic; using System.IO; using System.Linq; static class Program { static void Main() { var data = new List<int>(); data.AddRange(Enumerable.Range(1, 100)); data.AddRange(Enumerable.Range(110, 51)); int[] arr = data.ToArray(), arr2; using (MemoryStream ms = new MemoryStream()) { Encode(ms, arr); ShowRaw(ms.GetBuffer(), (int)ms.Length); ms.Position = 0; // rewind to read it... arr2 = Decode(ms); } } static void ShowRaw(byte[] buffer, int len) { for (int i = 0; i < len; i++) { Console.Write(buffer[i].ToString("X2")); } Console.WriteLine(); } static int[] Decode(Stream stream) { var list = new List<int>(); uint skip, take; int last = 0; while (TryDecodeUInt32(stream, out skip) && TryDecodeUInt32(stream, out take)) { last += (int)skip+1; for(uint i = 0 ; i <= take ; i++) { list.Add(last++); } } return list.ToArray(); } static int Encode(Stream stream, int[] data) { if (data.Length == 0) return 0; byte[] buffer = new byte[10]; int last = -1, len = 0; for (int i = 0; i < data.Length; ) { int gap = data[i] - 2 - last, size = 0; while (++i < data.Length && data[i] == data[i - 1] + 1) size++; last = data[i - 1]; len += EncodeUInt32((uint)gap, buffer, stream) + EncodeUInt32((uint)size, buffer, stream); } return len; } public static int EncodeUInt32(uint value, byte[] buffer, Stream stream) { int count = 0, index = 0; do { buffer[index++] = (byte)((value & 0x7F) | 0x80); value >>= 7; count++; } while (value != 0); buffer[index - 1] &= 0x7F; stream.Write(buffer, 0, count); return count; } public static bool TryDecodeUInt32(Stream source, out uint value) { int b = source.ReadByte(); if (b < 0) { value = 0; return false; } if ((b & 0x80) == 0) { // single-byte value = (uint)b; return true; } int shift = 7; value = (uint)(b & 0x7F); bool keepGoing; int i = 0; do { b = source.ReadByte(); if (b < 0) throw new EndOfStreamException(); i++; keepGoing = (b & 0x80) != 0; value |= ((uint)(b & 0x7F)) << shift; shift += 7; } while (keepGoing && i < 4); if (keepGoing && i == 4) { throw new OverflowException(); } return true; } }

压缩string“1-100,110-160”或以某种二进制表示forms存储该string，并parsing该string以恢复该数组

除了其他解决scheme：

您可以find“密集”区域并使用位图来存储它们。

举个例子：

如果在1000-3000之间的400个范围内有1000个数字，则可以使用一位表示存在一个数字和两个整数表示范围。这个范围的总存储量是2000位+2整数，所以你可以将这些信息存储在254字节中，这是非常棒的，因为即使是短整数也会占用两个字节，所以对于这个例子你可以节省7倍的存储空间。

越密集的地区这个algorithm会做得越好，但在某些时候只是存储开始和结束将会更便宜。

我将结合CesarB和FernandoMiguélez给出的答案。

首先，存储每个值和前一个值之间的差异。正如CesarB指出的，这会给你一个主要的序列。

然后，在这个序列上使用一个运行长度编码压缩algorithm。由于大量的重复值，它会很好地压缩。

我build议看一看算术编码的特例Huffman Coding 。在这两种情况下，你分析你的开始序列，以确定不同值的相对频率。更频繁出现的值比较不频繁出现的值用较less的位编码。

你应该使用的基本思想是，对于连续整数的每个范围（我将调用这些范围）来存储起始数字和范围的大小。例如，如果您有1000个整数的列表，但是只有10个单独的范围，则可以存储仅20个整数（每个范围1个起始数字和1个大小）来表示这个数据，这个数据的压缩率为98 ％。幸运的是，还可以进行一些更多的优化，这将有助于范围数量更大的情况。

存储起始号码相对于前一个起始号码的偏移量 ，而不是起始号码本身。这样做的好处是您存储的数字通常需要更less的位（这可能会在稍后的优化build议中派上用场）。另外，如果只存储了起始数字，这些数字将全部是唯一的，而存储偏移量则有可能使得数字接近甚至重复，这可能允许使用另一种方法进一步压缩。
对于这两种types的整数， 尽可能使用最less的位数 。您可以迭代数字来获得起始整数的最大偏移量以及最大范围的大小。然后，您可以使用最有效地存储这些整数的数据types，并简单地指定压缩数据开始时的数据types或位数。例如，如果起始整数的最大偏移量仅为12,000，最大范围为9,000，则可以使用2个字节的无符号整数。然后，您可以在压缩数据的开头填塞2,2对，以显示两个整数都使用了2个字节。当然，您可以使用一点点操作将这些信息合并到单个字节中。如果您愿意进行大量的重复位操作，则可以将每个数字存储为尽可能less的位数，而不是符合1,2,4或8字节的表示forms。

有了这两个优化，我们来看几个例子（每个是4,000字节）：

1000个整数，最大偏移量是500,10个范围
1,000个整数，最大偏移量是100,50个范围
1,000个整数，最大偏移量是50,100个范围

没有优化

20个整数，每个4个字节= 80个字节。压缩率= 98％
100个整数，每个4个字节= 400个字节。压缩率= 90％
200个整数，每个4个字节= 800个字节。压缩率= 80％

优化

1个字节头+ 20个数字，每个1个字节= 21个字节。压缩率= 99.475％
1个字节标题+100个数字，每个1个字节= 101个字节。 COMPRESSION = 97.475％
1个字节头+ 200个数字，每个1个字节= 201个字节。 COMPRESSION = 94.975％

您的情况与search引擎中的索引压缩非常相似。常用的压缩algorithm是PForDeltaalgorithm和Simple16algorithm。您可以使用kamikaze库来满足您的压缩需求。

如果你有一系列重复的值，RLE是最容易实现的，可以给你一个好的结果。不过，考虑到LZW这个现在没有专利的其他更先进的algorithm，通常可以获得更好的压缩效果。

你可以在这里看看这些和其他无损algorithm。

我知道这是一个旧的消息线程，但我包括我在这里find的SKIP / TAKE想法的个人PHPtesting。我正在呼叫我的STEP（+）/ SPAN（ – ）。也许有人会觉得有帮助。

注：我实现了允许重复整数以及负整数的能力，即使原始问题涉及正整数，非重复整数。随意调整它，如果你想尝试和削减一个或两个字节。

码：

  // $integers_array can contain any integers; no floating point, please. Duplicates okay. $integers_array = [118, 68, -9, 82, 67, -36, 15, 27, 26, 138, 45, 121, 72, 63, 73, -35, 68, 46, 37, -28, -12, 42, 101, 21, 35, 100, 44, 13, 125, 142, 36, 88, 113, -40, 40, -25, 116, -21, 123, -10, 43, 130, 7, 39, 69, 102, 24, 75, 64, 127, 109, 38, 41, -23, 21, -21, 101, 138, 51, 4, 93, -29, -13]; // Order from least to greatest... This routine does NOT save original order of integers. sort($integers_array, SORT_NUMERIC); // Start with the least value... NOTE: This removes the first value from the array. $start = $current = array_shift($integers_array); // This caps the end of the array, so we can easily get the last step or span value. array_push($integers_array, $start - 1); // Create the compressed array... $compressed_array = [$start]; foreach ($integers_array as $next_value) { // Range of $current to $next_value is our "skip" range. I call it a "step" instead. $step = $next_value - $current; if ($step == 1) { // Took a single step, wait to find the end of a series of seqential numbers. $current = $next_value; } else { // Range of $start to $current is our "take" range. I call it a "span" instead. $span = $current - $start; // If $span is positive, use "negative" to identify these as sequential numbers. if ($span > 0) array_push($compressed_array, -$span); // If $step is positive, move forward. If $step is zero, the number is duplicate. if ($step >= 0) array_push($compressed_array, $step); // In any case, we are resetting our start of potentialy sequential numbers. $start = $current = $next_value; } } // OPTIONAL: The following code attempts to compress things further in a variety of ways. // A quick check to see what pack size we can use. $largest_integer = max(max($compressed_array),-min($compressed_array)); if ($largest_integer < pow(2,7)) $pack_size = 'c'; elseif ($largest_integer < pow(2,15)) $pack_size = 's'; elseif ($largest_integer < pow(2,31)) $pack_size = 'l'; elseif ($largest_integer < pow(2,63)) $pack_size = 'q'; else die('Too freaking large, try something else!'); // NOTE: I did not implement the MSB feature mentioned by Marc Gravell. // I'm just pre-pending the $pack_size as the first byte, so I know how to unpack it. $packed_string = $pack_size; // Save compressed array to compressed string and binary packed string. $compressed_string = ''; foreach ($compressed_array as $value) { $compressed_string .= ($value < 0) ? $value : '+'.$value; $packed_string .= pack($pack_size, $value); } // We can possibly compress it more with gzip if there are lots of similar values. $gz_string = gzcompress($packed_string); // These were all just size tests I left in for you. $base64_string = base64_encode($packed_string); $gz64_string = base64_encode($gz_string); $compressed_string = trim($compressed_string,'+'); // Don't need leading '+'. echo "<hr>\nOriginal Array has " .count($integers_array) .' elements: {not showing, since I modified the original array directly}'; echo "<br>\nCompressed Array has " .count($compressed_array).' elements: ' .implode(', ',$compressed_array); echo "<br>\nCompressed String has " .strlen($compressed_string).' characters: ' .$compressed_string; echo "<br>\nPacked String has " .strlen($packed_string).' (some probably not printable) characters: ' .$packed_string; echo "<br>\nBase64 String has " .strlen($base64_string).' (all printable) characters: ' .$base64_string; echo "<br>\nGZipped String has " .strlen($gz_string).' (some probably not printable) characters: ' .$gz_string; echo "<br>\nBase64 of GZipped String has " .strlen($gz64_string).' (all printable) characters: ' .$gz64_string; // NOTICE: The following code reverses the process, starting form the $compressed_array. // The first value is always the starting value. $current_value = array_shift($compressed_array); $uncompressed_array = [$current_value]; foreach ($compressed_array as $val) { if ($val < -1) { // For ranges that span more than two values, we have to fill in the values. $range = range($current_value + 1, $current_value - $val - 1); $uncompressed_array = array_merge($uncompressed_array, $range); } // Add the step value to the $current_value $current_value += abs($val); // Add the newly-determined $current_value to our list. If $val==0, it is a repeat! array_push($uncompressed_array, $current_value); } // Display the uncompressed array. echo "<hr>Reconstituted Array has " .count($uncompressed_array).' elements: ' .implode(', ',$uncompressed_array). '<hr>';

OUTPUT：

 -------------------------------------------------------------------------------- Original Array has 63 elements: {not showing, since I modified the original array directly} Compressed Array has 53 elements: -40, 4, -1, 6, -1, 3, 2, 2, 0, 8, -1, 2, -1, 13, 3, 6, 2, 6, 0, 3, 2, -1, 8, -11, 5, 12, -1, 3, -1, 0, -1, 3, -1, 2, 7, 6, 5, 7, -1, 0, -1, 7, 4, 3, 2, 3, 2, 2, 2, 3, 8, 0, 4 Compressed String has 110 characters: -40+4-1+6-1+3+2+2+0+8-1+2-1+13+3+6+2+6+0+3+2-1+8-11+5+12-1+3-1+0-1+3-1+2+7+6+5+7-1+0-1+7+4+3+2+3+2+2+2+3+8+0+4 Packed String has 54 (some probably not printable) characters: cØÿÿÿÿ ÿõ ÿÿÿÿÿÿ Base64 String has 72 (all printable) characters: Y9gE/wb/AwICAAj/Av8NAwYCBgADAv8I9QUM/wP/AP8D/wIHBgUH/wD/BwQDAgMCAgIDCAAE GZipped String has 53 (some probably not printable) characters: xœ Ê» ÑÈÎ¯€)YšE¨MŠ“^qçºR¬m&Òõ‹%Ê&TFÊ‰ùÀ6ÿÁÁ Æ Base64 of GZipped String has 72 (all printable) characters: eJwNyrsNACAMA9HIzq+AKVmaRahNipNecee6UgSsBW0m0gj1iyXKJlRGjcqJ+cA2/8HBDcY= -------------------------------------------------------------------------------- Reconstituted Array has 63 elements: -40, -36, -35, -29, -28, -25, -23, -21, -21, -13, -12, -10, -9, 4, 7, 13, 15, 21, 21, 24, 26, 27, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 51, 63, 64, 67, 68, 68, 69, 72, 73, 75, 82, 88, 93, 100, 101, 101, 102, 109, 113, 116, 118, 121, 123, 125, 127, 130, 138, 138, 142 --------------------------------------------------------------------------------

TurboPFor：最快的整数压缩

适用于C / C ++，包括Java Critical Natives / JNI Interface
SIMD加速整数压缩
对已sorting/未sorting的整数列表进行标量+集成（SIMD）差分/锯齿形编码/解码
全范围8/16/32/64位整数列表
直接访问
基准应用程序

我无法让我的压缩比这更好。我通过python解释器生成了我的testing数据，它是从1-100和110-160的换行符分隔的整数列表。我使用实际的程序作为数据的压缩表示。我的压缩文件如下：

 main=mapM_ print [x|x<-[1..160],x`notElem`[101..109]]

这只是一个Haskell脚本，可以生成可以运行的文件：

 $ runhaskell generator.hs >> data

g.hs文件的大小是54字节，python生成的数据是496字节。这给出了0.10887096774193548作为压缩率。我想有更多的时间可以缩小程序，或者你可以压缩压缩文件（即haskell文件）。

另一种方法可能是保存4个字节的数据。每个序列的最小值和最大值，然后给那些生成函数。虽然，文件的加载将增加更多字符到解压缩器，给解压缩器增加更多的复杂性和更多的字节。再次，我通过一个程序来表示这个非常具体的序列，它并没有概括，它是一个特定于数据的压缩。而且，增加通用性使得解压缩器更大。

另一个问题是，必须有Haskell解释器来运行这个。当我编译这个程序时，它变得更大了。我不知道为什么。 python也有同样的问题，所以也许最好的办法是给这个范围，这样一些程序就可以解压文件了。

一个整数序列的最佳压缩algorithm

Gzip与缩小

我怎样才能自动压缩和最小化ASP.NET MVC应用程序中的JavaScript文件？

启用IIS7 gzip

阻止人们通过表单上传恶意PHP文件

一个有效的短文本string压缩algorithm

使用php下载多个文件作为zip文件

如何使用与.Net兼容的GZIPOutputStream来压缩和解压缩string？

如何以编程方式下载Java中的网页

在缩小过程中排除debuggingJavaScript代码

C＃中的Zip文件夹