Find an integer not among 4 billion given integers

This is an interview question:

Given an input file with 4 billion integers, provide an algorithm to generate an integer that is not contained in the file. Assume you have 1 GB of memory. Follow up with what you would do if you only have 10 MB of memory.

My analysis:

The size of the file is 4 × 10^9 × 4 bytes = 16 GB.

We could do an external sort, so that we know the range of the integers. My question is: what is the best way to detect a missing integer in a sorted, very large integer set?

My understanding (after reading all the answers):

Assuming we are talking about 32-bit integers, there are 2^32 ≈ 4 × 10^9 distinct integers.

Case 1: We have 1 GB = 10^9 × 8 bits = 8 billion bits of memory. Solution: one bit representing one distinct integer is enough. We don't need sorting. Implementation:

    int radix = 8;
    byte[] bitfield = new byte[1 << 29]; // 2^32 bits / 8 = 512 MB (the original 0xffffffff/radix overflows in Java)

    void f() throws FileNotFoundException {
        Scanner in = new Scanner(new FileReader("a.txt"));
        while (in.hasNextInt()) {
            int n = in.nextInt(); // assumes non-negative input values
            bitfield[n / radix] |= (1 << (n % radix));
        }
        for (int i = 0; i < bitfield.length; i++) {
            for (int j = 0; j < radix; j++) {
                if ((bitfield[i] & (1 << j)) == 0)
                    System.out.println((long) i * radix + j);
            }
        }
    }

Case 2: We have 10 MB of memory = 10 × 10^6 × 8 bits = 80 million bits.

Solution: For all possible 16-bit prefixes, there are 2^16 = 65536 buckets, and their counters need 2^16 × 4 bytes × 8 = 2 million bits (256 KB). Each bucket's counter needs 4 bytes, because in the worst case all 4 billion integers fall into the same bucket.

  1. Build the counter for each bucket in a first pass through the file.

  2. Scan the counters and find the first bucket with fewer than 65536 hits.

  3. In a second pass through the file, build a bitmap of the numbers whose high 16-bit prefix matches the one found in step 2.

  4. Scan the bitmap built in step 3 and find the first bit that was never set.

The code is very similar to the one above.
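The two-pass scheme above can be sketched in Python (the `read_ints` callable is a hypothetical stand-in for scanning the file; it is invoked once per pass):

```python
def find_missing(read_ints):
    """Two passes over a stream of unsigned 32-bit integers.

    Pass 1: count how many inputs fall under each 16-bit prefix.
    Pass 2: bitmap the low 16 bits within one non-full bucket.
    """
    counts = [0] * 65536                      # 2^16 counters, 256 KB
    for n in read_ints():
        counts[n >> 16] += 1

    # Any bucket with fewer than 2^16 hits is missing at least one value.
    prefix = next(p for p, c in enumerate(counts) if c < 65536)

    seen = bytearray(8192)                    # 2^16 bits = 8 KB
    for n in read_ints():
        if n >> 16 == prefix:
            low = n & 0xFFFF
            seen[low >> 3] |= 1 << (low & 7)

    for low in range(65536):
        if not (seen[low >> 3] >> (low & 7)) & 1:
            return (prefix << 16) | low

# Toy input: every value whose low byte is zero is absent.
nums = [n for n in range(200000) if n & 0xFF]
missing = find_missing(lambda: iter(nums))
```

At full scale the counters stay at 256 KB and the bitmap at 8 KB, comfortably inside the 10 MB budget.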

Conclusion: we trade additional passes over the file for reduced memory use.


Clarification for latecomers: the question, as asked, does not say that exactly one integer is missing from the file; at least that's not how most people interpret it. Many comments in the comment thread are about that variation of the task, though. Unfortunately, the comment that introduced it to the thread was later deleted by its author, so now it looks as if the orphaned replies to it just misunderstood everything. It's very confusing. Sorry.

Assuming that "integer" means 32 bits: 10 MB of space is more than enough for you to count, in a single pass through the input file, how many numbers there are with any given 16-bit prefix, for all possible 16-bit prefixes. At least one of the buckets will have been hit fewer than 2^16 times. Do a second pass to find which of the possible numbers in that bucket are already used.

If it means more than 32 bits, but still of bounded size: do as above, ignoring all input numbers that happen to fall outside the (signed or unsigned, your choice) 32-bit range.

If "integer" means mathematical integer: read through the input once and keep track of the digit length of the longest number you've ever seen. When you're done, output a random number with one more digit. (One of the numbers in the file may be a bignum that takes more than 10 MB to represent exactly, but if the input is a file, then you can at least represent the length of anything that fits in it.)
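As a toy illustration of that digit-length trick (assuming, hypothetically, one decimal integer per line of input):

```python
import random

def longer_than_any(lines):
    # One pass: remember the most digits seen in any entry.
    longest = max((len(line.strip().lstrip('-')) for line in lines), default=0)
    # A random number with one more digit cannot equal any entry.
    first = str(random.randint(1, 9))
    rest = ''.join(str(random.randint(0, 9)) for _ in range(longest))
    return int(first + rest)

entries = ["5", "123456", "-40"]
n = longer_than_any(entries)   # a 7-digit number, so not among the entries
```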

Statistically informed algorithms solve this problem using fewer passes than deterministic approaches.

If very large integers are allowed, then one can generate a number that is likely to be unique in O(1) time. A pseudo-random 128-bit integer such as a GUID will collide with one of the existing four billion integers in fewer than one out of every 64 billion billion billion cases.

If integers are limited to 32 bits, then one can generate a number that is likely to be unique in a single pass using far less than 10 MB. The odds that a pseudo-random 32-bit integer would collide with one of the 4 billion existing integers are about 93% (4 × 10^9 / 2^32). The odds that 1000 pseudo-random integers would all collide are less than one in 12,000 billion billion billion (the odds of one collision ^ 1000). So if a program maintains a data structure containing 1000 pseudo-random candidates and iterates through the known integers, eliminating matches from the candidates, it is all but certain to find at least one integer that is not in the file.
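A sketch of that candidate-elimination idea (names are illustrative; the toy stream stands in for the file):

```python
import random

def find_unused(stream, k=1000):
    # Keep k random 32-bit candidates; discard any that appear in the stream.
    candidates = {random.getrandbits(32) for _ in range(k)}
    for n in stream:
        candidates.discard(n)
    if not candidates:
        # Astronomically unlikely with k = 1000; retry with fresh candidates.
        raise RuntimeError("all candidates collided")
    return next(iter(candidates))

# Any survivor is provably absent from the stream it was checked against.
missing = find_unused(range(100000))
```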

A detailed discussion of this problem can be found in Jon Bentley's "Column 1. Cracking the Oyster", Programming Pearls, Addison-Wesley, pp. 3-10.

Bentley discusses several approaches, including external sort, merge sort using several external files, and so on. But the best method Bentley suggests is a single-pass algorithm using bit fields, which he humorously calls "Wonder Sort" :) Coming to the problem, 4 billion numbers can be represented in:

 4 billion bits = (4000000000 / 8) bytes = about 0.466 GB 

The code to implement the bitset is simple (taken from the solutions page):

    #define BITSPERWORD 32
    #define SHIFT 5
    #define MASK 0x1F
    #define N 10000000

    int a[1 + N/BITSPERWORD];

    void set(int i) { a[i>>SHIFT] |=  (1 << (i & MASK)); }
    void clr(int i) { a[i>>SHIFT] &= ~(1 << (i & MASK)); }
    int  test(int i){ return a[i>>SHIFT] &  (1 << (i & MASK)); }

Bentley's algorithm makes a single pass over the file, setting the appropriate bit in the array, and then examines the array with the test macro above to find the missing number.

If the available memory is less than 0.466 GB, Bentley suggests a k-pass algorithm, which divides the input into ranges depending on the available memory. To take a very simple example, if only 1 byte (i.e., memory to handle 8 numbers) were available and the range were from 0 to 31, we would divide it into the ranges 0 to 7, 8-15, 16-23, 24-31 and handle each range in one of 32/8 = 4 passes.
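Bentley's k-pass idea at exactly that toy scale (one "byte" of bitmap, range 0 to 31, 4 passes) might look like:

```python
def k_pass_missing(nums, range_size=32, mem_bits=8):
    # One pass per sub-range; only mem_bits of bitmap live at a time.
    for lo in range(0, range_size, mem_bits):
        seen = [False] * mem_bits        # the single "byte" of memory
        for n in nums:                   # full scan of the input per pass
            if lo <= n < lo + mem_bits:
                seen[n - lo] = True
        for offset, hit in enumerate(seen):
            if not hit:
                return lo + offset
    return None                          # every value in range is present

m = k_pass_missing([n for n in range(32) if n != 13])
```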

HTH。

Since the problem does not specify that we have to find the smallest possible number that is not in the file, we could just generate a number that is longer than the input file itself. :)

For the 1 GB RAM variant you can use a bit vector. You need to allocate 4 billion bits == a 500 MB byte array. For each number you read from the input, set the corresponding bit to '1'. Once you're done, iterate over the bits and find the first one that is still '0'. Its index is the answer.

If they are 32-bit integers (likely, given the choice of ~4 billion numbers close to 2^32), your list of 4 billion numbers will take up at most 93% of the possible integers (4 × 10^9 / 2^32). So if you create a bit array of 2^32 bits with each bit initialized to zero (which will take up 2^29 bytes ~ 500 MB of RAM; remember a byte = 2^3 bits = 8 bits), read through your integer list, and for each int set the corresponding bit-array element from 0 to 1; then read through your bit array and return the first bit that's still 0.

In the case where you have less memory (~10 MB), this solution needs to be slightly modified. 10 MB ~ 83886080 bits is still enough to do a bit array for all numbers between 0 and 83886079. So you can read through your list of ints, recording only the numbers between 0 and 83886079 in your bit array. If the numbers are randomly distributed, then with overwhelming likelihood (differing from 100% by about 10^-2592069) you will find a missing int. In fact, if you only choose the numbers 1 to 2048 (with only 256 bytes of RAM) you'd still find a missing number an overwhelming percentage (99.99999999999999999999999999999999999999999999999999999999999995%) of the time.

But let's say that instead of having about 4 billion numbers, you had 2^32 - 1 numbers and less than 10 MB of RAM; then any small range of ints would have only a small chance of not containing your number.

If every int in the list were guaranteed to be unique, you could sum the numbers and subtract the sum from the full sum of all ints, (1/2)(2^32)(2^32 - 1) = 9223372034707292160, to find the missing int. However, if an int occurred twice, this method would fail.
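The same sum trick at a reduced scale (for the 32-bit case, n would be 2^32 and the full sum the 9223372034707292160 above):

```python
def missing_by_sum(nums, n):
    # Exactly one value in [0, n) is absent and the rest are unique:
    # the gap is the full sum 0 + 1 + ... + (n - 1) minus the observed sum.
    return n * (n - 1) // 2 - sum(nums)

m = missing_by_sum([i for i in range(1000) if i != 321], 1000)
```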

However, you can always divide and conquer. A naive method would be to read through the array and count the number of numbers that are in the first half (0 to 2^31 - 1) and the second half (2^31 to 2^32). Then pick the range with fewer numbers and repeat, dividing that range in half. (Say if there were two fewer numbers in (2^31, 2^32), then your next search would count the numbers in the ranges (2^31, 3*2^30 - 1) and (3*2^30, 2^32).) Keep repeating until you find a range with zero numbers, and you have your answer. Should take O(lg N) ~ 32 reads through the array.

That method is inefficient. We are only using two integers in each step (or about 8 bytes of RAM with 4-byte (32-bit) integers). A better method would be to divide into sqrt(2^32) = 2^16 = 65536 bins, with 65536 numbers per bin. Each bin requires 4 bytes to store its count, so you need 2^18 bytes = 256 kB. So bin 0 is (0 to 65535 = 2^16 - 1), bin 1 is (2^16 = 65536 to 2*2^16 - 1 = 131071), bin 2 is (2*2^16 = 131072 to 3*2^16 - 1 = 196607). In Python you'd have something like:

    import numpy as np
    nums_in_bin = np.zeros(65536, dtype=np.uint32)
    for N in four_billion_int_array:
        nums_in_bin[N // 65536] += 1
    for bin_num, bin_count in enumerate(nums_in_bin):
        if bin_count < 65536:
            break  # we have found an incomplete bin with missing ints (bin_num)

Read through the ~4 billion integer list, count how many ints fall in each of the 2^16 bins, and find an incomplete bin that doesn't have all 65536 numbers. Then you read through the 4 billion integer list again, but this time only take notice when integers are in that range, flipping a bit when you find them.

    del nums_in_bin  # allow gc to free old 256 kB array
    from bitarray import bitarray
    my_bit_array = bitarray(65536)  # 32 kB
    my_bit_array.setall(0)
    for N in four_billion_int_array:
        if N // 65536 == bin_num:
            my_bit_array[N % 65536] = 1
    for i, bit in enumerate(my_bit_array):
        if not bit:
            print(bin_num * 65536 + i)
            break

Why make it so complicated? You're asked for an integer not present in the file.

Under the rules as stated, the only thing you need to store is the largest integer that you have encountered in the file so far. Once the entire file has been read, return a number 1 greater than that.

There is no risk of hitting maxint or anything, because per the rules there is no constraint on the size of the integer or the number the algorithm returns.

This can be solved in very little space using a variant of binary search.

  1. Start off with the allowed range of numbers, 0 to 4294967295.

  2. Calculate the midpoint.

  3. Loop through the file, counting how many numbers were equal to, less than, or higher than the midpoint value.

  4. If no numbers were equal, you're done. The midpoint number is the answer.

  5. Otherwise, choose the range that had the fewest numbers and repeat from step 2 with this new range.

This will require up to 32 linear scans through the file, but it will only use a few bytes of memory for storing the range and the counts.
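A sketch of that bisection loop (assuming distinct inputs: with duplicates a half-range can look full while still missing a value):

```python
def missing_by_bisection(nums, lo, hi):
    # Each iteration is one linear scan; O(log range) scans, O(1) memory.
    while lo < hi:
        mid = (lo + hi) // 2
        below = sum(1 for n in nums if lo <= n <= mid)
        if below < mid - lo + 1:   # lower half is not full: a gap is there
            hi = mid
        else:
            lo = mid + 1
    return lo

m = missing_by_bisection([n for n in range(100) if n != 42], 0, 99)
```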

This is basically the same as Henning's solution, except it uses two bins instead of 16k.

EDIT: OK, this wasn't entirely thought through, as it assumes the integers in the file follow some static distribution. Apparently they don't need to, but even then one should try this:


There are about 4.3 billion 32-bit integers. We don't know how they are distributed in the file, but the worst case is the one with the highest Shannon entropy: an equal distribution. In that case the probability of any one integer not occurring in the file is

((2^32 - 1) / 2^32)^4000000000 ≈ 0.4

The lower the Shannon entropy, the higher this probability is on average, but even in this worst case we have a 90% chance of finding a non-occurring number after 5 guesses with random integers. Just create such numbers with a pseudo-random generator and store them in a list. Then read int after int and compare it with all of your guesses. When there's a match, remove that list entry. After having gone through all of the file, chances are you will have more than one guess left; use any of them. In the rare (10% even in the worst case) event that no guess remains, get a new set of random integers, perhaps more this time (10 → 99%).

Memory consumption: a few dozen bytes. Complexity: O(n). Overhead: tolerable, since most of the time will be spent on the unavoidable hard disk accesses rather than on comparing ints anyway.


Actually, the worst case, when we do not assume a static distribution, is that every integer occurs at most once, because then only 1 - 4000000000/2^32 ≈ 6% of all integers don't occur in the file. So you'll need some more guesses, but that still won't cost hurtful amounts of memory.
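The quoted probabilities are easy to check numerically:

```python
# Chance a single random 32-bit guess misses all 4e9 distinct file entries:
p_free = (1 - 1 / 2**32) ** 4_000_000_000      # worst case, ~0.39
# Chance that at least one of 5 independent guesses survives:
p_any_of_5 = 1 - (1 - p_free) ** 5             # ~0.92, i.e. the "90% chance"
```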

If you have one integer missing from the range [0, 2^x - 1], then just xor them all together. For example:

    >>> 0 ^ 1 ^ 3
    2
    >>> 0 ^ 1 ^ 2 ^ 3 ^ 4 ^ 6 ^ 7
    5

(I know this doesn't answer the question exactly, but it's a good answer to a very similar question.)

Based on the current wording of the original question, the simplest solution is:

Find the maximum value in the file, then add 1 to it.

They may be checking whether you have heard of a probabilistic Bloom filter, which can very efficiently determine absolutely that a value is not part of a large set (but can only determine with high probability that it is a member of the set).

Use a BitSet. 4 billion integers (assuming up to 2^32 integers) packed into a BitSet at 8 per byte is 2^32 / 2^3 = 2^29 bytes, or about 0.5 GB.

To add a bit more detail: every time you read a number, set the corresponding bit in the BitSet. Then do a pass over the BitSet to find the first number that's absent. In fact, you could do this just as effectively by repeatedly picking a random number and testing whether it's present.

Actually, BitSet.nextClearBit(0) will tell you the first non-set bit.

Looking at the BitSet API, it appears to only support 0..MAX_INT, so you may need two BitSets, one for positive numbers and one for negative numbers, but the memory requirements don't change.

If there is no size limit, the quickest way is to take the length of the file and generate a number of (length of the file + 1) random digits (or just "11111..."). Advantage: you don't even need to read the file, and you can minimize memory usage to almost zero. Disadvantage: you will print billions of digits.

However, if the only factor were minimizing memory usage and nothing else mattered, this would be the optimal solution. It might even get you a "worst abuse of the rules" award.

If we assume that the range of numbers will always be 2^n (an even power of 2), then exclusive-or will work (as shown by another poster). As far as why, let's prove it:

The Theory

Given any 0-based range of integers that has 2^n elements with one element missing, you can find the missing element by simply xor-ing the known values together to yield the missing number.

The Proof

Let's look at n = 2. For n = 2, we can represent 4 unique integers: 0, 1, 2, 3. They have a bit pattern of:

  • 0 – 00
  • 1 – 01
  • 2 – 10
  • 3 – 11

Now, if we look, each and every bit is set exactly twice. Therefore, since each is set an even number of times, the exclusive-or of all the numbers will yield 0. If a single number is missing, the exclusive-or will yield a number that, when exclusive-ored with the missing number, results in 0. Therefore, the missing number and the resulting exclusive-or number are exactly the same. If we remove 2, the resulting xor will be 10 (or 2).

Now, let's look at n + 1. Let's call the number of times each bit is set in n, x, and the number of times each bit is set in n + 1, y. The value of y will be equal to y = x * 2, because there are x elements with the n + 1 bit set to 0, and x elements with the n + 1 bit set to 1. And since 2x will always be even, n + 1 will always have each bit set an even number of times.

Therefore, since n = 2 works, and n + 1 works, the xor method will work for all values of n >= 2.

The Algorithm For 0-Based Ranges

This is quite simple. It uses 2 * n bits of memory, so for any range <= 32, two 32-bit integers will work (ignoring any memory consumed by the file descriptor). And it makes a single pass of the file.

    long supplied = 0;
    long result = 0;
    while (supplied = read_int_from_file()) {
        result = result ^ supplied;
    }
    return result;

The Algorithm For Arbitrary Based Ranges

This algorithm will work for ranges of any starting number to any ending number, as long as the total range is equal to 2^n. It basically re-bases the range so that its minimum is 0. But it does require 2 passes through the file (the first to grab the minimum, the second to compute the missing int).

    long supplied = 0;
    long result = 0;
    long offset = INT_MAX;
    while (supplied = read_int_from_file()) {
        if (supplied < offset) {
            offset = supplied;
        }
    }
    reset_file_pointer();
    while (supplied = read_int_from_file()) {
        result = result ^ (supplied - offset);
    }
    return result + offset;

Arbitrary Ranges

We can apply this modified method to a set of arbitrary ranges, since all ranges will cross a power of 2^n at least once. This works only if there is a single missing number. It takes 2 passes of an unsorted file, but it will find the single missing number every time:

    long supplied = 0;
    long result = 0;
    long offset = INT_MAX;
    long n = 0;
    while (supplied = read_int_from_file()) {
        if (supplied < offset) {
            offset = supplied;
        }
    }
    reset_file_pointer();
    while (supplied = read_int_from_file()) {
        n++;
        result = result ^ (supplied - offset);
    }
    // We need to increment n by one so that we take care of the missing
    // int value
    n++;
    while (n == 1 || 0 != (n & (n - 1))) {
        result = result ^ (n++);
    }
    return result + offset;

Basically, it re-bases the range around 0. Then it counts the number of unsorted values as it computes the exclusive-or. Then it adds 1 to the count of unsorted values to take care of the missing value (count the missing one). Then it keeps xor-ing the n value, incremented by 1 each time, until n is a power of 2. The result is then re-based back to the original base. Done.

Here's the algorithm I tested in PHP (using an array instead of a file, but same concept):

    function find($array) {
        $offset = min($array);
        $n = 0;
        $result = 0;
        foreach ($array as $value) {
            $result = $result ^ ($value - $offset);
            $n++;
        }
        $n++; // This takes care of the missing value
        while ($n == 1 || 0 != ($n & ($n - 1))) {
            $result = $result ^ ($n++);
        }
        return $result + $offset;
    }

Fed in an array with any range of values (I tested including negatives) with one inside that range which is missing, it found the correct value each time.

Another Approach

Since we can use external sorting, why not just check for a gap? If we assume the file is sorted prior to the running of this algorithm:

    long supplied = 0;
    long last = read_int_from_file();
    while (supplied = read_int_from_file()) {
        if (supplied != last + 1) {
            return last + 1;
        }
        last = supplied;
    }
    // The range is contiguous, so what do we do here? Let's return last + 1:
    return last + 1;

Check the size of the input file, then output any number which is too large to be represented by a file that size. This may seem like a cheap trick, but it's a creative solution to an interview problem, it neatly sidesteps the memory issue, and it's technically O(n).

    void maxNum(ulong filesize)
    {
        ulong bitcount = filesize * 8; // number of bits in file
        for (ulong i = 0; i < bitcount; i++) {
            Console.Write(9);
        }
    }

Should print 10^bitcount - 1, which will always be greater than 2^bitcount. Technically, the number you have to beat is 2^bitcount - (4 × 10^9 - 1), since you know there are (4 billion - 1) other integers in the file, and even with perfect compression they'll take up at least one bit each.

Trick question, unless it's been quoted improperly. Just read through the file once to get the maximum integer n , and return n+1 .

Of course you'd need a backup plan in case n+1 causes an integer overflow.

  • The simplest approach is to find the minimum number in the file, and return 1 less than that. This uses O(1) storage, and O(n) time for a file of n numbers. However, it will fail if the number range is limited, which could make min - 1 not-a-number.

  • The simple and straightforward method of using a bitmap has already been mentioned. That method uses O(n) time and storage.

  • A 2-pass method with 2^16 counting-buckets has also been mentioned. It reads 2*n integers, so uses O(n) time and O(1) storage, but it cannot handle datasets with more than 2^32 numbers. However, it's easily extended to (eg) 2^60 64-bit integers by running 4 passes instead of 2, and easily adapted to using tiny memory by using only as many bins as fit in memory and increasing the number of passes correspondingly, in which case run time is no longer O(n) but instead is O(n*log n).

  • The method of XOR'ing all the numbers together, mentioned so far by rfrankel and at length by ircmaxell answers the question asked in stackoverflow#35185 , as ltn100 pointed out. It uses O(1) storage and O(n) run time. If for the moment we assume 32-bit integers, XOR has a 7% probability of producing a distinct number. Rationale: given ~ 4G distinct numbers XOR'd together, and ca. 300M not in file, the number of set bits in each bit position has equal chance of being odd or even. Thus, 2^32 numbers have equal likelihood of arising as the XOR result, of which 93% are already in file. Note that if the numbers in file aren't all distinct, the XOR method's probability of success rises.

For some reason, as soon as I read this problem I thought of diagonalization. I'm assuming arbitrarily large integers.

Read the first number. Left-pad it with zero bits until you have 4 billion bits. If the first (high-order) bit is 0, output 1; else output 0. (You don't really have to left-pad: you just output a 1 if there are not enough bits in the number.) Do the same with the second number, except use its second bit. Continue through the file in this way. You will output a 4-billion-bit number one bit at a time, and that number will not be the same as any in the file. Proof: if it were the same as the nth number, then they would agree on the nth bit, but by construction they don't.
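A miniature version of this diagonal argument over fixed-width bit strings:

```python
def diagonal(bitstrings, width):
    # Flip the i-th bit of the i-th string (treat missing strings as all-0);
    # the result differs from every listed string in at least one position.
    out = []
    for i in range(width):
        bit = bitstrings[i][i] if i < len(bitstrings) else '0'
        out.append('1' if bit == '0' else '0')
    return ''.join(out)

nums = ['0000', '0101', '1110', '0011']
d = diagonal(nums, 4)   # differs from nums[i] at position i, so not in nums
```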

You can use bit flags to mark whether an integer is present or not.

After traversing the entire file, scan each bit to determine if the number exists or not.

Assuming each integer is 32 bit, they will conveniently fit in 1 GB of RAM if bit flagging is done.

Just for the sake of completeness, here is another very simple solution, which will most likely take a very long time to run, but uses very little memory.

Let all possible integers be the range from int_min to int_max, and bool isNotInFile(integer) a function which returns true if the file does not contain a certain integer and false otherwise (by comparing that certain integer with each integer in the file):

    for (integer i = int_min; i <= int_max; ++i) {
        if (isNotInFile(i)) {
            return i;
        }
    }

For the 10 MB memory constraint:

  1. Convert the number to its binary representation.
  2. Create a binary tree where left = 0 and right = 1.
  3. Insert each number in the tree using its binary representation.
  4. If a number has already been inserted, the leaves will already have been created.

When finished, just take a path that has not been created before to create the requested number.

4 billion numbers ≈ 2^32, meaning 10 MB might not be sufficient.

EDIT

An optimization is possible: if two end leaves have been created and have a common parent, then they can be removed and the parent flagged as not a solution. This cuts branches and reduces the need for memory.

EDIT II

There is no need to build the tree completely, either. You only need to build deep branches if numbers are similar. If we cut branches too, then this solution might actually work.
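A dict-based sketch of the trie idea at 8-bit scale (the full 32-bit version would indeed risk blowing the 10 MB budget without the pruning described above):

```python
def build_trie(nums, bits):
    # Insert each number MSB-first into a nested-dict binary trie.
    root = {}
    for n in nums:
        node = root
        for b in range(bits - 1, -1, -1):
            node = node.setdefault((n >> b) & 1, {})
    return root

def find_hole(node, bits, depth=0, prefix=0):
    # A branch that was never created covers only absent numbers.
    if depth == bits:
        return None                                   # this leaf is present
    for branch in (0, 1):
        if branch not in node:
            return (prefix << (bits - depth)) | (branch << (bits - depth - 1))
    for branch in (0, 1):
        hole = find_hole(node[branch], bits, depth + 1, (prefix << 1) | branch)
        if hole is not None:
            return hole
    return None

m = find_hole(build_trie([n for n in range(256) if n != 77], 8), 8)
```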

Strip the whitespace and non-numeric characters from the file and append 1. Your file now contains a single number not listed in the original file.

From Reddit by Carbonetc.

I will answer the 1 GB version:

There is not enough information in the question, so I will state some assumptions first:

The integer is 32 bits with range -2,147,483,648 to 2,147,483,647.

Pseudo-code:

    var bitArray = new bit[4294967296]; // 0.5 GB, initialized to all 0s.
    foreach (var number in file) {
        bitArray[number + 2147483648] = 1; // Shift all numbers so they start at 0.
    }
    for (var i = 0; i < 4294967296; i++) {
        if (bitArray[i] == 0) {
            return i - 2147483648;
        }
    }

Bit Elimination

One way is to eliminate bits; however, this might not actually yield a result (chances are it won't). Pseudocode:

    long val = 0xFFFFFFFFFFFFFFFF; // (all bits set)
    foreach long fileVal in file {
        val = val & ~fileVal;
        if (val == 0) error;
    }

Bit Counts

Keep track of the bit counts; and use the bits with the least amounts to generate a value. Again this has no guarantee of generating a correct value.

Range Logic

Keep track of a list ordered ranges (ordered by start). A range is defined by the structure:

    struct Range {
        long Start, End; // Inclusive.
    }

    Range startRange = new Range { Start = 0x0, End = 0xFFFFFFFFFFFFFFFF };

Go through each value in the file and try and remove it from the current range. This method has no memory guarantees, but it should do pretty well.
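A sketch of that range-splitting bookkeeping on a tiny domain:

```python
def unseen_ranges(nums, lo, hi):
    # Start with one inclusive range; split it around every value seen.
    # Surviving ranges contain only numbers absent from the input.
    ranges = [(lo, hi)]
    for n in nums:
        updated = []
        for a, b in ranges:
            if a <= n <= b:
                if a < n:
                    updated.append((a, n - 1))
                if n < b:
                    updated.append((n + 1, b))
            else:
                updated.append((a, b))
        ranges = updated
    return ranges

left = unseen_ranges([0, 1, 3, 4, 5], 0, 7)   # every value left is missing
```

The caveat from the text holds: in the worst case the list of ranges can grow with every input value, so there is no memory guarantee.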

As long as we're doing creative answers, here is another one.

Use the external sort program to sort the input file numerically. This will work for any amount of memory you may have (it will use file storage if needed). Read through the sorted file and output the first number that is missing.

2^(128 × 10^18) + 1 (which is (2^8)^(16 × 10^18) + 1): cannot that be a universal answer for today? It represents a number that cannot be held in a 16 EB file, which is the maximum file size in any current file system.

I think this is a solved problem (see above), but there's an interesting side case to keep in mind because it might get asked:

If there are exactly 4,294,967,295 (2^32 – 1) 32-bit integers with no repeats, and therefore only one is missing, there is a simple solution.

Start a running total at zero, and for each integer in the file, add that integer with 32-bit overflow (effectively, runningTotal = (runningTotal + nextInteger) % 4294967296). Once complete, add 4294967296/2 to the running total, again with 32-bit overflow. Subtract this from 4294967296, and the result is the missing integer.

The "only one missing integer" problem is solvable with only one run, and only 64 bits of RAM dedicated to the data (32 for the running total, 32 to read in the next integer).
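The same overflow arithmetic at 8-bit scale (for 32-bit, mod would be 2^32 and mod // 2 the 4294967296/2 above):

```python
def missing_by_modular_sum(nums, bits=8):
    # All 2^bits values present except one, no repeats. The sum of the full
    # set mod 2^bits is 2^(bits - 1), so the gap falls out of the total.
    mod = 1 << bits
    total = 0
    for n in nums:
        total = (total + n) % mod      # running total with overflow
    return (mod - ((total + mod // 2) % mod)) % mod

m = missing_by_modular_sum([n for n in range(256) if n != 200])
```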

Corollary: The more general specification is extremely simple to match if we aren't concerned with how many bits the integer result must have. We just generate a big enough integer that it cannot be contained in the file we're given. Again, this takes up absolutely minimal RAM. See the pseudocode.

    # Grab the file size
    fseek(fp, 0L, SEEK_END);
    sz = ftell(fp);

    # Print a '2' for every two bits of the file; that is more digits than
    # the file could hold.
    for (c = 0; c < sz; c++) {
        for (b = 0; b < 4; b++) {
            print "2";
        }
    }

As Ryan said, basically: sort the file, then go over the integers, and when a value is skipped, there you have it :)

EDIT at downvoters: the OP mentioned that the file could be sorted so this is a valid method.

If you don't assume the 32-bit constraint, just return a randomly generated 64-bit number (or 128-bit if you're a pessimist). The chance of collision is 1 in 2^64/(4*10^9) = 4611686018.4 (roughly 1 in 4.6 billion). You'd be right most of the time!

(Joking… kind of.)