为给定的string生成所有唯一的子string

给定一个strings ,生成一组所有唯一子串的最快方法是什么?

例如:对于str = "aba"我们将得到substrs={"a", "b", "ab", "ba", "aba"}

最初的algorithm是在每次迭代中遍历整个string长度为1..n子串,产生一个O(n^2)上界。

是否有更好的界限?

(这在技术上是功课,所以指针也是受欢迎的)

正如其他海报所说,给定的string可能有O(n ^ 2)个子string,所以将它们打印出来不可能比这更快。 然而,存在一个可以线性时间构造的集合的高效表示: 后缀树 。

由于在一个string中总共有O(n 2 )个子string,所以没有办法比O(n 2 )快得多,所以如果你必须全部生成它们,它们的数目将是n(n + 1) / 2在最坏的情况下,因此O(n 2 Ω(n 2 )的上限

首先是复杂度为O(N ^ 3)的复杂度为O(N ^ 2)的复杂度为O(N ^ 2)的复杂度为O(N ^ 2 log(N)所有的最坏情况O(N ^ 2)和最好情况O(N Log(N))的给定string的后缀。

第一个解决scheme:

 import java.util.Scanner; public class DistinctSubString { public static void main(String[] args) { Scanner in = new Scanner(System.in); System.out.print("Enter The string"); String s = in.nextLine(); long startTime = System.currentTimeMillis(); int L = s.length(); int N = L * (L + 1) / 2; String[] Comb = new String[N]; for (int i = 0, p = 0; i < L; ++i) { for (int j = 0; j < (L - i); ++j) { Comb[p++] = s.substring(j, i + j + 1); } } /* * for(int j=0;j<N;++j) { System.out.println(Comb[j]); } */ boolean[] val = new boolean[N]; for (int i = 0; i < N; ++i) val[i] = true; int counter = N; int p = 0, start = 0; for (int i = 0, j; i < L; ++i) { p = L - i; for (j = start; j < (start + p); ++j) { if (val[j]) { //System.out.println(Comb[j]); for (int k = j + 1; k < start + p; ++k) { if (Comb[j].equals(Comb[k])) { counter--; val[k] = false; } } } } start = j; } System.out.println("Substrings are " + N + " of which unique substrings are " + counter); long endTime = System.currentTimeMillis(); System.out.println("It took " + (endTime - startTime) + " milliseconds"); } } 

第二个解决scheme:

 import java.util.*; public class DistictSubstrings_usingHashTable { public static void main(String args[]) { // create a hash set Scanner in = new Scanner(System.in); System.out.print("Enter The string"); String s = in.nextLine(); int L = s.length(); long startTime = System.currentTimeMillis(); Set<String> hs = new HashSet<String>(); // add elements to the hash set for (int i = 0; i < L; ++i) { for (int j = 0; j < (L - i); ++j) { hs.add(s.substring(j, i + j + 1)); } } System.out.println(hs.size()); long endTime = System.currentTimeMillis(); System.out.println("It took " + (endTime - startTime) + " milliseconds"); } } 

第三个解决scheme:

 import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; import java.util.Arrays; public class LCPsolnFroDistinctSubString { public static void main(String[] args) throws IOException { BufferedReader br = new BufferedReader(new InputStreamReader(System.in)); System.out.println("Enter Desired String "); String string = br.readLine(); int length = string.length(); String[] arrayString = new String[length]; for (int i = 0; i < length; ++i) { arrayString[i] = string.substring(length - 1 - i, length); } Arrays.sort(arrayString); for (int i = 0; i < length; ++i) System.out.println(arrayString[i]); long num_substring = arrayString[0].length(); for (int i = 0; i < length - 1; ++i) { int j = 0; for (; j < arrayString[i].length(); ++j) { if (!((arrayString[i].substring(0, j + 1)).equals((arrayString)[i + 1] .substring(0, j + 1)))) { break; } } num_substring += arrayString[i + 1].length() - j; } System.out.println("unique substrings = " + num_substring); } } 

对于大哦…最好你可以做的是O(n ^ 2)

没有必要重新发明轮子,它不是基于一个string,而是一套,所以你将不得不采取的概念,并将其应用于自己的情况。

algorithm

MS真的很好的白皮书

深入PowerPoint

博客上的string烫发

好吧,因为有可能有n*(n+1)/2不同的子串(空子串为+1),所以我怀疑你可能比O(n*2) (最坏的情况)。 最简单的事情就是生成它们,并使用一些漂亮的O(1)查找表(如散列表)来排除重复项,当你find它们。

 class program { List<String> lst = new List<String>(); String str = "abc"; public void func() { subset(0, ""); lst.Sort(); lst = lst.Distinct().ToList(); foreach (String item in lst) { Console.WriteLine(item); } } void subset(int n, String s) { for (int i = n; i < str.Length; i++) { lst.Add(s + str[i].ToString()); subset(i + 1, s + str[i].ToString()); } } } 
 class SubstringsOfAString { public static void main(String args[]) { String string = "Hello", sub = null; System.out.println("Substrings of \"" + string + "\" are :-"); for (int i = 0; i < string.length(); i++) { for (int j = 1; j <= string.length() - i; j++) { sub = string.substring(i, j + i); System.out.println(sub); } } } } 

它只能在o(n ^ 2)时间内完成,因为一个string的唯一子串的总数将是n(n + 1)/ 2。

例:

strings =“abcd”

通过0 :(所有string的长度都是1)

a,b,c,d = 4个string

通过1 :(所有的string的长度是2)

ab,bc,cd = 3个string

通过2 :(所有的string都是3的长度)

abc,bcd = 2个string

通过3 :(所有的string的长度是4)

abcd = 1个string

使用这个类比,我们可以写出o(n ^ 2)时间复杂度和恒定的空间复杂度的解决scheme。

源代码如下:

 #include<stdio.h> void print(char arr[], int start, int end) { int i; for(i=start;i<=end;i++) { printf("%c",arr[i]); } printf("\n"); } void substrings(char arr[], int n) { int pass,j,start,end; int no_of_strings = n-1; for(pass=0;pass<n;pass++) { start = 0; end = start+pass; for(j=no_of_strings;j>=0;j--) { print(arr,start, end); start++; end = start+pass; } no_of_strings--; } } int main() { char str[] = "abcd"; substrings(str,4); return 0; } 

这是我在Python中的代码。 它生成任何给定string的所有可能的子string。

 def find_substring(str_in): substrs = [] if len(str_in) <= 1: return [str_in] s1 = find_substring(str_in[:1]) s2 = find_substring(str_in[1:]) substrs.append(s1) substrs.append(s2) for s11 in s1: substrs.append(s11) for s21 in s2: substrs.append("%s%s" %(s11, s21)) for s21 in s2: substrs.append(s21) return set(substrs) 

如果将str_ =“abcdef”传递给该函数,则会生成以下结果:

abc abc abc abcde abcdef abcdf abce abcef abcf abd abde abdef abdf abe abef abf ac acd acde acdef acdf ace acef acf abc abc abcf abcf abcf abcf abc abcf abf abc abf abf abc abf abf abc abc abf abc abcdef abcdf abc abcf abcf abc abcf abcf abd abdef abf abf abc abf abf ac ad,adef,adf,ae,aef,af,b,bc,bcd,bcde,bcdef,bcdf,bce,bcef,bcf,bd,bde,bdef,bdf,be,bef,bf,c,cd, cde,cdef,cdf,ce,cef,cf,d,de,def,df,e,ef,f

朴素algorithm需要O(n ^ 3)时间而不是O(n ^ 2)时间。 有O(n ^ 2)个子string。 如果你把O(n ^ 2)个子string,例如set,那么设置比较每个string的O(lgn)比较,以检查它是否存在于集合中。 此外,string比较需要O(n)个时间。 因此,如果使用set,则需要O(n ^ 3 lgn)次。 如果使用hashtable而不是set,则可以减lessO(n ^ 3)时间。

重点是string比较而不是数字比较。

因此,如果使用后缀数组和最长公共前缀(LCP)algorithm,最好的algorithm之一就是减lessO(n ^ 2)时间。 使用O(n)时间algorithm构build后缀数组。 时间LCP = O(n)时间。 由于对于后缀数组中的每一对string,执行LCP,因此总时间是O(n ^ 2)时间来查找不同子string的长度

另外如果你想打印所有不同的子string,它需要O(n ^ 2)时间。

这将打印唯一的子string。 https://ideone.com/QVWOh0

 def uniq_substring(test): lista=[] [lista.append(test[i:i+k+1]) for i in range(len(test)) for k in range(len(test)-i) if test[i:i+k+1] not in lista and test[i:i+k+1][::-1] not in lista] print lista uniq_substring('rohit') uniq_substring('abab') ['r', 'ro', 'roh', 'rohi', 'rohit', 'o', 'oh', 'ohi', 'ohit', 'h', 'hi', 'hit', 'i', 'it', 't'] ['a', 'ab', 'aba', 'abab', 'b', 'bab'] 

试试这个代码使用后缀数组和最长的通用前缀。 它也可以给你唯一的子串的总数。 该代码可能在Visual Studio中产生堆栈溢出,但在Eclipse C ++中运行良好。 这是因为它返回函数的向量。 还没有对极长的string进行testing。 将这样做,并报告回来。

  // C++ program for building LCP array for given text #include <bits/stdc++.h> #include <vector> #include <string> using namespace std; #define MAX 100000 int cum[MAX]; // Structure to store information of a suffix struct suffix { int index; // To store original index int rank[2]; // To store ranks and next rank pair }; // A comparison function used by sort() to compare two suffixes // Compares two pairs, returns 1 if first pair is smaller int cmp(struct suffix a, struct suffix b) { return (a.rank[0] == b.rank[0])? (a.rank[1] < b.rank[1] ?1: 0): (a.rank[0] < b.rank[0] ?1: 0); } // This is the main function that takes a string 'txt' of size n as an // argument, builds and return the suffix array for the given string vector<int> buildSuffixArray(string txt, int n) { // A structure to store suffixes and their indexes struct suffix suffixes[n]; // Store suffixes and their indexes in an array of structures. // The structure is needed to sort the suffixes alphabatically // and maintain their old indexes while sorting for (int i = 0; i < n; i++) { suffixes[i].index = i; suffixes[i].rank[0] = txt[i] - 'a'; suffixes[i].rank[1] = ((i+1) < n)? (txt[i + 1] - 'a'): -1; } // Sort the suffixes using the comparison function // defined above. sort(suffixes, suffixes+n, cmp); // At his point, all suffixes are sorted according to first // 2 characters. Let us sort suffixes according to first 4 // characters, then first 8 and so on int ind[n]; // This array is needed to get the index in suffixes[] // from original index. This mapping is needed to get // next suffix. for (int k = 4; k < 2*n; k = k*2) { // Assigning rank and index values to first suffix int rank = 0; int prev_rank = suffixes[0].rank[0]; suffixes[0].rank[0] = rank; ind[suffixes[0].index] = 0; // Assigning rank to suffixes for (int i = 1; i < n; i++) { // If first rank and next ranks are same as that of previous // suffix in array, assign the same new rank to this suffix if (suffixes[i].rank[0] == prev_rank && suffixes[i].rank[1] == suffixes[i-1].rank[1]) { prev_rank = suffixes[i].rank[0]; suffixes[i].rank[0] = rank; } else // Otherwise increment rank and assign { prev_rank = suffixes[i].rank[0]; suffixes[i].rank[0] = ++rank; } ind[suffixes[i].index] = i; } // Assign next rank to every suffix for (int i = 0; i < n; i++) { int nextindex = suffixes[i].index + k/2; suffixes[i].rank[1] = (nextindex < n)? suffixes[ind[nextindex]].rank[0]: -1; } // Sort the suffixes according to first k characters sort(suffixes, suffixes+n, cmp); } // Store indexes of all sorted suffixes in the suffix array vector<int>suffixArr; for (int i = 0; i < n; i++) suffixArr.push_back(suffixes[i].index); // Return the suffix array return suffixArr; } /* To construct and return LCP */ vector<int> kasai(string txt, vector<int> suffixArr) { int n = suffixArr.size(); // To store LCP array vector<int> lcp(n, 0); // An auxiliary array to store inverse of suffix array // elements. For example if suffixArr[0] is 5, the // invSuff[5] would store 0. This is used to get next // suffix string from suffix array. vector<int> invSuff(n, 0); // Fill values in invSuff[] for (int i=0; i < n; i++) invSuff[suffixArr[i]] = i; // Initialize length of previous LCP int k = 0; // Process all suffixes one by one starting from // first suffix in txt[] for (int i=0; i<n; i++) { /* If the current suffix is at n-1, then we don't have next substring to consider. So lcp is not defined for this substring, we put zero. */ if (invSuff[i] == n-1) { k = 0; continue; } /* j contains index of the next substring to be considered to compare with the present substring, ie, next string in suffix array */ int j = suffixArr[invSuff[i]+1]; // Directly start matching from k'th index as // at-least k-1 characters will match while (i+k<n && j+k<n && txt[i+k]==txt[j+k]) k++; lcp[invSuff[i]] = k; // lcp for the present suffix. // Deleting the starting character from the string. if (k>0) k--; } // return the constructed lcp array return lcp; } // Utility function to print an array void printArr(vector<int>arr, int n) { for (int i = 0; i < n; i++) cout << arr[i] << " "; cout << endl; } // Driver program int main() { int t; cin >> t; //t = 1; while (t > 0) { //string str = "banana"; string str; cin >> str; // >> k; vector<int>suffixArr = buildSuffixArray(str, str.length()); int n = suffixArr.size(); cout << "Suffix Array : \n"; printArr(suffixArr, n); vector<int>lcp = kasai(str, suffixArr); cout << "\nLCP Array : \n"; printArr(lcp, n); // cum will hold number of substrings if that'a what you want (total = cum[n-1] cum[0] = n - suffixArr[0]; // vector <pair<int,int>> substrs[n]; int count = 1; for (int i = 1; i <= n-suffixArr[0]; i++) { //substrs[0].push_back({suffixArr[0],i}); string sub_str = str.substr(suffixArr[0],i); cout << count << " " << sub_str << endl; count++; } for(int i = 1;i < n;i++) { cum[i] = cum[i-1] + (n - suffixArr[i] - lcp[i - 1]); int end = n - suffixArr[i]; int begin = lcp[i-1] + 1; int begin_suffix = suffixArr[i]; for (int j = begin, k = 1; j <= end; j++, k++) { //substrs[i].push_back({begin_suffix, lcp[i-1] + k}); // cout << "i push " << i << " " << begin_suffix << " " << k << endl; string sub_str = str.substr(begin_suffix, lcp[i-1] +k); cout << count << " " << sub_str << endl; count++; } } /*int count = 1; cout << endl; for(int i = 0; i < n; i++){ for (auto it = substrs[i].begin(); it != substrs[i].end(); ++it ) { string sub_str = str.substr(it->first, it->second); cout << count << " " << sub_str << endl; count++; } }*/ t--; } return 0; } 

这是一个更简单的algorithm:

 #include <iostream> #include <string.h> #include <vector> #include <string> #include <algorithm> #include <time.h> using namespace std; char txt[100000], *p[100000]; int m, n; int cmp(const void *p, const void *q) { int rc = memcmp(*(char **)p, *(char **)q, m); return rc; } int main() { std::cin >> txt; int start_s = clock(); n = strlen(txt); int k; int i; int count = 1; for (m = 1; m <= n; m++) { for (k = 0; k+m <= n; k++) p[k] = txt+k; qsort(p, k, sizeof(p[0]), &cmp); for (i = 0; i < k; i++) { if (i != 0 && cmp(&p[i-1], &p[i]) == 0){ continue; } char cur_txt[100000]; memcpy(cur_txt, p[i],m); cur_txt[m] = '\0'; std::cout << count << " " << cur_txt << std::endl; count++; } } cout << --count << endl; int stop_s = clock(); float run_time = (stop_s - start_s) / double(CLOCKS_PER_SEC); cout << endl << "distinct substrings \t\tExecution time = " << run_time << " seconds" << endl; return 0; } 

两种algorithm都列出了一个非常慢的string,尽pipe如此。 我testing了一串长度超过47,000的algorithm,algorithm花费了20分钟才完成,第一个algorithm花了1200秒,第二个algorithm花了1360秒,这就是计算唯一的子串而不输出到terminal。 因此,对于可能长达1000的string,您可能会得到一个工作解决scheme。 尽pipe这两种解决scheme确实计算了独特的子串总数。 我testing了两个algorithm的string长度为2000和10,000。 第一个algorithm的时间是:0.33秒和12秒; 第二种algorithm是0.535秒和20秒。 所以看起来总的来说第一个algorithm更快。

你的程序不给予独特的sbstrins。

请inputababtesting,输出应该是aba,ba,bab,abab