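Fast Linux file count for a large number of files

I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (> 100,000).

When there are that many files, running "ls | wc -l" takes quite a long time. I believe this is because it returns the names of all the files. I'm trying to use as little disk I/O as possible.

I have experimented with some shell and Perl scripts to no avail. Any ideas?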

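By default, ls sorts the names, which can take a while if there are a lot of them. Also, there will be no output until all of the names have been read and sorted. Use the ls -f option to turn off sorting.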

 ls -f | wc -l

Note that this also enables -a, so ., .., and other files starting with . will be counted.
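If the two extra entries matter, they can be subtracted after the fact. This is a minimal sketch, assuming a POSIX shell and file names without embedded newlines; count_entries is a name made up for this example, not an existing tool.

```shell
# count_entries DIR: number of entries in DIR, hidden files included,
# "." and ".." excluded. Hypothetical helper for illustration only.
count_entries() {
    [ -d "$1" ] || return 1
    # ls -f lists unsorted and implies -a, so "." and ".." are in the output
    n=$(ls -f -- "$1" | wc -l)
    echo $((n - 2))
}

# Example: count_entries /path/to/dir
```

A name containing a newline would be counted more than once, so treat the result as approximate on untrusted directories.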

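The fastest way is a purpose-built program, like this:

    #include <stdio.h>
    #include <dirent.h>

    int main(int argc, char *argv[]) {
        DIR *dir;
        struct dirent *ent;
        long count = 0;

        /* Bail out instead of crashing on a missing or unreadable directory */
        if (argc < 2 || !(dir = opendir(argv[1]))) {
            fprintf(stderr, "usage: %s <readable directory>\n", argv[0]);
            return 1;
        }

        /* Count every entry readdir() returns, including "." and ".." */
        while ((ent = readdir(dir)))
            ++count;

        closedir(dir);

        printf("%s contains %ld files\n", argv[1], count);

        return 0;
    }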

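In my testing, without regard to cache, I ran each of these about 50 times against the same directory, over and over, to avoid cache-based data skew, and I got roughly the following performance numbers (in real clock time):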

    ls -1  | wc - 0:01.67
    ls -f1 | wc - 0:00.14
    find   | wc - 0:00.22
    dircnt | wc - 0:00.04

That last one, dircnt, is the program compiled from the source above.
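To repeat a measurement like this, one option is to run each command many times in a row and time the whole loop. The following is only a sketch of that idea, not the harness used for the numbers above; run_many is a hypothetical helper and /some/dir is a placeholder.

```shell
# run_many N CMD [ARGS...]: run CMD N times, discarding its standard output.
# Hypothetical helper; stops early if CMD fails.
run_many() {
    n=$1
    shift
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null || return 1
        i=$((i + 1))
    done
}

# Example (in bash, where "time" is a keyword and can time a function):
#   time run_many 50 sh -c 'ls -f /some/dir | wc -l'
```

After the first iteration the directory is most likely served from cache, so this measures the warm-cache case.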

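EDIT 2016-09-26

Due to popular demand, I've rewritten this program to be recursive, so it will descend into subdirectories and continue to count files and directories separately.

Since it's clear some folks want to know how to do all this, I have put a lot of comments in the code to try to make it obvious what's going on. I wrote this and tested it on 64-bit Linux, but it should work on any POSIX-compliant system, including Microsoft Windows. Bug reports are welcome; I'm happy to update this if you can't get it working on your AIX or OS/400 or whatever.

As you can see, it's much more complicated than the original, and necessarily so: at least one function must exist to be called recursively unless you want the code to become very complex (e.g. managing a subdirectory stack and processing it in a single loop). Since we have to check file types, differences between operating systems, standard libraries, etc. come into play, so I have written a program that tries to be usable on any system where it will compile.

There is very little error checking, and the count function itself doesn't really report errors. The only calls that can really fail are opendir and stat (the latter only if you aren't lucky enough to have a system where dirent already contains the file type). I'm not paranoid about checking the total length of the subdirectory pathnames, but theoretically the system shouldn't allow any path name longer than PATH_MAX. If there are concerns, I can fix that, but it's just more code that needs to be explained to someone learning to write C. This program is intended to be an example of how to dive into subdirectories recursively.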

    #include <stdio.h>
    #include <dirent.h>
    #include <string.h>
    #include <stdlib.h>
    #include <limits.h>
    #include <sys/stat.h>

    #if defined(WIN32) || defined(_WIN32)
    #define PATH_SEPARATOR '\\'
    #else
    #define PATH_SEPARATOR '/'
    #endif

    /* A custom structure to hold separate file and directory counts */
    struct filecount {
        long dirs;
        long files;
    };

    /*
     * counts the number of files and directories in the specified directory.
     *
     * path - relative pathname of a directory whose files should be counted
     * counts - pointer to struct containing file/dir counts
     */
    void count(char *path, struct filecount *counts) {
        DIR *dir;                /* dir structure we are reading */
        struct dirent *ent;      /* directory entry currently being processed */
        char subpath[PATH_MAX];  /* buffer for building complete subdir and file names */
        /* Some systems don't have dirent.d_type field; we'll have to use stat() instead */
    #if !defined ( _DIRENT_HAVE_D_TYPE )
        struct stat statbuf;     /* buffer for stat() info */
    #endif

        /* fprintf(stderr, "Opening dir %s\n", path); */
        dir = opendir(path);

        /* opendir failed... file likely doesn't exist or isn't a directory */
        if(NULL == dir) {
            perror(path);
            return;
        }

        while((ent = readdir(dir))) {
            /* ">=" leaves room for the terminating NUL in subpath */
            if (strlen(path) + 1 + strlen(ent->d_name) >= PATH_MAX) {
                fprintf(stdout, "path too long (%lu) %s%c%s\n",
                        (unsigned long)(strlen(path) + 1 + strlen(ent->d_name)),
                        path, PATH_SEPARATOR, ent->d_name);
                closedir(dir);
                return;
            }

            /* Use dirent.d_type if present, otherwise use stat() */
    #if defined ( _DIRENT_HAVE_D_TYPE )
            /* fprintf(stderr, "Using dirent.d_type\n"); */
            /* NB: some filesystems report DT_UNKNOWN; such entries are counted as files */
            if(DT_DIR == ent->d_type) {
    #else
            /* fprintf(stderr, "Don't have dirent.d_type, falling back to using stat()\n"); */
            sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
            if(lstat(subpath, &statbuf)) {
                perror(subpath);
                closedir(dir);
                return;
            }

            if(S_ISDIR(statbuf.st_mode)) {
    #endif
                /* Skip "." and ".." directory entries... they are not "real" directories */
                if(0 == strcmp("..", ent->d_name) || 0 == strcmp(".", ent->d_name)) {
                    /* fprintf(stderr, "This is %s, skipping\n", ent->d_name); */
                } else {
                    sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
                    counts->dirs++;
                    count(subpath, counts);
                }
            } else {
                counts->files++;
            }
        }

        /* fprintf(stderr, "Closing dir %s\n", path); */
        closedir(dir);
    }

    int main(int argc, char *argv[]) {
        struct filecount counts;
        counts.files = 0;
        counts.dirs = 0;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <directory>\n", argv[0]);
            return 1;
        }

        count(argv[1], &counts);

        /* If we found nothing, this is probably an error which has already been printed */
        if(0 < counts.files || 0 < counts.dirs) {
            printf("%s contains %ld files and %ld directories\n", argv[1], counts.files, counts.dirs);
        }

        return 0;
    }
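For comparison, roughly the same pair of numbers can be obtained from find. This is a sketch rather than a replacement for the program: it assumes a find that supports -mindepth and -print0 (GNU and BSD find do), and count_tree is a hypothetical name. Like the lstat-based code, it does not follow symbolic links, so a link to a directory is counted as a file.

```shell
# count_tree DIR: print recursive counts of non-directories and directories
# below DIR. Hypothetical helper; DIR itself is not counted.
count_tree() {
    dir=$1
    # -print0 emits one NUL byte per entry, so counting NULs stays correct
    # even for names that contain newlines
    files=$(find "$dir" -mindepth 1 ! -type d -print0 | tr -cd '\0' | wc -c)
    dirs=$(find "$dir" -mindepth 1 -type d -print0 | tr -cd '\0' | wc -c)
    printf '%s contains %d files and %d directories\n' "$dir" "$files" "$dirs"
}

# Example: count_tree /path/to/dir
```

This walks the tree twice, which keeps the sketch short at the cost of speed.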

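EDIT 2017-01-17

I've incorporated two changes suggested by @FlyingCodeMonkey:

1. Use lstat instead of stat. This changes the behavior of the program if the directory you are scanning contains symlinked directories. The previous behavior was that the (linked) subdirectory would have its file count added to the overall count; the new behavior is that the linked directory counts as a single file, and its contents are not counted.
2. If the path of a file is too long, an error message is emitted and the program halts.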
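The difference between the two behaviors can be reproduced with find, which does not follow symbolic links by default and follows them with -L. A small sketch, with hypothetical function names:

```shell
# Count regular files below DIR without following symlinks
# (comparable to the lstat behavior described above).
count_files_nofollow() {
    find "$1" -type f | wc -l
}

# Count regular files below DIR, descending into symlinked directories
# (comparable to the earlier stat behavior).
count_files_follow() {
    find -L "$1" -type f | wc -l
}

# Example:
#   count_files_nofollow /path/to/dir
#   count_files_follow /path/to/dir
```

With a directory that holds one file plus a symlink to another directory containing two files, the first function reports 1 and the second reports 3.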

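EDIT 2017-06-29

With any luck, this will be the last edit of this answer :)

I've copied this code into a GitHub repository to make it a bit easier to get the code (instead of copy/paste, you can just download the source), and it also makes it easier for anyone to suggest a modification by submitting a pull request from GitHub.

The source is available under the Apache License 2.0. Patches* welcome!

* "Patch" is what old people like me call a "pull request".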

Did you try find? For example:

 find . -name "*.ext" | wc -l

I tested find, ls, and perl against 40,000 files: same speed (though I didn't try to clear the cache):

    [user@server logs]$ time find . | wc -l
    42917

    real    0m0.054s
    user    0m0.018s
    sys     0m0.040s

    [user@server logs]$ time /bin/ls -f | wc -l
    42918

    real    0m0.059s
    user    0m0.027s
    sys     0m0.037s

And with perl opendir/readdir, same time:

    [user@server logs]$ time perl -e 'opendir D, "."; @files = readdir D; closedir D; print scalar(@files)."\n"'
    42918

    real    0m0.057s
    user    0m0.024s
    sys     0m0.033s

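Note: I used /bin/ls -f to make sure I bypass any alias options, which might slow things down a little, and -f to avoid sorting the file names. ls without -f is about twice as slow as find/perl; with -f it takes about the same time: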

    [user@server logs]$ time /bin/ls . | wc -l
    42916

    real    0m0.109s
    user    0m0.070s
    sys     0m0.044s

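I would also like to have some script that asks the file system directly, without all the unnecessary information.

Tests based on the answers of Peter van der Heijden, glenn jackman and mark4o.

Thomas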

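You can change the output to suit your requirements, but here is a bash one-liner I wrote to recursively count and report the number of files in a series of numerically named directories.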

 dir=/tmp/count_these/ ; for i in $(ls -1 ${dir} | sort -n) ; { echo "$i => $(find ${dir}${i} -type f | wc -l),"; }

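This looks recursively for all files (not directories) in each of the given directories and returns the results in a hash-like format. Simple tweaks to the find command can make the kind of files you want to count more specific, etc.

The result looks something like this: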

    1 => 38,
    65 => 95052,
    66 => 12823,
    67 => 10572,
    69 => 67275,
    70 => 8105,
    71 => 42052,
    72 => 1184,
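The one-liner parses the output of ls, which breaks on directory names containing spaces. A variant that uses a shell glob instead might look like the sketch below; count_per_subdir is a hypothetical name, and the per-directory count still assumes file names without embedded newlines.

```shell
# count_per_subdir DIR: for each immediate subdirectory of DIR, print
# "name => number of regular files below it". Hypothetical helper.
count_per_subdir() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue      # the glob matched nothing
        name=${d%/}                  # strip the trailing slash
        name=${name##*/}             # keep only the last path component
        n=$(find "$d" -type f | wc -l)
        printf '%s => %d\n' "$name" "$n"
    done
}

# Example: count_per_subdir /tmp/count_these | sort -n
```

The glob returns names in lexical order, so the numeric ordering of the original is restored by piping through sort -n.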

Surprisingly for me, a bare-bones find is very much comparable to ls -f

    > time ls -f my_dir | wc -l
    17626

    real    0m0.015s
    user    0m0.011s
    sys     0m0.009s

versus

    > time find my_dir -maxdepth 1 | wc -l
    17625

    real    0m0.014s
    user    0m0.008s
    sys     0m0.010s

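Of course, the values in the third decimal place shift around a bit every time you run any of these, so they are basically identical. Notice, however, that find returns one extra unit, because it counts the directory itself (and, as mentioned before, ls -f returns two extra units, since it also counts . and ..).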
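The offsets are easy to see side by side. The sketch below prints the three counts for one directory; compare_counts is a made-up name, and -mindepth/-maxdepth assume GNU or BSD find.

```shell
# compare_counts DIR: print "A B C" where
#   A = lines from ls -f                       (entries + 2: "." and "..")
#   B = lines from find -maxdepth 1            (entries + 1: DIR itself)
#   C = lines from find -mindepth 1 -maxdepth 1 (entries only)
compare_counts() {
    a=$(ls -f -- "$1" | wc -l)
    b=$(find "$1" -maxdepth 1 | wc -l)
    c=$(find "$1" -mindepth 1 -maxdepth 1 | wc -l)
    echo "$((a + 0)) $((b + 0)) $((c + 0))"
}

# Example: compare_counts my_dir
```

For a directory holding three entries this prints 5 4 3.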

You could check whether using opendir() and readdir() in Perl is faster. For an example of those functions, look here

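Just adding this for the sake of completeness. The correct answer has of course already been posted by someone else, but you can also get a count of files and directories with the tree program.

Run the command tree | tail -n 1 to get the last line, which will say something like "763 directories, 9290 files". This counts files and folders recursively, excluding hidden files, which can be added with the flag -a. For reference, it took 4.8 seconds on my computer for tree to count my whole home directory, which was 24777 directories, 238680 files. find -type f | wc -l took 5.3 seconds, half a second longer, so I think tree is pretty competitive speed-wise.

As long as you don't have any subfolders, tree is a quick and easy way to count the files.

Also, purely for the fun of it, you can use tree | grep '^├' to show only the files/folders in the current directory; this is basically a much slower version of ls.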

ls spends most of its time sorting the file names; using -f to disable sorting saves some time:

 ls -f | wc -l

Or you can use find:

 find . -type f | wc -l

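I came here when trying to count the files in a dataset of ~10K folders with ~10K files each. The problem with many of the approaches is that they implicitly stat 100M files, which takes ages.

I took the liberty of extending the approach by christopher-schultz so that it supports passing directories via args (his recursive approach uses stat as well).

Put the following into the file dircnt_args.c: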

    #include <stdio.h>
    #include <dirent.h>

    int main(int argc, char *argv[]) {
        DIR *dir;
        struct dirent *ent;
        long count;
        long countsum = 0;
        int i;

        for(i=1; i < argc; i++) {
            dir = opendir(argv[i]);
            /* Skip arguments that cannot be opened as directories */
            if (dir == NULL) {
                perror(argv[i]);
                continue;
            }
            count = 0;
            /* Counts every entry, including "." and ".." */
            while((ent = readdir(dir)))
                ++count;

            closedir(dir);

            printf("%s contains %ld files\n", argv[i], count);
            countsum += count;
        }
        printf("sum: %ld\n", countsum);

        return 0;
    }

After a gcc -o dircnt_args dircnt_args.c you can invoke it like this:

 dircnt_args /your/dirs/*

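On 100M files in 10K folders the above completes quite quickly (~5 min for the first run, follow-up runs from cache: ~23 s).

The only other approach that finished in less than an hour was ls, at about 1 min from cache: ls -f /your/dirs/* | wc -l. The count is off by a couple of newlines per directory, though...

Contrary to what I expected, none of my attempts with find returned within an hour :-/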
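The extra lines come from the per-directory header and separator lines that ls prints for multiple arguments, plus . and .. in each directory. One way around that is to count each directory separately and subtract 2 every time. This is only a sketch with a hypothetical function name; it starts one ls process per directory, so it will be slower than the single ls call timed above.

```shell
# sum_entries DIR...: total number of entries across the given directories,
# excluding "." and ".." in each. Hypothetical helper; assumes file names
# without embedded newlines.
sum_entries() {
    total=0
    for d in "$@"; do
        [ -d "$d" ] || continue
        n=$(ls -f -- "$d" | wc -l)
        total=$((total + n - 2))
    done
    echo "$total"
}

# Example: sum_entries /your/dirs/*
```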

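I'm writing this here because I don't have enough reputation points to comment on an answer, yet I'm allowed to leave my own answer, which doesn't make sense. Anyway...

Regarding the answer by Christopher Schultz, I suggest changing stat to lstat and possibly adding a bounds check to avoid a buffer overflow: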

    /* PATH_SEPARATOR is a single char, so it adds 1 to the length;
       ">=" leaves room for the terminating NUL */
    if (strlen(path) + 1 + strlen(ent->d_name) >= PATH_MAX) {
        fprintf(stdout, "path too long (%lu) %s%c%s\n",
                (unsigned long)(strlen(path) + 1 + strlen(ent->d_name)),
                path, PATH_SEPARATOR, ent->d_name);
        return;
    }

The suggestion to use lstat is meant to avoid following symlinks, which could lead to cycles if a directory contains a symlink to a parent directory.

Top 10 directories with the highest number of files.

    dir=/ ; for i in $(ls -1 ${dir} | sort -n) ; { echo "$(find ${dir}${i} -type f | wc -l) => $i,"; } | sort -nr | head -10

This answer is faster than almost everything else on this page for very large, deeply nested directories:

https://serverfault.com/a/691372/84703

locate -r '.' | grep -c "^$PWD"

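I realized that when you have a huge amount of data, avoiding in-memory processing is faster than piping the commands. So I saved the result to a file and analyzed it afterwards: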

 ls -1 /path/to/dir > count.txt && cat count.txt | wc -l

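The fastest way on Linux (the question is tagged linux) is to use a direct system call. Here is a little program that counts the files (only files, no directories) in a directory. You can count millions of files, and it is around 2.5 times faster than "ls -f" and around 1.3-1.5 times faster than Christopher Schultz's answer.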

    #define _GNU_SOURCE
    #include <dirent.h>
    #include <stdio.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>      /* syscall(), close() */
    #include <sys/syscall.h>

    #define BUF_SIZE 4096

    /* Record layout returned by the getdents system call */
    struct linux_dirent {
        long d_ino;
        off_t d_off;
        unsigned short d_reclen;
        char d_name[];
    };

    int countDir(char *dir) {
        int fd, nread, bpos, numFiles = 0;
        char d_type, buf[BUF_SIZE];
        struct linux_dirent *dirEntry;

        fd = open(dir, O_RDONLY | O_DIRECTORY);
        if (fd == -1) {
            puts("open directory error");
            exit(3);
        }
        while (1) {
            /* Read the next batch of directory records into buf */
            nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
            if (nread == -1) {
                puts("getdents error");
                exit(1);
            }
            if (nread == 0) {
                break;
            }

            for (bpos = 0; bpos < nread;) {
                dirEntry = (struct linux_dirent *) (buf + bpos);
                /* The file type is stored in the last byte of each record */
                d_type = *(buf + bpos + dirEntry->d_reclen - 1);
                if (d_type == DT_REG) {
                    // Increase counter
                    numFiles++;
                }
                bpos += dirEntry->d_reclen;
            }
        }
        close(fd);

        return numFiles;
    }

    int main(int argc, char **argv) {
        if (argc != 2) {
            puts("Pass directory as parameter");
            return 2;
        }
        printf("Number of files in %s: %d\n", argv[1], countDir(argv[1]));
        return 0;
    }

PS: It is not recursive, but you could modify it to achieve that.