哪个提交有这个blob?

给定一个blob的散列,有没有办法获得这个blob在树中的提交列表?

以下两个脚本都将blob的SHA1作为第一个参数,并在其后面,可选地,任何参数, git log将理解。 例如 – 所有的分支search,而不是只search当前的分支,或者-g在reflog中search,或者search其他任何你喜欢的东西。

在这里它是一个shell脚本 – 简短而又甜蜜,但速度很慢:

 #!/bin/sh obj_name="$1" shift git log "$@" --pretty=format:'%T %h %s' \ | while read tree commit subject ; do if git ls-tree -r $tree | grep -q "$obj_name" ; then echo $commit "$subject" fi done 

而在Perl中的优化版本,仍然很短,但更快:

 #!/usr/bin/perl use 5.008; use strict; use Memoize; my $obj_name; sub check_tree { my ( $tree ) = @_; my @subtree; { open my $ls_tree, '-|', git => 'ls-tree' => $tree or die "Couldn't open pipe to git-ls-tree: $!\n"; while ( <$ls_tree> ) { /\A[0-7]{6} (\S+) (\S+)/ or die "unexpected git-ls-tree output"; return 1 if $2 eq $obj_name; push @subtree, $2 if $1 eq 'tree'; } } check_tree( $_ ) && return 1 for @subtree; return; } memoize 'check_tree'; die "usage: git-find-blob <blob> [<git-log arguments ...>]\n" if not @ARGV; my $obj_short = shift @ARGV; $obj_name = do { local $ENV{'OBJ_NAME'} = $obj_short; `git rev-parse --verify \$OBJ_NAME`; } or die "Couldn't parse $obj_short: $!\n"; chomp $obj_name; open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s' or die "Couldn't open pipe to git-log: $!\n"; while ( <$log> ) { chomp; my ( $tree, $commit, $subject ) = split " ", $_, 3; print "$commit $subject\n" if check_tree( $tree ); } 

不幸的是脚本对我来说有点慢,所以我不得不优化一下。 幸运的是,我不仅有散列,还有文件的path。

 git log --all --pretty=format:%H <path> | xargs -n1 -I% sh -c "git ls-tree % <path> | grep -q <hash> && echo %" 

我认为这将是一个普遍有用的东西,所以我写了一个小Perl脚本来做到这一点:

 #!/usr/bin/perl -w use strict; my @commits; my %trees; my $blob; sub blob_in_tree { my $tree = $_[0]; if (defined $trees{$tree}) { return $trees{$tree}; } my $r = 0; open(my $f, "git cat-file -p $tree|") or die $!; while (<$f>) { if (/^\d+ blob (\w+)/ && $1 eq $blob) { $r = 1; } elsif (/^\d+ tree (\w+)/) { $r = blob_in_tree($1); } last if $r; } close($f); $trees{$tree} = $r; return $r; } sub handle_commit { my $commit = $_[0]; open(my $f, "git cat-file commit $commit|") or die $!; my $tree = <$f>; die unless $tree =~ /^tree (\w+)$/; if (blob_in_tree($1)) { print "$commit\n"; } while (1) { my $parent = <$f>; last unless $parent =~ /^parent (\w+)$/; push @commits, $1; } close($f); } if (!@ARGV) { print STDERR "Usage: git-find-blob blob [head ...]\n"; exit 1; } $blob = $ARGV[0]; if (@ARGV > 1) { foreach (@ARGV) { handle_commit($_); } } else { handle_commit("HEAD"); } while (@commits) { handle_commit(pop @commits); } 

今天晚上我回家时,我会把它放在github上。

更新:看起来有人已经这样做了 。 那个使用相同的总体思想,但细节是不同的,实现短得多。 我不知道哪个会更快,但是这里的performance可能不是问题!

更新2:对于它的价值,我的实现速度要快几个数量级,特别是对于一个大的存储库。 那个git ls-tree -r真的很痛。

更新3:我应该注意到,我上面的性能评论适用于我在上面第一个更新中链接的实现。 亚里士多德的执行与我的相当。 对于那些好奇的人来说更多的细节。

虽然原来的问题没有要求,我认为这也是有用的,以检查暂存区域,看是否引用一个blob。 我修改了原来的bash脚本来做到这一点,并发现在我的存储库中引用了一个损坏的blob:

 #!/bin/sh obj_name="$1" shift git ls-files --stage \ | if grep -q "$obj_name"; then echo Found in staging area. Run git ls-files --stage to see. fi git log "$@" --pretty=format:'%T %h %s' \ | while read tree commit subject ; do if git ls-tree -r $tree | grep -q "$obj_name" ; then echo $commit "$subject" fi done 

下面是一个作为对类似问题的答案而被抛光的脚本的细节 ,在这里你可以看到它的行动:

git-ls-dir的屏幕截图computing/git-ls-dir.png

所以…我需要find超过8GB大小的回购超过给定的限制的所有文件,超过108,000修订。 我修改了亚里士多德的perl脚本以及我写的一个ruby脚本来达到这个完整的解决scheme。

首先, git gc – 这样做是为了确保所有对象都在packfiles中 – 我们不扫描不在pack文件中的对象。

接下来运行这个脚本来查找CUTOFF_SIZE字节上的所有blob。 将输出捕获到像“large-blobs.log”这样的文件

 #!/usr/bin/env ruby require 'log4r' # The output of git verify-pack -v is: # SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1 # # GIT_PACKS_RELATIVE_PATH=File.join('.git', 'objects', 'pack', '*.pack') # 10MB cutoff CUTOFF_SIZE=1024*1024*10 #CUTOFF_SIZE=1024 begin include Log4r log = Logger.new 'git-find-large-objects' log.level = INFO log.outputters = Outputter.stdout git_dir = %x[ git rev-parse --show-toplevel ].chomp if git_dir.empty? log.fatal "ERROR: must be run in a git repository" exit 1 end log.debug "Git Dir: '#{git_dir}'" pack_files = Dir[File.join(git_dir, GIT_PACKS_RELATIVE_PATH)] log.debug "Git Packs: #{pack_files.to_s}" # For details on this IO, see http://stackoverflow.com/questions/1154846/continuously-read-from-stdout-of-external-process-in-ruby # # Short version is, git verify-pack flushes buffers only on line endings, so # this works, if it didn't, then we could get partial lines and be sad. types = { :blob => 1, :tree => 1, :commit => 1, } total_count = 0 counted_objects = 0 large_objects = [] IO.popen("git verify-pack -v -- #{pack_files.join(" ")}") do |pipe| pipe.each do |line| # The output of git verify-pack -v is: # SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1 data = line.chomp.split(' ') # types are blob, tree, or commit # we ignore other lines by looking for that next unless types[data[1].to_sym] == 1 log.info "INPUT_THREAD: Processing object #{data[0]} type #{data[1]} size #{data[2]}" hash = { :sha1 => data[0], :type => data[1], :size => data[2].to_i, } total_count += hash[:size] counted_objects += 1 if hash[:size] > CUTOFF_SIZE large_objects.push hash end end end log.info "Input complete" log.info "Counted #{counted_objects} totalling #{total_count} bytes." log.info "Sorting" large_objects.sort! { |a,b| b[:size] <=> a[:size] } log.info "Sorting complete" large_objects.each do |obj| log.info "#{obj[:sha1]} #{obj[:type]} #{obj[:size]}" end exit 0 end 

接下来,编辑文件以删除任何不等待的blob,并删除顶部的INPUT_THREAD位。 一旦你只有你想find的sha1s行,运行如下脚本:

 cat edited-large-files.log | cut -d' ' -f4 | xargs git-find-blob | tee large-file-paths.log 

下面是git-find-blob脚本。

 #!/usr/bin/perl # taken from: http://stackoverflow.com/questions/223678/which-commit-has-this-blob # and modified by Carl Myers <cmyers@cmyers.org> to scan multiple blobs at once # Also, modified to keep the discovered filenames # vi: ft=perl use 5.008; use strict; use Memoize; use Data::Dumper; my $BLOBS = {}; MAIN: { memoize 'check_tree'; die "usage: git-find-blob <blob1> <blob2> ... -- [<git-log arguments ...>]\n" if not @ARGV; while ( @ARGV && $ARGV[0] ne '--' ) { my $arg = $ARGV[0]; #print "Processing argument $arg\n"; open my $rev_parse, '-|', git => 'rev-parse' => '--verify', $arg or die "Couldn't open pipe to git-rev-parse: $!\n"; my $obj_name = <$rev_parse>; close $rev_parse or die "Couldn't expand passed blob.\n"; chomp $obj_name; #$obj_name eq $ARGV[0] or print "($ARGV[0] expands to $obj_name)\n"; print "($arg expands to $obj_name)\n"; $BLOBS->{$obj_name} = $arg; shift @ARGV; } shift @ARGV; # drop the -- if present #print "BLOBS: " . Dumper($BLOBS) . "\n"; foreach my $blob ( keys %{$BLOBS} ) { #print "Printing results for blob $blob:\n"; open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %s' or die "Couldn't open pipe to git-log: $!\n"; while ( <$log> ) { chomp; my ( $tree, $commit, $subject ) = split " ", $_, 3; #print "Checking tree $tree\n"; my $results = check_tree( $tree ); #print "RESULTS: " . Dumper($results); if (%{$results}) { print "$commit $subject\n"; foreach my $blob ( keys %{$results} ) { print "\t" . (join ", ", @{$results->{$blob}}) . "\n"; } } } } } sub check_tree { my ( $tree ) = @_; #print "Calculating hits for tree $tree\n"; my @subtree; # results = { BLOB => [ FILENAME1 ] } my $results = {}; { open my $ls_tree, '-|', git => 'ls-tree' => $tree or die "Couldn't open pipe to git-ls-tree: $!\n"; # example git ls-tree output: # 100644 blob 15d408e386400ee58e8695417fbe0f858f3ed424 filaname.txt while ( <$ls_tree> ) { /\A[0-7]{6} (\S+) (\S+)\s+(.*)/ or die "unexpected git-ls-tree output"; #print "Scanning line '$_' tree $2 file $3\n"; foreach my $blob ( keys %{$BLOBS} ) { if ( $2 eq $blob ) { print "Found $blob in $tree:$3\n"; push @{$results->{$blob}}, $3; } } push @subtree, [$2, $3] if $1 eq 'tree'; } } foreach my $st ( @subtree ) { # $st->[0] is tree, $st->[1] is dirname my $st_result = check_tree( $st->[0] ); foreach my $blob ( keys %{$st_result} ) { foreach my $filename ( @{$st_result->{$blob}} ) { my $path = $st->[1] . '/' . $filename; #print "Generating subdir path $path\n"; push @{$results->{$blob}}, $path; } } } #print "Returning results for tree $tree: " . Dumper($results) . "\n\n"; return $results; } 

输出结果如下所示:

 <hash prefix> <oneline log message> path/to/file.txt path/to/file2.txt ... <hash prefix2> <oneline log msg...> 

等等。 每个包含树中大文件的提交都会被列出。 如果你grep出了以标签开始的行,并且uniq ,那么你将得到一个你可以过滤分支的所有path的列表,或者你可以做更复杂的事情。

让我重申:这个过程成功地运行了10GB的存储库,提交了10万8次。 我花了很长的时间比我预测的时候运行大量的斑点,但是,超过10个小时,我将不得不看看是否记忆位工作…