How can I split a text file using PowerShell?

I need to break a large (500 MB) text file (a log4net exception file) into manageable chunks, such as 100 files of 5 MB each.

I figured this would be a walk in the park for PowerShell. How can I do it?

This is a somewhat easy task for PowerShell, complicated by the fact that the standard Get-Content cmdlet doesn't handle very large files too well. What I would suggest is to use the .NET StreamReader class to read the file line by line in your PowerShell script, and use the Add-Content cmdlet to write each line to a file with an ever-increasing index in the filename. Something like this:

    # $rootName, $ext and $upperBound are not defined in the original
    # snippet; these are example values
    $rootName = "C:\Exceptions_"
    $ext = "log"
    $upperBound = 5MB

    $reader = new-object System.IO.StreamReader("C:\Exceptions.log")
    $count = 1
    $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
    while (($line = $reader.ReadLine()) -ne $null)
    {
        Add-Content -path $fileName -value $line
        if ((Get-ChildItem -path $fileName).Length -ge $upperBound)
        {
            ++$count
            $fileName = "{0}{1}.{2}" -f ($rootName, $count, $ext)
        }
    }
    $reader.Close()

A word of warning about some of the existing answers – they will run very slowly for really big files. I gave up on a 1.6 GB log file after a couple of hours, realizing it would not finish before I returned to work the next day.

Two problems: the call to Add-Content opens, seeks, and then closes the current destination file for every line in the source file. Reading a little of the source file each time and looking for the new lines also slows things down, but my guess is that Add-Content is the main culprit.
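For comparison, here is a minimal sketch of the same line-by-line loop with a single StreamWriter kept open per chunk instead of calling Add-Content for every line (the paths and the 5 MB bound are just example values):

    # Minimal sketch: one StreamWriter stays open per chunk, avoiding the
    # open/seek/close that Add-Content performs for every line.
    $reader = New-Object System.IO.StreamReader("C:\Exceptions.log")
    $upperBound = 5MB # assumed chunk size
    $count = 1
    $writer = New-Object System.IO.StreamWriter ("C:\Exceptions_{0}.log" -f $count)
    while (($line = $reader.ReadLine()) -ne $null)
    {
        $writer.WriteLine($line)
        # BaseStream.Length lags slightly because of write buffering,
        # but it is close enough for sizing chunks
        if ($writer.BaseStream.Length -ge $upperBound)
        {
            $writer.Close()
            ++$count
            $writer = New-Object System.IO.StreamWriter ("C:\Exceptions_{0}.log" -f $count)
        }
    }
    $writer.Close()
    $reader.Close()

Keeping the writer open removes the per-line file overhead while still splitting on line boundaries.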

The following variant produces slightly less pleasant output: it will split files in the middle of lines, but it splits my 1.6 GB log in less than a minute:

    $from = "C:\temp\large_log.txt"
    $rootName = "C:\temp\large_log_chunk"
    $ext = "txt"
    $upperBound = 100MB

    $fromFile = [io.file]::OpenRead($from)
    $buff = new-object byte[] $upperBound
    $count = $idx = 0
    try {
        do {
            "Reading $upperBound"
            $count = $fromFile.Read($buff, 0, $buff.Length)
            if ($count -gt 0) {
                $to = "{0}.{1}.{2}" -f ($rootName, $idx, $ext)
                $toFile = [io.file]::OpenWrite($to)
                try {
                    "Writing $count to $to"
                    $toFile.Write($buff, 0, $count)
                } finally {
                    $toFile.Close()
                }
            }
            $idx++
        } while ($count -gt 0)
    } finally {
        $fromFile.Close()
    }

A simple one-liner to split based on the number of lines (100 in this case):

    $i=0; Get-Content .....log -ReadCount 100 | %{$i++; $_ | Out-File out_$i.txt}

Same idea as the other answers here, but using StreamReader/StreamWriter to split on new lines (line by line, rather than trying to read the whole file into memory at once). This approach can split big files in the fastest way I know of.

Note: I do very little error checking, so I can't guarantee it will work smoothly for your case. It did for me (a 1.7 GB TXT file of 4 million lines, split into files of 100,000 lines each, in 95 seconds).

    # split test
    $sw = new-object System.Diagnostics.Stopwatch
    $sw.Start()

    $filename = "C:\Users\Vincent\Desktop\test.txt"
    $rootName = "C:\Users\Vincent\Desktop\result"
    $ext = "txt"

    $linesperFile = 100000 # 100k
    $filecount = 1
    $reader = $null
    try {
        $reader = [io.file]::OpenText($filename)
        try {
            "Creating file number $filecount"
            $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName, $filecount.ToString("000"), $ext))
            $filecount++
            $linecount = 0

            while ($reader.EndOfStream -ne $true) {
                "Reading $linesperFile"
                while (($linecount -lt $linesperFile) -and ($reader.EndOfStream -ne $true)) {
                    $writer.WriteLine($reader.ReadLine())
                    $linecount++
                }

                if ($reader.EndOfStream -ne $true) {
                    "Closing file"
                    $writer.Dispose()

                    "Creating file number $filecount"
                    $writer = [io.file]::CreateText("{0}{1}.{2}" -f ($rootName, $filecount.ToString("000"), $ext))
                    $filecount++
                    $linecount = 0
                }
            }
        } finally {
            $writer.Dispose()
        }
    } finally {
        $reader.Dispose()
    }

    $sw.Stop()
    Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

Output splitting a 1.7 GB file:

    ...
    Creating file number 45
    Reading 100000
    Closing file
    Creating file number 46
    Reading 100000
    Closing file
    Creating file number 47
    Reading 100000
    Closing file
    Creating file number 48
    Reading 100000
    Split complete in 95.6308289 seconds

I often need to do the same thing. The trick is getting the header repeated into each of the split chunks. I wrote the following cmdlet (PowerShell v2 CTP 3) and it does the trick.

    ##############################################################################
    #.SYNOPSIS
    # Breaks a text file into multiple text files in a destination, where each
    # file contains a maximum number of lines.
    #
    #.DESCRIPTION
    # When working with files that have a header, it is often desirable to have
    # the header information repeated in all of the split files. Split-File
    # supports this functionality with the -rc (RepeatCount) parameter.
    #
    #.PARAMETER Path
    # Specifies the path to an item. Wildcards are permitted.
    #
    #.PARAMETER LiteralPath
    # Specifies the path to an item. Unlike Path, the value of LiteralPath is
    # used exactly as it is typed. No characters are interpreted as wildcards.
    # If the path includes escape characters, enclose it in single quotation marks.
    # Single quotation marks tell Windows PowerShell not to interpret any
    # characters as escape sequences.
    #
    #.PARAMETER Destination
    # (Or -d) The location in which to place the chunked output files.
    #
    #.PARAMETER Count
    # (Or -c) The maximum number of lines in each file.
    #
    #.PARAMETER RepeatCount
    # (Or -rc) Specifies the number of "header" lines from the input file that will
    # be repeated in each output file. Typically this is 0 or 1 but it can be any
    # number of lines.
    #
    #.EXAMPLE
    # Split-File bigfile.csv 3000 -rc 1
    #
    #.LINK
    # Out-TempFile
    ##############################################################################
    function Split-File {

        [CmdletBinding(DefaultParameterSetName='Path')]
        param(
            [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$Path,

            [Alias("PSPath")]
            [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$LiteralPath,

            [Alias('c')]
            [Parameter(Position=2,Mandatory=$true)]
            [Int32]$Count,

            [Alias('d')]
            [Parameter(Position=3)]
            [String]$Destination='.',

            [Alias('rc')]
            [Parameter()]
            [Int32]$RepeatCount
        )

        process {

            # yeah! the cmdlet supports wildcards
            if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
            elseif ($Path) { $ResolveArgs = @{Path=$Path} }

            Resolve-Path @ResolveArgs | %{

                $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
                $InputExt = [IO.Path]::GetExtension($_)

                if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

                # get the input file in manageable chunks
                $Part = 1
                Get-Content $_ -ReadCount:$Count | %{

                    # make an output filename with a suffix
                    $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

                    # In the first iteration the header will be
                    # copied to the output file as usual
                    # on subsequent iterations we have to do it
                    if ($RepeatCount -and $Part -gt 1) {
                        Set-Content $OutputFile $Header
                    }

                    # write this chunk to the output file
                    Write-Host "Writing $OutputFile"
                    Add-Content $OutputFile $_

                    $Part += 1
                }
            }
        }
    }

I found this question while trying to split multiple contacts in a single vCard VCF file into separate files. Here's what I did based on Lee's code. I had to look up how to create a new StreamReader object, and I changed null to $null.

    $reader = new-object System.IO.StreamReader("C:\Contacts.vcf")
    $count = 1
    $fileName = "C:\Contacts\{0}.vcf" -f ($count)
    while (($line = $reader.ReadLine()) -ne $null)
    {
        Add-Content -path $fileName -value $line
        if ($line -eq "END:VCARD")
        {
            ++$count
            $fileName = "C:\Contacts\{0}.vcf" -f ($count)
        }
    }
    $reader.Close()

A lot of these answers were too slow for my source files. My source files were SQL files between 10 MB and 800 MB that needed to be split into files of roughly equal line counts.

I found some of the previous answers that use Add-Content to be quite slow. Waiting many hours for a split to finish wasn't uncommon.

I didn't try Typhlosaurus's answer, but it only splits by file size, not by line count.

The following suited my purposes.

    $sw = new-object System.Diagnostics.Stopwatch
    $sw.Start()

    Write-Host "Reading source file..."
    $lines = [System.IO.File]::ReadAllLines("C:\Temp\SplitTest\source.sql")
    $totalLines = $lines.Length

    Write-Host "Total Lines :" $totalLines

    $skip = 0
    $count = 100000; # Number of lines per file

    # File counter, with sort friendly name
    $fileNumber = 1
    $fileNumberString = $fileNumber.ToString("000")

    while ($skip -le $totalLines) {
        $upper = $skip + $count - 1
        if ($upper -gt ($lines.Length - 1)) {
            $upper = $lines.Length - 1
        }

        # Write the lines
        [System.IO.File]::WriteAllLines("C:\Temp\SplitTest\result$fileNumberString.txt", $lines[($skip..$upper)])

        # Increment counters
        $skip += $count
        $fileNumber++
        $fileNumberString = $fileNumber.ToString("000")
    }

    $sw.Stop()

    Write-Host "Split complete in " $sw.Elapsed.TotalSeconds "seconds"

For a 54 MB file, I get the following output...

    Reading source file...
    Total Lines : 910030
    Split complete in 1.7056578 seconds

Hopefully others looking for a simple, line-based splitting script that matches my requirements will find this useful.

There's also this quick (and somewhat dirty) one-liner:

    $linecount=0; $i=0; Get-Content .\BIG_LOG_FILE.txt | %{ Add-Content OUT$i.log "$_"; $linecount++; if ($linecount -eq 3000) {$i++; $linecount=0 } }

You can adjust the number of lines per batch by changing the hard-coded 3000 value.
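If you find yourself changing that value often, the one-liner can be wrapped in a small helper function; a quick sketch (the Split-ByLineCount name and its parameters are mine, purely illustrative):

    # Hypothetical wrapper around the one-liner above; the name and
    # parameters are illustrative, not an established cmdlet.
    function Split-ByLineCount {
        param(
            [string]$Path,
            [int]$LinesPerBatch = 3000,
            [string]$OutPrefix = "OUT"
        )
        $linecount = 0
        $i = 0
        Get-Content $Path | %{
            Add-Content "$OutPrefix$i.log" "$_"
            $linecount++
            if ($linecount -eq $LinesPerBatch) { $i++; $linecount = 0 }
        }
    }

    # For example:
    Split-ByLineCount -Path .\BIG_LOG_FILE.txt -LinesPerBatch 5000

Note that it inherits the per-line Add-Content cost discussed earlier, so it is only suitable for modestly sized files.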

I've made some modifications to split files based on the size of each part.

    ##############################################################################
    #.SYNOPSIS
    # Breaks a text file into multiple text files in a destination, where each
    # file has a maximum size.
    #
    #.DESCRIPTION
    # When working with files that have a header, it is often desirable to have
    # the header information repeated in all of the split files. Split-File
    # supports this functionality with the -rc (RepeatCount) parameter.
    #
    #.PARAMETER Path
    # Specifies the path to an item. Wildcards are permitted.
    #
    #.PARAMETER LiteralPath
    # Specifies the path to an item. Unlike Path, the value of LiteralPath is
    # used exactly as it is typed. No characters are interpreted as wildcards.
    # If the path includes escape characters, enclose it in single quotation marks.
    # Single quotation marks tell Windows PowerShell not to interpret any
    # characters as escape sequences.
    #
    #.PARAMETER Destination
    # (Or -d) The location in which to place the chunked output files.
    #
    #.PARAMETER Size
    # (Or -s) The maximum size of each file. Size must be expressed in MB.
    #
    #.PARAMETER RepeatCount
    # (Or -rc) Specifies the number of "header" lines from the input file that will
    # be repeated in each output file. Typically this is 0 or 1 but it can be any
    # number of lines.
    #
    #.EXAMPLE
    # Split-File bigfile.csv -s 20 -rc 1
    #
    #.LINK
    # Out-TempFile
    ##############################################################################
    function Split-File {

        [CmdletBinding(DefaultParameterSetName='Path')]
        param(
            [Parameter(ParameterSetName='Path', Position=1, Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$Path,

            [Alias("PSPath")]
            [Parameter(ParameterSetName='LiteralPath', Mandatory=$true, ValueFromPipelineByPropertyName=$true)]
            [String[]]$LiteralPath,

            [Alias('s')]
            [Parameter(Position=2,Mandatory=$true)]
            [Int32]$Size,

            [Alias('d')]
            [Parameter(Position=3)]
            [String]$Destination='.',

            [Alias('rc')]
            [Parameter()]
            [Int32]$RepeatCount
        )

        process {

            # yeah! the cmdlet supports wildcards
            if ($LiteralPath) { $ResolveArgs = @{LiteralPath=$LiteralPath} }
            elseif ($Path) { $ResolveArgs = @{Path=$Path} }

            Resolve-Path @ResolveArgs | %{

                $InputName = [IO.Path]::GetFileNameWithoutExtension($_)
                $InputExt = [IO.Path]::GetExtension($_)

                if ($RepeatCount) { $Header = Get-Content $_ -TotalCount:$RepeatCount }

                # read the input file line by line, accumulating a buffer
                $Part = 1
                $buffer = ""
                Get-Content $_ -ReadCount:1 | %{

                    # make an output filename with a suffix
                    $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))

                    # In the first iteration the header will be
                    # copied to the output file as usual
                    # on subsequent iterations we have to do it
                    if ($RepeatCount -and $Part -gt 1) {
                        Set-Content $OutputFile $Header
                    }

                    # test buffer size and dump data only if buffer is greater than size
                    if ($buffer.Length -gt ($Size * 1MB)) {
                        # write this chunk to the output file
                        Write-Host "Writing $OutputFile"
                        Add-Content $OutputFile $buffer
                        $Part += 1
                        # start the next buffer with the current line so it is not lost
                        $buffer = $_ + "`r"
                    } else {
                        $buffer += $_ + "`r"
                    }
                }

                # write any remaining buffered lines to a final part
                if ($buffer.Length -gt 0) {
                    $OutputFile = Join-Path $Destination ('{0}-{1:0000}{2}' -f ($InputName,$Part,$InputExt))
                    if ($RepeatCount -and $Part -gt 1) { Set-Content $OutputFile $Header }
                    Write-Host "Writing $OutputFile"
                    Add-Content $OutputFile $buffer
                }
            }
        }
    }

Do this:

File 1

    Get-Content C:\TEMP\DATA\split\splitme.txt | Select -First 5000 | out-File C:\temp\file1.txt -Encoding ASCII

File 2

    Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 5000 | Select -First 5000 | out-File C:\temp\file2.txt -Encoding ASCII

File 3

    Get-Content C:\TEMP\DATA\split\splitme.txt | Select -Skip 10000 | Select -First 5000 | out-File C:\temp\file3.txt -Encoding ASCII

And so on…
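If the number of chunks isn't known up front, the same Skip/First pattern can be generated in a loop; a sketch, assuming 5,000-line chunks and the same paths:

    # Sketch: loop the Select -Skip/-First pattern instead of writing
    # each file by hand; stops once a chunk comes back empty.
    $chunkSize = 5000
    $n = 1
    do {
        $chunk = Get-Content C:\TEMP\DATA\split\splitme.txt |
            Select -Skip (($n - 1) * $chunkSize) |
            Select -First $chunkSize
        if ($chunk) {
            $chunk | Out-File "C:\temp\file$n.txt" -Encoding ASCII
            $n++
        }
    } while ($chunk)

Like the manual version, this re-reads the source file once per chunk, so it gets slower as the file grows.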

My requirement was a bit different. I often work with comma-delimited and tab-delimited ASCII files where a single line is a single record of data. And they're really big, so I need to split them into manageable parts (while preserving the header row).

So, I reverted to my classic VBScript approach and bashed together a small .vbs script that can be run on any Windows computer (it gets automatically executed by the WScript.exe script host engine on Windows).

The benefit of this method is that it uses text streams, so the underlying data isn't loaded into memory (or at least not all at once). The result is that it's exceptionally fast and doesn't really take much memory to run. My test file, split using this script on my i7, was about 1 GB in size, had about 12 million lines of text, and was split into 25 part files (each with about 500k lines) – the processing took about 2 minutes and it never went over 3 MB of memory used at any point.

The caveat here is that it relies on the text file having "lines" (meaning each record is delimited with a CRLF), as the TextStream object uses the ReadLine function to process a single line at a time. But hey, if you're working with TSV or CSV files, it's perfect.

    Option Explicit

    Private Const INPUT_TEXT_FILE = "c:\bigtextfile.txt"
    Private Const REPEAT_HEADER_ROW = True
    Private Const LINES_PER_PART = 500000

    Dim oFileSystem, oInputFile, oOutputFile, iOutputFile, iLineCounter, sHeaderLine, sLine, sFileExt, sStart

    sStart = Now()

    sFileExt = Right(INPUT_TEXT_FILE, Len(INPUT_TEXT_FILE) - InstrRev(INPUT_TEXT_FILE, ".") + 1)
    iLineCounter = 0
    iOutputFile = 1

    Set oFileSystem = CreateObject("Scripting.FileSystemObject")
    Set oInputFile = oFileSystem.OpenTextFile(INPUT_TEXT_FILE, 1, False)
    Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)

    If REPEAT_HEADER_ROW Then
        iLineCounter = 1
        sHeaderLine = oInputFile.ReadLine()
        Call oOutputFile.WriteLine(sHeaderLine)
    End If

    Do While Not oInputFile.AtEndOfStream
        sLine = oInputFile.ReadLine()
        Call oOutputFile.WriteLine(sLine)
        iLineCounter = iLineCounter + 1
        If iLineCounter Mod LINES_PER_PART = 0 Then
            iOutputFile = iOutputFile + 1
            Call oOutputFile.Close()
            Set oOutputFile = oFileSystem.OpenTextFile(Replace(INPUT_TEXT_FILE, sFileExt, "_" & iOutputFile & sFileExt), 2, True)
            If REPEAT_HEADER_ROW Then
                Call oOutputFile.WriteLine(sHeaderLine)
            End If
        End If
    Loop

    Call oInputFile.Close()
    Call oOutputFile.Close()
    Set oFileSystem = Nothing

    Call MsgBox("Done" & vbCrLf & "Lines Processed:" & iLineCounter & vbCrLf & "Part Files: " & iOutputFile & vbCrLf & "Start Time: " & sStart & vbCrLf & "Finish Time: " & Now())

Sounds like a job for the UNIX command split:

    split MyBigFile.csv

It just split my 55 GB csv file into 21k chunks in less than 10 minutes.

It's not native to PowerShell, but it comes with, for example, the Git for Windows package: https://git-scm.com/download/win
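If the split on your PATH is the GNU coreutils build (the Git for Windows one is), you can also control the chunking directly; for example, to split by line count with sort-friendly numeric suffixes:

    # -l sets lines per chunk, -d requests numeric suffixes (00, 01, ...),
    # and the last argument is the output-file prefix
    split -l 100000 -d MyBigFile.csv MyBigFile_part_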

Since the lines in a log can vary in length, I thought it best to take a lines-per-file approach. The following code snippet processed a 4 million line log file in under 19 seconds (18.83 seconds), splitting it into 500,000-line chunks:

    $sourceFile = "c:\myfolder\mylargeTextyFile.csv"
    $partNumber = 1
    $batchSize = 500000
    $pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"

    [System.Text.Encoding]$enc = [System.Text.Encoding]::GetEncoding(65001) # utf8 this one

    $fs = New-Object System.IO.FileStream ($sourceFile, "OpenOrCreate", "Read", "ReadWrite", 8, "None")
    $streamIn = New-Object System.IO.StreamReader($fs, $enc)
    $streamOut = New-Object System.IO.StreamWriter $pathAndFilename

    $line = $streamIn.ReadLine()
    $counter = 0
    while ($line -ne $null) {
        $streamOut.WriteLine($line)
        $counter += 1
        if ($counter -eq $batchSize) {
            $partNumber += 1
            $counter = 0
            $streamOut.Close()
            $pathAndFilename = "c:\myfolder\mylargeTextyFile part $partNumber file.csv"
            $streamOut = New-Object System.IO.StreamWriter $pathAndFilename
        }
        $line = $streamIn.ReadLine()
    }
    $streamIn.Close()
    $streamOut.Close()

This can easily be turned into a function or a script file with parameters to make it more versatile. It uses a StreamReader and a StreamWriter to achieve its speed and tiny memory footprint.
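As a sketch of that suggestion, the loop above could be wrapped like this (the Split-TextFile name and its parameters are mine, just for illustration):

    # Hypothetical parameterized wrapper around the StreamReader/StreamWriter
    # loop above; the function name and parameters are illustrative.
    function Split-TextFile {
        param(
            [string]$SourceFile,
            [int]$BatchSize = 500000
        )
        $dir  = [System.IO.Path]::GetDirectoryName($SourceFile)
        $base = [System.IO.Path]::GetFileNameWithoutExtension($SourceFile)
        $ext  = [System.IO.Path]::GetExtension($SourceFile)

        $streamIn   = New-Object System.IO.StreamReader $SourceFile
        $partNumber = 1
        $streamOut  = New-Object System.IO.StreamWriter (Join-Path $dir ("{0} part {1} file{2}" -f $base, $partNumber, $ext))
        $counter    = 0

        while (($line = $streamIn.ReadLine()) -ne $null) {
            $streamOut.WriteLine($line)
            $counter++
            if ($counter -eq $BatchSize) {
                # close the current part and open the next one
                $counter = 0
                $streamOut.Close()
                $partNumber++
                $streamOut = New-Object System.IO.StreamWriter (Join-Path $dir ("{0} part {1} file{2}" -f $base, $partNumber, $ext))
            }
        }
        $streamIn.Close()
        $streamOut.Close()
    }

    # For example:
    Split-TextFile -SourceFile "c:\myfolder\mylargeTextyFile.csv" -BatchSize 500000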

Here's my solution to split a file called patch6.txt (about 32,000 lines) into separate files of 1,000 lines each. It isn't quick, but it does the job.

    $infile = "D:\Malcolm\Test\patch6.txt"
    $path = "D:\Malcolm\Test\"
    $lineCount = 1
    $fileCount = 1

    foreach ($computername in get-content $infile)
    {
        # the output path must be one quoted string: an unquoted $path_
        # would be parsed as a variable named "path_"
        write $computername | out-file -Append "$path$fileCount.txt"
        $lineCount++

        if ($lineCount -eq 1000)
        {
            $fileCount++
            $lineCount = 1
        }
    }