CSV parsing options with .NET

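I'm looking at my options for parsing delimited files (e.g. CSV, tab-separated, etc.) on the MS stack in general, and .NET specifically. The only technology I'm excluding is SSIS, because I already know it will not meet my needs.

So my options appear to be:

Regex.Split
TextFieldParser
OLEDB CSV parser

I have two criteria I must meet. First, given the following file, which contains two logical rows of data (and five physical rows altogether):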

    101, Bob, "Keeps his house ""clean"".
    Needs to work on laundry."
    102, Amy, "Brilliant.
    Driven.
    Diligent."

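The parsed results must yield two logical "rows", each consisting of three strings (or columns). The third string in each row must preserve its newlines! In other words, the parser must recognize when a row "continues" onto the next physical line because of an "unclosed" text qualifier.

The second criterion is that the delimiter and the text qualifier must be configurable, per file. Here are two strings, taken from different files, that I must be able to parse: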

    var first = @"""This"",""Is,A,Record"",""That """"Cannot"""", they say,"","""",,""be"",rightly,""parsed"",at all";
    var second = @"~This~|~Is|A|Record~|~ThatCannot~|~be~|~parsed~|at all";

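A proper parsing of string "first" would be:

This
Is,A,Record
That "Cannot", they say,
_
_
be
rightly
parsed
at all

The '_' simply means that a blank was captured. I don't want a literal underscore to appear.

One important assumption can be made about the flat files to be parsed: each file will have a fixed number of columns.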

Now for the technical options.

Regex

First, many responders comment that regex "is not the best way" to achieve this goal. I did, however, find a commenter who offered an excellent CSV regex:

    var regex = @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))";
    Regex.Split(first, regex).Dump(); // Dump() is LINQPad's output helper

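The results, applied to string "first", are quite wonderful:

"This"
"Is,A,Record"
"That ""Cannot"", they say,"
""
_
"be"
rightly
"parsed"
at all

It would be nice if the quotes were cleaned up, but I can easily deal with that as a post-processing step. Otherwise, this approach can be used to parse both sample strings, "first" and "second", provided the regex is modified for the tilde and pipe symbols accordingly. Excellent!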
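As a rough sketch of that idea, the same lookahead pattern can be built for any single-character delimiter and text qualifier, with the quote cleanup done as a post-processing step. The class and method names here (DelimitedSplit, BuildSplitter, Unquote, Split) are made up for illustration. It assumes the input is one complete logical row, so it does nothing for the multi-line case.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: names are illustrative, not from any library.
public static class DelimitedSplit
{
    // Emits a \uXXXX escape so any character is safe both inside and
    // outside a regex character class.
    private static string Lit(char c)
    {
        return "\\u" + ((int)c).ToString("x4");
    }

    // Same pattern as the CSV regex, parameterized: split on a delimiter
    // only when an even number of qualifiers follows it.
    public static Regex BuildSplitter(char delimiter, char qualifier)
    {
        string d = Lit(delimiter);
        string q = Lit(qualifier);
        string pattern =
            d + "(?=(?:[^" + q + "]*" + q + "[^" + q + "]*" + q + ")*" +
            "(?![^" + q + "]*" + q + "))";
        return new Regex(pattern);
    }

    // Post-processing: strip the outer qualifiers of a qualified field
    // and collapse doubled qualifiers into one.
    public static string Unquote(string field, char qualifier)
    {
        if (field.Length >= 2 &&
            field[0] == qualifier &&
            field[field.Length - 1] == qualifier)
        {
            string q = qualifier.ToString();
            return field.Substring(1, field.Length - 2).Replace(q + q, q);
        }
        return field;
    }

    // Splits one complete logical row into cleaned fields.
    public static string[] Split(string line, char delimiter, char qualifier)
    {
        return BuildSplitter(delimiter, qualifier)
            .Split(line)
            .Select(f => Unquote(f, qualifier))
            .ToArray();
    }
}
```

With this, string "first" would be split with `DelimitedSplit.Split(first, ',', '"')` and string "second" with `DelimitedSplit.Split(second, '|', '~')`.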

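But the real problem is the multi-line criterion. Before a regex can be applied to a string, I must read the full logical "row" from the file. Unfortunately, I don't know how many physical lines to read to complete the logical row unless I already have a regex or state machine running.

So this becomes a "chicken and egg" problem. My best option would be to read the entire file into memory as one giant string and let the regex sort out the multiple lines (I didn't check whether the regex above can handle that). With a 10 GB file, this could be a bit precarious.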
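One way around the chicken-and-egg problem is to skip physical lines altogether and run a small state machine over the characters as they are read, so a logical row is complete whenever a newline is seen outside an open qualifier. The following is a minimal sketch under that assumption, not a hardened parser. DelimitedReader and ReadRows are hypothetical names. Whitespace around fields is kept verbatim, so the sample file's second and third columns come back with a leading space unless the caller trims them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: names are illustrative, not from any library.
public static class DelimitedReader
{
    // Lazily yields one logical row at a time, reading character by
    // character so the whole file never has to sit in memory.
    public static IEnumerable<string[]> ReadRows(
        TextReader reader, char delimiter, char qualifier)
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        bool inQualifier = false;   // inside an open text qualifier
        bool sawQualifier = false;  // last char was a qualifier inside one
        bool rowHasData = false;    // current row has at least one char
        bool lastWasCR = false;     // used to treat CRLF as one newline
        int c;

        while ((c = reader.Read()) != -1)
        {
            char ch = (char)c;

            if (sawQualifier)
            {
                sawQualifier = false;
                if (ch == qualifier)
                {
                    field.Append(qualifier);  // doubled qualifier = literal
                    continue;
                }
                inQualifier = false;          // it was the closing qualifier
            }

            if (inQualifier)
            {
                if (ch == qualifier) sawQualifier = true;
                else field.Append(ch);        // keeps delimiters and newlines
                continue;
            }

            bool skipLF = lastWasCR && ch == '\n';
            lastWasCR = ch == '\r';
            if (skipLF) continue;             // second half of a CRLF pair

            if (ch == qualifier)
            {
                inQualifier = true;
                rowHasData = true;
            }
            else if (ch == delimiter)
            {
                fields.Add(field.ToString());
                field.Clear();
                rowHasData = true;
            }
            else if (ch == '\r' || ch == '\n')
            {
                if (rowHasData)               // blank lines are skipped
                {
                    fields.Add(field.ToString());
                    field.Clear();
                    yield return fields.ToArray();
                    fields.Clear();
                    rowHasData = false;
                }
            }
            else
            {
                field.Append(ch);
                rowHasData = true;
            }
        }

        if (rowHasData)                       // last row without a newline
        {
            fields.Add(field.ToString());
            yield return fields.ToArray();
        }
    }
}
```

Passing a StreamReader over the file, e.g. `DelimitedReader.ReadRows(File.OpenText(path), '|', '~')` with `path` standing in for the real file, should keep memory use proportional to the largest single row rather than to the file size.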

On to the next option.

TextFieldParser

Three lines of code will make the problem with this option apparent:

    var reader = new Microsoft.VisualBasic.FileIO.TextFieldParser(stream);
    reader.Delimiters = new string[] { @"|" };
    reader.HasFieldsEnclosedInQuotes = true;

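The Delimiters configuration looks good. However, "HasFieldsEnclosedInQuotes" is "game over". I'm stunned that the delimiters are arbitrarily configurable, yet I have no qualifier option other than quotation marks. Remember, I need to configure the text qualifier. So again, unless someone knows a TextFieldParser configuration trick, this is game over.

OLEDB

A colleague tells me this option has two major failings. First, it has terrible performance on large (e.g. 10 GB) files. Second, or so I'm told, it guesses the data types of the input data rather than letting you specify them. Not good.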

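Help

So I'd like to know which facts I got wrong (if any), and which other options I missed. Perhaps someone knows a way to jury-rig TextFieldParser to use an arbitrary text qualifier. And maybe OLEDB has resolved the stated issues (or perhaps never had them?).

What say you?

Did you try searching for an existing .NET CSV parser? This one claims to handle multi-line records significantly faster than OLEDB.

I wrote this a while back as a lightweight, standalone CSV parser. I believe it meets all of your requirements. Give it a try, with the knowledge that it probably isn't bulletproof.

If it does work for you, feel free to change the namespace and use it without restriction.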

    namespace NFC.Portability
    {
        using System;
        using System.Collections.Generic;
        using System.Data;
        using System.IO;
        using System.Linq;
        using System.Text;

        /// <summary>
        /// Loads and reads a file with comma-separated values into a tabular format.
        /// </summary>
        /// <remarks>
        /// Parsing assumes that the first line will always contain headers and that values will be double-quoted to escape double quotes and commas.
        /// Uses pointers, so it must be compiled with unsafe code allowed (/unsafe).
        /// </remarks>
        public unsafe class CsvReader
        {
            private const char SEGMENT_DELIMITER = ',';
            private const char DOUBLE_QUOTE = '"';
            private const char CARRIAGE_RETURN = '\r';
            private const char NEW_LINE = '\n';

            private DataTable _table = new DataTable();

            /// <summary>
            /// Gets the data contained by the instance in a tabular format.
            /// </summary>
            public DataTable Table
            {
                get
                {
                    // validation logic could be added here to ensure that the object isn't in an invalid state
                    return _table;
                }
            }

            /// <summary>
            /// Creates a new instance of <c>CsvReader</c>.
            /// </summary>
            /// <param name="path">The fully-qualified path to the file from which the instance will be populated.</param>
            public CsvReader( string path )
            {
                if( path == null )
                {
                    throw new ArgumentNullException( "path" );
                }

                // the stream is disposed by the StreamReader in Read
                FileStream fs = new FileStream( path, FileMode.Open );
                Read( fs );
            }

            /// <summary>
            /// Creates a new instance of <c>CsvReader</c>.
            /// </summary>
            /// <param name="stream">The stream from which the instance will be populated.</param>
            public CsvReader( Stream stream )
            {
                if( stream == null )
                {
                    throw new ArgumentNullException( "stream" );
                }

                Read( stream );
            }

            /// <summary>
            /// Creates a new instance of <c>CsvReader</c>.
            /// </summary>
            /// <param name="bytes">The array of bytes from which the instance will be populated.</param>
            public CsvReader( byte[] bytes )
            {
                if( bytes == null )
                {
                    throw new ArgumentNullException( "bytes" );
                }

                MemoryStream ms = new MemoryStream();
                ms.Write( bytes, 0, bytes.Length );
                ms.Position = 0;
                Read( ms );
            }

            private void Read( Stream s )
            {
                string lines;

                using( StreamReader sr = new StreamReader( s ) )
                {
                    lines = sr.ReadToEnd();
                }

                if( string.IsNullOrWhiteSpace( lines ) )
                {
                    throw new InvalidOperationException( "Data source cannot be empty." );
                }

                bool inQuotes = false;
                int lineNumber = 0;
                StringBuilder buffer = new StringBuilder( 128 );
                List<string> values = new List<string>();

                Action endSegment = () =>
                {
                    values.Add( buffer.ToString() );
                    buffer.Clear();
                };

                Action endLine = () =>
                {
                    // the first line holds the column headers
                    if( lineNumber == 0 )
                    {
                        CreateColumns( values );
                    }
                    else
                    {
                        CreateRow( values );
                    }

                    values.Clear();
                    lineNumber++;
                };

                fixed( char* pStart = lines )
                {
                    char* pChar = pStart;
                    char* pEnd = pStart + lines.Length;

                    while( pChar < pEnd ) // leave null terminator out
                    {
                        if( *pChar == DOUBLE_QUOTE )
                        {
                            if( inQuotes )
                            {
                                if( Peek( pChar, pEnd ) == SEGMENT_DELIMITER )
                                {
                                    endSegment();
                                    pChar++;
                                }
                                else if( !ApproachingNewLine( pChar, pEnd ) )
                                {
                                    buffer.Append( DOUBLE_QUOTE );
                                }
                            }

                            inQuotes = !inQuotes;
                        }
                        else if( *pChar == SEGMENT_DELIMITER )
                        {
                            if( !inQuotes )
                            {
                                endSegment();
                            }
                            else
                            {
                                buffer.Append( SEGMENT_DELIMITER );
                            }
                        }
                        else if( AtNewLine( pChar, pEnd ) )
                        {
                            if( !inQuotes )
                            {
                                endSegment();
                                endLine();

                                // skip the '\n' of a "\r\n" pair only; a lone '\n' has nothing extra to skip
                                if( *pChar == CARRIAGE_RETURN )
                                {
                                    pChar++;
                                }
                            }
                            else
                            {
                                buffer.Append( *pChar );
                            }
                        }
                        else
                        {
                            buffer.Append( *pChar );
                        }

                        pChar++;
                    }
                }

                // append trailing values at the end of the file (including a single unterminated value)
                if( values.Count > 0 || buffer.Length > 0 )
                {
                    endSegment();
                    endLine();
                }
            }

            /// <summary>
            /// Returns the next character in the sequence but does not advance the pointer. Checks bounds.
            /// </summary>
            /// <param name="pChar">Pointer to current character.</param>
            /// <param name="pEnd">End of range to check.</param>
            /// <returns>
            /// Returns the next character in the sequence, or char.MinValue if range is exceeded.
            /// </returns>
            private char Peek( char* pChar, char* pEnd )
            {
                if( pChar < pEnd )
                {
                    return *( pChar + 1 );
                }

                return char.MinValue;
            }

            /// <summary>
            /// Determines if the current character represents a newline. This includes lookahead for two character newline delimiters.
            /// </summary>
            /// <param name="pChar"></param>
            /// <param name="pEnd"></param>
            /// <returns></returns>
            private bool AtNewLine( char* pChar, char* pEnd )
            {
                if( *pChar == NEW_LINE )
                {
                    return true;
                }

                if( *pChar == CARRIAGE_RETURN && Peek( pChar, pEnd ) == NEW_LINE )
                {
                    return true;
                }

                return false;
            }

            /// <summary>
            /// Determines if the next character represents a newline, or the start of a newline.
            /// </summary>
            /// <param name="pChar"></param>
            /// <param name="pEnd"></param>
            /// <returns></returns>
            private bool ApproachingNewLine( char* pChar, char* pEnd )
            {
                if( Peek( pChar, pEnd ) == CARRIAGE_RETURN || Peek( pChar, pEnd ) == NEW_LINE )
                {
                    // technically this cheats a little to avoid a two char peek by only checking for a carriage return or new line, not both in sequence
                    return true;
                }

                return false;
            }

            private void CreateColumns( List<string> columns )
            {
                foreach( string column in columns )
                {
                    DataColumn dc = new DataColumn( column );
                    _table.Columns.Add( dc );
                }
            }

            private void CreateRow( List<string> values )
            {
                if( values.Where( (o) => !string.IsNullOrWhiteSpace( o ) ).Count() == 0 )
                {
                    return; // ignore rows which have no content
                }

                DataRow dr = _table.NewRow();
                _table.Rows.Add( dr );

                for( int i = 0; i < values.Count; i++ )
                {
                    dr[i] = values[i];
                }
            }
        }
    }

Take a look at the code I posted in answer to this question:

https://stackoverflow.com/a/1544743/3043
