Can I import a CSV file and have the delimiter inferred automatically?

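I want to import two kinds of CSV files: some use ";" as the delimiter and others use ",". So far I have been switching between the following two lines: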

    reader = csv.reader(f, delimiter=';')

or

    reader = csv.reader(f, delimiter=',')

Is it possible to leave the delimiter unspecified and let the program work out the correct one?

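The solutions below (from Blender and sharth) seem to work for comma-separated files (generated with LibreOffice) but not for semicolon-separated files (generated with MS Office). Here are the first lines of one semicolon-separated file: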

    ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
    1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
    1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

The csv module documentation seems to recommend using the csv Sniffer for this problem.

They give the following example, which I have adapted to your case.

    with open('example.csv', 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,")
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)
        # ... process CSV file contents here ...

Let's try it out.

    [9:13am][wlynch@watermelon /tmp] cat example
    #!/usr/bin/env python
    import csv

    def parse(filename):
        with open(filename, 'rb') as csvfile:
            dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
            csvfile.seek(0)
            reader = csv.reader(csvfile, dialect)

            for line in reader:
                print line

    def main():
        print 'Comma Version:'
        parse('comma_separated.csv')

        print
        print 'Semicolon Version:'
        parse('semicolon_separated.csv')

        print
        print 'An example from the question (kingdom.csv)'
        parse('kingdom.csv')

    if __name__ == '__main__':
        main()

And our sample inputs:

    [9:13am][wlynch@watermelon /tmp] cat comma_separated.csv
    test,box,foo
    round,the,bend
    [9:13am][wlynch@watermelon /tmp] cat semicolon_separated.csv
    round;the;bend
    who;are;you
    [9:22am][wlynch@watermelon /tmp] cat kingdom.csv
    ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
    1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
    1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

If we run the example program:

    [9:14am][wlynch@watermelon /tmp] ./example
    Comma Version:
    ['test', 'box', 'foo']
    ['round', 'the', 'bend']

    Semicolon Version:
    ['round', 'the', 'bend']
    ['who', 'are', 'you']

    An example from the question (kingdom.csv)
    ['ReleveAnnee', 'ReleveMois', 'NoOrdre', 'TitreRMC', 'AdopCSRegleVote', 'AdopCSAbs', 'AdoptCSContre', 'NoCELEX', 'ProposAnnee', 'ProposChrono', 'ProposOrigine', 'NoUniqueAnnee', 'NoUniqueType', 'NoUniqueChrono', 'PropoSplittee', 'Suite2LecturePE', 'Council PATH', 'Notes']
    ['1999', '1', '1', '1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC', 'U', '', '', '31999D0083', '1998', '577', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']
    ['1999', '1', '2', '1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes', 'U', '', '', '31999D0081', '1998', '184', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']

It is probably also worth noting which version of Python I'm using.

    [9:20am][wlynch@watermelon /tmp] python -V
    Python 2.7.2
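For anyone on Python 3, here is a minimal sketch of the same Sniffer approach. It assumes text input (the Python 3 csv module works on str rather than bytes) and uses in-memory strings in place of the sample files above. `sniff_rows` is a hypothetical helper name for illustration, not something the csv module provides.

```python
import csv
import io


def sniff_rows(text, candidates=";,"):
    """Parse CSV text after letting csv.Sniffer choose among `candidates`."""
    # Sniffer.sniff() raises csv.Error if it cannot settle on a delimiter.
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    rows = list(csv.reader(io.StringIO(text), dialect))
    return dialect.delimiter, rows


# Same contents as comma_separated.csv and semicolon_separated.csv above.
comma_text = "test,box,foo\nround,the,bend\n"
semicolon_text = "round;the;bend\nwho;are;you\n"

for text in (comma_text, semicolon_text):
    delimiter, rows = sniff_rows(text)
    print(repr(delimiter), rows)
```

When reading from disk instead, the Python 3 csv documentation recommends opening the file with `newline=''`.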

In a project that handles both , (comma) and | (vertical bar) delimited CSV files, all of them well-formed, I tried the following (as given at https://docs.python.org/2/library/csv.html#csv.Sniffer):

    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',|')

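However, on a |-delimited file this raised the "Could not determine delimiter" exception. It seemed reasonable to guess that the sniffing heuristic works best when every line contains the same number of delimiters (not counting any that appear inside quotes), and a 1024-byte sample can end in the middle of a line. So instead of reading the first 1024 bytes of the file, I tried reading the first two lines in full: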

    temp_lines = csvfile.readline() + '\n' + csvfile.readline()
    dialect = csv.Sniffer().sniff(temp_lines, delimiters=',|')

So far, this has worked well for me.
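A rough Python 3 sketch of that idea, sniffing only a few complete lines and falling back to a default when the Sniffer gives up, might look like the following. The helper name `sniff_first_lines`, its `line_count` and `default` parameters, and the sample data are all assumptions made for illustration.

```python
import csv
import io
from itertools import islice


def sniff_first_lines(csvfile, candidates=",|", line_count=2, default=","):
    """Guess the delimiter from the first `line_count` complete lines."""
    # Iterating a text file yields whole lines, so the sample never ends
    # in the middle of a row the way a fixed-size read can.
    sample = "".join(islice(csvfile, line_count))
    csvfile.seek(0)  # rewind so the caller can parse from the start
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # The Sniffer could not decide; use the assumed default instead.
        return default


pipe_file = io.StringIO("id|name|city\n1|Ann|Oslo\n2|Bob|Rome\n")
delimiter = sniff_first_lines(pipe_file)
rows = list(csv.reader(pipe_file, delimiter=delimiter))
print(repr(delimiter), rows)
```

Whether a silent fallback is appropriate depends on the data; raising the error may be safer if a wrong guess would corrupt an import.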

To solve this problem, I wrote a function that reads the first line of the file (the header) and detects the delimiter from it.

    def detectDelimiter(csvFile):
        with open(csvFile, 'r') as myCsvfile:
            header = myCsvfile.readline()
            if header.find(";") != -1:
                return ";"
            if header.find(",") != -1:
                return ","
        # default delimiter (MS Office export)
        return ";"

If you are using DictReader, you can do it like this:

    #!/usr/bin/env python
    import csv

    def parse(filename):
        # Text mode: on Python 3, Sniffer.sniff() expects str, so 'rb' fails.
        with open(filename, newline='') as csvfile:
            dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
            csvfile.seek(0)
            reader = csv.DictReader(csvfile, dialect=dialect)

            for line in reader:
                print(line['ReleveAnnee'])

I used this with Python 3.5, and it worked.

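I don't think there can be a perfectly general solution to this (one reason I might use , as a delimiter is that some of my data fields need to be able to contain ;). A simple heuristic would be to read the first line (or more), count how many , and ; characters it contains (possibly ignoring those inside quotes, if whatever creates your .csv files quotes entries properly and consistently), and guess that the more frequent of the two is the correct delimiter.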
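A minimal sketch of that counting heuristic is below. It assumes fields are quoted with a double quote and that ties should go to the first candidate listed. The function name `guess_delimiter` and the sample lines are made up for illustration.

```python
def guess_delimiter(line, candidates=",;", quotechar='"'):
    """Return the candidate that occurs most often outside quoted fields."""
    counts = dict.fromkeys(candidates, 0)
    in_quotes = False
    for char in line:
        if char == quotechar:
            # Toggle on every quote character; an escaped quote ("")
            # toggles twice, so the state is unchanged afterwards.
            in_quotes = not in_quotes
        elif not in_quotes and char in counts:
            counts[char] += 1
    # max() returns the first maximal item, so ties favour the
    # candidate that appears first in `candidates`.
    return max(candidates, key=counts.get)


print(guess_delimiter('a;b;"c, d, e, f";g'))  # commas are all quoted
print(guess_delimiter('x,"y;z;w;v",u'))       # semicolons are all quoted
```

This looks at a single line only; counting over several lines would make the guess less sensitive to one unusual row.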