Python: Compare two CSV files and search for similar items

I have two CSV files that I'm trying to compare so I can get the matching items. The first file, hosts.csv, looks like this:

    Path    Filename    Size    Signature
    C:\     a.txt       14kb    012345
    D:\     b.txt       99kb    678910
    C:\     c.txt       44kb    111213

The second file, masterlist.csv, looks like this:

    Filename    Signature
    b.txt       678910
    x.txt       111213
    b.txt       777777
    c.txt       999999

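As you can see, the rows don't line up, and masterlist.csv is always larger than hosts.csv. The only part I want to search on is the signature. I know the comparison would look something like this: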

 hosts[3] == masterlist[1] 

I'm looking for a solution that gives me something like the following (basically the hosts.csv file with a new RESULTS column):

    Path    Filename    Size    Signature    RESULTS
    C:\     a.txt       14kb    012345       NOT FOUND in masterlist
    D:\     b.txt       99kb    678910       FOUND in masterlist (row 1)
    C:\     c.txt       44kb    111213       FOUND in masterlist (row 2)

I've searched through the existing posts and found something similar here, but I don't quite understand it, as I'm still learning Python.

Edit: I'm using Python 2.6.

Edit: While my solution works correctly, please check Martijn's answer below for a more efficient solution.

You can find the documentation for the Python csv module here.

What you're looking for is something like this:

    import csv

    f1 = file('hosts.csv', 'r')
    f2 = file('masterlist.csv', 'r')
    f3 = file('results.csv', 'w')

    c1 = csv.reader(f1)
    c2 = csv.reader(f2)
    c3 = csv.writer(f3)

    masterlist = list(c2)

    for hosts_row in c1:
        row = 1
        found = False
        for master_row in masterlist:
            results_row = hosts_row
            if hosts_row[3] == master_row[1]:
                results_row.append('FOUND in master list (row ' + str(row) + ')')
                found = True
                break
            row = row + 1
        if not found:
            results_row.append('NOT FOUND in master list')
        c3.writerow(results_row)

    f1.close()
    f2.close()
    f3.close()

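srgerg's answer is very inefficient, because it runs in quadratic time: every row in hosts.csv triggers a scan of the whole master list. Here is a linear-time solution, using Python 2.6-compatible syntax: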

    import csv

    with open('masterlist.csv', 'rb') as master:
        # Map each signature to its row index (the header row is index 0).
        master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))

    with open('hosts.csv', 'rb') as hosts:
        with open('results.csv', 'wb') as results:
            reader = csv.reader(hosts)
            writer = csv.writer(results)

            # Copy the header row and add the new column name.
            writer.writerow(next(reader, []) + ['RESULTS'])

            for row in reader:
                index = master_indices.get(row[3])
                if index is not None:
                    # Python 2.6 needs explicit field numbers ({0}, not {}).
                    message = 'FOUND in master list (row {0})'.format(index)
                else:
                    message = 'NOT FOUND in master list'
                writer.writerow(row + [message])

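This first builds a dictionary that maps the signatures in masterlist.csv to their row numbers. Lookups in a dictionary take constant time, which makes the second loop, over the hosts.csv rows, independent of the number of rows in masterlist.csv. Not to mention that the code is a lot simpler.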
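To make the dictionary lookup concrete, here is a minimal sketch of the same idea written for Python 3 and run against in-memory data instead of files. The function name `annotate_hosts` and the comma-separated sample data are assumptions made for illustration; they are not part of the answer above.

```python
import csv
import io


def annotate_hosts(hosts_file, master_file, results_file):
    """Copy the hosts rows to results_file with an extra RESULTS column."""
    # Map each signature (second column) to its row index in the master list.
    # The header row gets index 0, so the first data row is row 1.
    # If a signature appears more than once, the last occurrence wins.
    master_indices = {
        row[1]: index
        for index, row in enumerate(csv.reader(master_file))
        if len(row) > 1  # skip blank or short lines
    }

    reader = csv.reader(hosts_file)
    writer = csv.writer(results_file, lineterminator="\n")
    writer.writerow(next(reader, []) + ["RESULTS"])

    for row in reader:
        index = master_indices.get(row[3])  # constant-time lookup
        if index is not None:
            message = "FOUND in master list (row {0})".format(index)
        else:
            message = "NOT FOUND in master list"
        writer.writerow(row + [message])


# Hypothetical sample data, comma-separated for illustration only.
HOSTS_CSV = (
    "Path,Filename,Size,Signature\n"
    "C:\\,a.txt,14kb,012345\n"
    "D:\\,b.txt,99kb,678910\n"
    "C:\\,c.txt,44kb,111213\n"
)
MASTER_CSV = (
    "Filename,Signature\n"
    "b.txt,678910\n"
    "x.txt,111213\n"
    "b.txt,777777\n"
    "c.txt,999999\n"
)

output = io.StringIO()
annotate_hosts(io.StringIO(HOSTS_CSV), io.StringIO(MASTER_CSV), output)
print(output.getvalue())
```

The master list is read exactly once, so the total work grows with the sum of the two file sizes rather than their product.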

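Python's csv and collections modules, OrderedDict in particular, are really helpful here. You want OrderedDict so that the keys keep their order; you don't have to use it, but it is useful! Note that OrderedDict was added in Python 2.7, so this needs a newer interpreter than the 2.6 mentioned in the question.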

    import csv
    from collections import OrderedDict

    signature_row_map = OrderedDict()

    with open('hosts.csv') as file_object:
        for line in csv.DictReader(file_object, delimiter='\t'):
            signature_row_map[line['Signature']] = {'line': line, 'found_at': None}

    with open('masterlist.csv') as file_object:
        for i, line in enumerate(csv.DictReader(file_object, delimiter='\t'), 1):
            if line['Signature'] in signature_row_map:
                signature_row_map[line['Signature']]['found_at'] = i

    with open('newhosts.csv', 'w') as file_object:
        fieldnames = ['Path', 'Filename', 'Size', 'Signature', 'RESULTS']
        writer = csv.DictWriter(file_object, fieldnames, delimiter='\t')
        writer.writer.writerow(fieldnames)
        for signature_info in signature_row_map.itervalues():
            result = '{0} FOUND in masterlist {1}'
            # explicit check for sentinel
            if signature_info['found_at'] is not None:
                result = result.format('', '(row %s)' % signature_info['found_at'])
            else:
                result = result.format('NOT', '')
            payload = signature_info['line']
            payload['RESULTS'] = result
            writer.writerow(payload)

Here is the output using the test CSV files:

    Path    Filename    Size    Signature    RESULTS
    C:\     a.txt       14kb    012345       NOT FOUND in masterlist
    D:\     b.txt       99kb    678910       FOUND in masterlist (row 1)
    C:\     c.txt       44kb    111213       FOUND in masterlist (row 2)

Please excuse the misalignment; the columns are tab-separated :)
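For readers on Python 3, the same DictReader/DictWriter approach can be sketched roughly as follows. This assumes Python 3.7 or later, where a plain dict keeps insertion order, and it uses in-memory tab-separated data; the helper names `annotate_with_dicts` and `tsv` are made up for this illustration.

```python
import csv
import io


def annotate_with_dicts(hosts_file, master_file, results_file):
    """Write the hosts rows plus a RESULTS column, keyed by signature."""
    # Assumption: both inputs are tab-separated and start with a header row.
    signature_row_map = {}  # plain dicts preserve insertion order in 3.7+
    for line in csv.DictReader(hosts_file, delimiter="\t"):
        signature_row_map[line["Signature"]] = {"line": line, "found_at": None}

    # Number the masterlist data rows from 1 (the header is consumed by
    # DictReader). A repeated signature keeps the last row number seen.
    for i, line in enumerate(csv.DictReader(master_file, delimiter="\t"), 1):
        entry = signature_row_map.get(line["Signature"])
        if entry is not None:
            entry["found_at"] = i

    fieldnames = ["Path", "Filename", "Size", "Signature", "RESULTS"]
    writer = csv.DictWriter(
        results_file, fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for info in signature_row_map.values():
        if info["found_at"] is not None:
            result = "FOUND in masterlist (row {0})".format(info["found_at"])
        else:
            result = "NOT FOUND in masterlist"
        payload = dict(info["line"])
        payload["RESULTS"] = result
        writer.writerow(payload)


def tsv(rows):
    """Join a list of rows into tab-separated text (illustration helper)."""
    return "".join("\t".join(row) + "\n" for row in rows)


hosts_text = tsv([
    ["Path", "Filename", "Size", "Signature"],
    ["C:\\", "a.txt", "14kb", "012345"],
    ["D:\\", "b.txt", "99kb", "678910"],
    ["C:\\", "c.txt", "44kb", "111213"],
])
master_text = tsv([
    ["Filename", "Signature"],
    ["b.txt", "678910"],
    ["x.txt", "111213"],
    ["b.txt", "777777"],
    ["c.txt", "999999"],
])

results = io.StringIO()
annotate_with_dicts(io.StringIO(hosts_text), io.StringIO(master_text), results)
print(results.getvalue())
```

Because the map is keyed by signature, two hosts rows that share a signature would collapse into one output row; that is the same trade-off the OrderedDict version makes.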

The csv module comes in handy when parsing CSV files. But just for fun, I'm simply splitting the input on whitespace to get the data.

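Just parse the data and build a dict for the data in masterlist.csv, with the signature as the key and the line number as the value. Then, for each row of hosts.csv, we can query the dict to find out whether a corresponding entry exists in masterlist.csv and, if so, on which line.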

    #! /usr/bin/env python

    def read_data(filename):
        input_source = open(filename, 'r')
        input_source.readline()
        return [line.split() for line in input_source]

    if __name__ == '__main__':
        hosts = read_data('hosts.csv')
        masterlist = read_data('masterlist.csv')
        master = dict()
        for index, data in enumerate(masterlist):
            master[data[-1]] = index + 1
        for row in hosts:
            try:
                found = "FOUND in masterlist (row %s)" % master[row[-1]]
            except KeyError:
                found = "NOT FOUND in masterlist"
            line = row + [found]
            print "%s %s %s %s %s" % tuple(line)
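As a rough Python 3 sketch of the same whitespace-splitting idea, the version below works on strings rather than files so it can be run as-is. The function names `read_rows` and `match_hosts` and the sample text are assumptions for illustration. Keep in mind that splitting on whitespace breaks as soon as a field, such as a path, contains a space.

```python
def read_rows(text):
    """Skip the header line and split every remaining line on whitespace."""
    lines = text.splitlines()[1:]
    return [line.split() for line in lines if line.strip()]


def match_hosts(hosts_text, master_text):
    """Return the hosts rows, each with a FOUND / NOT FOUND message appended."""
    # Signature (last field) -> 1-based line number within the master data.
    master = {}
    for index, data in enumerate(read_rows(master_text), 1):
        master[data[-1]] = index

    results = []
    for row in read_rows(hosts_text):
        signature = row[-1]
        if signature in master:
            found = "FOUND in masterlist (row %s)" % master[signature]
        else:
            found = "NOT FOUND in masterlist"
        results.append(row + [found])
    return results


# Hypothetical sample data mirroring the files in the question.
hosts_text = """Path Filename Size Signature
C:\\ a.txt 14kb 012345
D:\\ b.txt 99kb 678910
C:\\ c.txt 44kb 111213
"""
master_text = """Filename Signature
b.txt 678910
x.txt 111213
b.txt 777777
c.txt 999999
"""

rows = match_hosts(hosts_text, master_text)
for row in rows:
    print(" ".join(row))
```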