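Remove duplicate rows in MySQL

I have a table with the following fields:

id (Unique), url (Unique), title, company, site_id

Now I need to remove rows that have the same title, company and site_id. One way to do it is to use the following SQL along with a script (PHP):

    SELECT title, site_id, location, id, COUNT(*)
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING COUNT(*) > 1

After running this query, I can remove the duplicates using a server-side script. But I would like to know whether this can be done using only an SQL query.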

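A really easy way to do this is to add a UNIQUE index on the 3 columns. When you write the ALTER statement, include the IGNORE keyword. Like so:

    ALTER IGNORE TABLE jobs ADD UNIQUE INDEX idx_name (site_id, title, company);

This will drop all the duplicate rows. As an added benefit, future INSERTs that are duplicates will error out. As always, you may want to take a backup before running something like this. Note that ALTER IGNORE was deprecated in MySQL 5.6.17 and removed in MySQL 5.7.4, so this only works on older versions.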

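If you don't want to alter the column properties, then you can use the query below.

Since you have a column with unique IDs (for example an auto_increment column), you can use it to remove the duplicates:

    -- deletes every row that has a matching row with a higher id,
    -- so the row with the highest id in each group is kept
    DELETE a
    FROM jobs AS a, jobs AS b
    WHERE (a.title = b.title OR a.title IS NULL AND b.title IS NULL)
      AND (a.company = b.company OR a.company IS NULL AND b.company IS NULL)
      AND (a.site_id = b.site_id OR a.site_id IS NULL AND b.site_id IS NULL)
      AND a.ID < b.ID;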
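To see what the NULL-aware comparison above does without touching a real table, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. This is only an illustration under assumptions: SQLite has no multi-table DELETE, so the self-join is written as an EXISTS subquery; SQLite's IS operator plays the role of the "equal, or both NULL" pairs; and the sample rows are made up.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption: only the
# comparison logic is illustrated, not MySQL's multi-table DELETE syntax).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, title TEXT, company TEXT, site_id INTEGER)"
)
conn.executemany(
    "INSERT INTO jobs (id, title, company, site_id) VALUES (?, ?, ?, ?)",
    [
        (1, "Dev", "Acme", 10),
        (2, "Dev", "Acme", 10),  # duplicate of id 1
        (3, "Dev", None, 10),
        (4, "Dev", None, 10),    # duplicate of id 3 (company is NULL in both)
        (5, "Ops", "Acme", 10),  # unique
    ],
)

# Delete every row for which a matching row with a higher id exists.
# "x IS y" in SQLite is true when both sides are equal or both are NULL.
conn.execute(
    """
    DELETE FROM jobs
    WHERE EXISTS (
        SELECT 1 FROM jobs AS b
        WHERE b.title IS jobs.title
          AND b.company IS jobs.company
          AND b.site_id IS jobs.site_id
          AND b.id > jobs.id
    )
    """
)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM jobs ORDER BY id")]
print(remaining_ids)
```

With a plain `=` comparison the two rows with a NULL company would not match each other and both would survive, which is why the query spells out the IS NULL cases.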

There are many different ways to do this; which one is best depends on your table and column properties.

MySQL has restrictions on referring to the table you are deleting from. You can work around that with a temporary table, like:

    CREATE TEMPORARY TABLE tmpTable (id INT);

    -- collect the ids of all rows that have a duplicate with a higher id
    INSERT tmpTable (id)
    SELECT id
    FROM YourTable yt
    WHERE EXISTS (
        SELECT *
        FROM YourTable yt2
        WHERE yt2.title = yt.title
          AND yt2.company = yt.company
          AND yt2.site_id = yt.site_id
          AND yt2.id > yt.id
    );

    DELETE FROM YourTable
    WHERE ID IN (SELECT id FROM tmpTable);

From Kostanos' suggestion in the comments:
The only slow query above is the DELETE, in cases where you have a very large database. This query could be faster:

    DELETE FROM YourTable
    USING YourTable, tmpTable
    WHERE YourTable.id = tmpTable.id
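One thing the temporary-table route gives you is a chance to look at the ids that are about to be removed before anything is deleted. A minimal sketch of the two steps, assuming SQLite (through Python's standard sqlite3 module) as a stand-in for MySQL and some made-up sample rows; the table and column names follow the answer above:

```python
import sqlite3

# SQLite in memory is an assumption used only to try the logic locally.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable ("
    "id INTEGER PRIMARY KEY, title TEXT, company TEXT, site_id INTEGER)"
)
conn.executemany(
    "INSERT INTO YourTable (id, title, company, site_id) VALUES (?, ?, ?, ?)",
    [
        (1, "A", "X", 1),
        (2, "A", "X", 1),  # duplicate of id 1
        (3, "A", "X", 1),  # duplicate of id 1
        (4, "B", "X", 1),
        (5, "B", "Y", 1),  # differs from id 4 in company
    ],
)

# Step 1: collect the ids of rows that have a duplicate with a higher id.
conn.execute("CREATE TEMPORARY TABLE tmpTable (id INTEGER)")
conn.execute(
    """
    INSERT INTO tmpTable (id)
    SELECT yt.id
    FROM YourTable AS yt
    WHERE EXISTS (
        SELECT 1 FROM YourTable AS yt2
        WHERE yt2.title = yt.title
          AND yt2.company = yt.company
          AND yt2.site_id = yt.site_id
          AND yt2.id > yt.id
    )
    """
)

# The list can be inspected here before anything is deleted.
doomed_ids = [row[0] for row in conn.execute("SELECT id FROM tmpTable ORDER BY id")]

# Step 2: delete exactly those ids.
conn.execute("DELETE FROM YourTable WHERE id IN (SELECT id FROM tmpTable)")
kept_ids = [row[0] for row in conn.execute("SELECT id FROM YourTable ORDER BY id")]
print(doomed_ids, kept_ids)
```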

If the IGNORE statement won't work, as in my case, you can use:

    CREATE TABLE your_table_deduped LIKE your_table;

    -- SELECT * with GROUP BY is rejected when ONLY_FULL_GROUP_BY is enabled (the default since MySQL 5.7.5)
    INSERT your_table_deduped
    SELECT * FROM your_table
    GROUP BY index1_id, index2_id;

    RENAME TABLE your_table TO your_table_with_dupes;
    RENAME TABLE your_table_deduped TO your_table;

    #OPTIONAL
    ALTER TABLE `your_table` ADD UNIQUE `unique_index` (`index1_id`, `index2_id`);

    #OPTIONAL
    DROP TABLE your_table_with_dupes;

Here is another solution:

    DELETE t1
    FROM my_table t1, my_table t2
    WHERE t1.id < t2.id
      AND t1.my_field = t2.my_field
      AND t1.my_field_2 = t2.my_field_2
      AND ...

I have this query snippet for SQL Server, but I think it can be used in other DBMSs with few changes:

    DELETE FROM Table
    WHERE Table.idTable IN (
        SELECT MAX(idTable)
        FROM Table
        GROUP BY field1, field2, field3
        HAVING COUNT(*) > 1
    )

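I forgot to mention that this query doesn't remove the row with the lowest id among the duplicated rows. If that works for you, try this query:

    DELETE FROM jobs
    WHERE jobs.id IN (
        SELECT MAX(id)
        FROM jobs
        GROUP BY site_id, company, title, location
        HAVING COUNT(*) > 1
    )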
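Worth noting: MAX(id) yields one id per duplicated group, so a single run removes only one row from each group. A group of three or more needs the statement repeated until it affects no rows. Here is a small sketch of that loop, assuming SQLite via Python's sqlite3 module as a stand-in and invented sample data. On MySQL itself the statement as written will likely run into the restriction on selecting from the table being deleted from, so take this purely as an illustration of the logic.

```python
import sqlite3

# SQLite in memory is an assumption; the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, site_id INTEGER, company TEXT, "
    "title TEXT, location TEXT)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, "Acme", "Dev", "Paris"),
        (2, 10, "Acme", "Dev", "Paris"),  # duplicate of id 1
        (3, 10, "Acme", "Dev", "Paris"),  # duplicate of id 1
        (4, 10, "Acme", "Ops", "Paris"),  # unique
    ],
)

DELETE_HIGHEST_DUPLICATE = """
    DELETE FROM jobs
    WHERE id IN (
        SELECT MAX(id)
        FROM jobs
        GROUP BY site_id, company, title, location
        HAVING COUNT(*) > 1
    )
"""

# Repeat until a pass deletes nothing; record how many rows each pass removed.
rows_deleted_per_pass = []
while True:
    cursor = conn.execute(DELETE_HIGHEST_DUPLICATE)
    if cursor.rowcount == 0:
        break
    rows_deleted_per_pass.append(cursor.rowcount)

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM jobs ORDER BY id")]
print(rows_deleted_per_pass, remaining_ids)
```

The group of three takes two passes, one row each, and the lowest id of the group is the one that survives.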

A faster way is to insert the distinct rows into a temporary table. Using DELETE, it took me a few hours to remove duplicates from a table of 8 million rows. Using INSERT and DISTINCT, it took just 13 minutes.

    CREATE TABLE tempTableName LIKE tableName;
    CREATE INDEX ix_all_id ON tableName(cellId, attributeId, entityRowId, value);

    INSERT INTO tempTableName(cellId, attributeId, entityRowId, value)
    SELECT DISTINCT cellId, attributeId, entityRowId, value
    FROM tableName;

    TRUNCATE TABLE tableName;
    INSERT INTO tableName SELECT * FROM tempTableName;
    DROP TABLE tempTableName;

This solution will move the duplicates into one table and the uniques into another.

    -- speed up creating the uniques table if dealing with many rows
    CREATE INDEX temp_idx ON jobs(site_id, company, title, location);

    -- fill the table with unique rows (jobs_uniques must already exist)
    INSERT jobs_uniques
    SELECT * FROM (
        SELECT * FROM jobs
        GROUP BY site_id, company, title, location
        HAVING COUNT(1) > 1
        UNION
        SELECT * FROM jobs
        GROUP BY site_id, company, title, location
        HAVING COUNT(1) = 1
    ) x;

    -- fill the table with duplicate rows (jobs_dupes must already exist)
    INSERT jobs_dupes
    SELECT * FROM jobs
    WHERE id NOT IN (SELECT id FROM jobs_uniques);

    -- confirm the difference between uniques and dupes tables
    SELECT COUNT(1) AS jobs,
           (SELECT COUNT(1) FROM jobs_dupes) + (SELECT COUNT(1) FROM jobs_uniques) AS sum
    FROM jobs;

Simple and fast for all cases:

    CREATE TEMPORARY TABLE IF NOT EXISTS _temp_duplicates AS (
        SELECT dub.id
        FROM table_with_duplications dub
        GROUP BY dub.field_must_be_uniq_1, dub.field_must_be_uniq_2
        HAVING COUNT(*) > 1
    );

    DELETE FROM table_with_duplications
    WHERE id IN (SELECT id FROM _temp_duplicates);

I keep landing on this page whenever I google "remove duplicates from mysql", but the IGNORE solutions don't work for me because I have InnoDB MySQL tables.

This code works better in any case:

    CREATE TABLE tableToclean_temp LIKE tableToclean;
    ALTER TABLE tableToclean_temp ADD UNIQUE INDEX (fontsinuse_id);
    INSERT IGNORE INTO tableToclean_temp SELECT * FROM tableToclean;
    DROP TABLE tableToclean;
    RENAME TABLE tableToclean_temp TO tableToclean;
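The copy-into-a-unique-table idea can be tried out locally with a rough sketch like the one below. It assumes SQLite (Python's built-in sqlite3 module) instead of MySQL, so a few things differ and are assumptions of this sketch: `INSERT OR IGNORE` stands in for MySQL's `INSERT IGNORE`, the table definition is spelled out because SQLite has no `CREATE TABLE ... LIKE`, the `note` column and sample rows are invented, and an `ORDER BY id` is added so that the lowest id of each duplicate group is the one that gets kept.

```python
import sqlite3

# SQLite in memory is an assumption used only to try the logic locally.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tableToclean ("
    "id INTEGER PRIMARY KEY, fontsinuse_id INTEGER, note TEXT)"
)
conn.executemany(
    "INSERT INTO tableToclean VALUES (?, ?, ?)",
    [
        (1, 7, "first"),
        (2, 7, "second"),  # same fontsinuse_id as id 1
        (3, 8, "third"),
    ],
)

# Empty copy of the table with a unique index on the column to deduplicate.
conn.execute(
    "CREATE TABLE tableToclean_temp ("
    "id INTEGER PRIMARY KEY, fontsinuse_id INTEGER, note TEXT)"
)
conn.execute(
    "CREATE UNIQUE INDEX ux_fontsinuse ON tableToclean_temp (fontsinuse_id)"
)

# Rows that would violate the unique index are silently skipped.
conn.execute(
    "INSERT OR IGNORE INTO tableToclean_temp "
    "SELECT * FROM tableToclean ORDER BY id"
)
conn.commit()

# Swap the cleaned copy in place of the original.
conn.execute("DROP TABLE tableToclean")
conn.execute("ALTER TABLE tableToclean_temp RENAME TO tableToclean")

rows = conn.execute(
    "SELECT id, fontsinuse_id, note FROM tableToclean ORDER BY id"
).fetchall()
print(rows)
```

Since the original table is dropped in the process, a backup beforehand is a sensible precaution.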

tableToclean = the name of the table you need to clean

tableToclean_temp = a temporary table that is created and deleted

I like to be a bit more specific about which records I delete, so here is my solution:

    delete from jobs c1
    where not c1.location = 'Paris'
      and c1.site_id > 64218
      and exists (
          select *
          from jobs c2
          where c2.site_id = c1.site_id
            and c2.company = c1.company
            and c2.location = c1.location
            and c2.title = c1.title
            and c2.site_id > 63412
            and c2.site_id < 64219
      )

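You can easily delete the duplicate records with this code:

    // Note: the mysql_* functions were removed in PHP 7; this needs PHP 5 or a port to mysqli/PDO.
    $qry = mysql_query("SELECT * from cities");
    while ($qry_row = mysql_fetch_array($qry)) {
        $qry2 = mysql_query("SELECT * from cities2 where city = '" . mysql_real_escape_string($qry_row['city']) . "'");
        if (mysql_num_rows($qry2) > 1) {
            $city_arry = array(); // reset for each city so earlier matches are not carried over
            while ($row = mysql_fetch_array($qry2)) {
                $city_arry[] = $row;
            }
            // keep the first match and delete the rest (column 0 is assumed to be town_id)
            $total = sizeof($city_arry) - 1;
            for ($i = 1; $i <= $total; $i++) {
                mysql_query("delete from cities2 where town_id = '" . $city_arry[$i][0] . "'");
            }
        }
        //exit;
    }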

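I had to do this with text fields and ran into the limit of 100 bytes on the index.

I solved this by adding a column, filling it with an md5 hash of the fields, and then doing the alter.

    ALTER TABLE `table` ADD `merged` VARCHAR(40) NOT NULL;
    UPDATE `table` SET `merged` = MD5(CONCAT(`col1`, `col2`, `col3`));
    ALTER IGNORE TABLE `table` ADD UNIQUE INDEX idx_name (`merged`);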
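One caveat with hashing a plain concatenation: rows such as ('ab', 'c') and ('a', 'bc') concatenate to the same string, so they get the same hash and would be treated as duplicates. The sketch below uses Python's hashlib to show the effect and how a separator between the fields avoids it; in MySQL, `CONCAT_WS` with a separator that cannot occur in the data might serve the same purpose. The helper name `merged_hash`, the sample values and the choice of separator are all hypothetical.

```python
import hashlib


def merged_hash(*cols, sep=""):
    """MD5 hex digest of the column values joined by sep (hypothetical helper)."""
    return hashlib.md5(sep.join(cols).encode("utf-8")).hexdigest()


# Two different rows whose plain concatenation is the same string "abcx".
row_a = ("ab", "c", "x")
row_b = ("a", "bc", "x")

# Without a separator the two rows are indistinguishable after hashing.
plain_collision = merged_hash(*row_a) == merged_hash(*row_b)

# With a separator that does not occur in the data they stay distinct.
separated_collision = (
    merged_hash(*row_a, sep="\x1f") == merged_hash(*row_b, sep="\x1f")
)

# An MD5 hex digest is 32 characters, so it fits in the VARCHAR(40) column.
digest_length = len(merged_hash(*row_a))

print(plain_collision, separated_collision, digest_length)
```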