MySQL and NoSQL: Help me to choose the right one

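There is a big database with 1,000,000,000 rows, called threads (these threads really exist; I'm not making things harder just because I enjoy it). Its columns are: (int id, string hash, int replycount, int dateline (timestamp), int forumid, string title)

Query: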

select * from thread where forumid = 100 and replycount > 1 order by dateline desc limit 10000, 100

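Since there are 1G records, this is quite a slow query. So I thought: let's split these 1G records into as many tables as I have forums (categories)! That is almost perfect. With many tables I have fewer records to search through, and it really is faster. The query now becomes: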

select * from thread_{forum_id} where replycount > 1 order by dateline desc limit 10000, 100

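This is really fast for 99% of the forums (categories), since most of them have only a few topics (100k-1M). However, because some have about 10M records, some queries are still too slow (0.1/0.2 seconds, too much for my app! And I'm already using indexes!).

I don't know how to improve this using MySQL. Is there a way?

For this project I will use 10 servers (12GB RAM, 4x7200rpm hard disks in software RAID 10, quad core).

The idea was to simply split the databases among the servers, but given the problem explained above, that is still not enough.

If I install Cassandra on these 10 servers (supposing I find the time to make it work as it is supposed to), should I expect a performance boost?

What should I do? Keep working with MySQL as a distributed database on multiple machines, or build a Cassandra cluster?

I was asked to post what the indexes are. Here they are: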

    mysql> show index in thread;
    PRIMARY id
    forumid
    dateline
    replycount

EXPLAIN of the SELECT:

    mysql> explain SELECT * FROM thread WHERE forumid = 655 AND visible = 1 AND open <> 10 ORDER BY dateline ASC LIMIT 268000, 250;
    +----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
    | id | select_type | table  | type | possible_keys | key     | key_len | ref         | rows   | Extra                       |
    +----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+
    |  1 | SIMPLE      | thread | ref  | forumid       | forumid | 4       | const,const | 221575 | Using where; Using filesort |
    +----+-------------+--------+------+---------------+---------+---------+-------------+--------+-----------------------------+

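You should read the following and learn a little about the advantages of a well-designed InnoDB table and how best to use clustered indexes, which are only available with InnoDB!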

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/

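Then design your system along the lines of the following simplified example:

Example schema (simplified)

The important features are that the tables use the InnoDB engine and that the primary key for the threads table is no longer a single auto-incrementing key but a composite clustered key based on the combination of forum_id and thread_id. For example: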

    threads - primary key (forum_id, thread_id)

    forum_id    thread_id
    ========    =========
    1                   1
    1                   2
    1                   3
    1                 ...
    1             2058300
    2                   1
    2                   2
    2                   3
    2                 ...
    2             2352141
    ...

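Each forum row includes a counter called next_thread_id (unsigned int), which is maintained by a trigger and is incremented every time a thread is added to a given forum. This also means we can store 4 billion threads per forum, rather than the 4 billion threads in total we would be limited to with a single auto_increment primary key for thread_id.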

    forum_id    title       next_thread_id
    ========    =====       ==============
    1           forum 1            2058300
    2           forum 2            2352141
    3           forum 3            2482805
    4           forum 4            3740957
    ...
    64          forum 64           3243097
    65          forum 65          15000000 -- ooh a big one
    66          forum 66           5038900
    67          forum 67           4449764
    ...
    247         forum 247                0 -- still loading data for half the forums !
    248         forum 248                0
    249         forum 249                0
    250         forum 250                0

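The disadvantage of using a composite key is that you can no longer select a thread by a single key value, as follows:

    select * from threads where thread_id = y;

You have to do:

    select * from threads where forum_id = x and thread_id = y;

However, your application code should know which forum a user is browsing, so this is not exactly difficult to implement: store the currently viewed forum_id in a session variable, a hidden form field, etc.

Here's the simplified schema: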

    drop table if exists forums;
    create table forums
    (
    forum_id smallint unsigned not null auto_increment primary key,
    title varchar(255) unique not null,
    next_thread_id int unsigned not null default 0 -- count of threads in each forum
    )engine=innodb;

    drop table if exists threads;
    create table threads
    (
    forum_id smallint unsigned not null,
    thread_id int unsigned not null default 0,
    reply_count int unsigned not null default 0,
    hash char(32) not null,
    created_date datetime not null,
    primary key (forum_id, thread_id, reply_count) -- composite clustered index
    )engine=innodb;

    delimiter #

    create trigger threads_before_ins_trig before insert on threads
    for each row
    begin
    declare v_id int unsigned default 0;

      -- read the next per-forum id; "for update" locks the forum row so that
      -- two concurrent inserts cannot pick up the same counter value
      select next_thread_id + 1 into v_id from forums where forum_id = new.forum_id for update;
      set new.thread_id = v_id;
      -- store the id just handed out as the forum's new counter value
      update forums set next_thread_id = v_id where forum_id = new.forum_id;
    end#

    delimiter ;

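You may have noticed that I've included reply_count as part of the primary key, which is a bit strange since the (forum_id, thread_id) composite is unique in itself. This is just an index optimisation that saves some I/O when queries that use reply_count are executed. Please refer to the 2 links above for further information.

Example queries

I'm still loading data into my example tables, and so far I have loaded approx. 500 million rows (half as many as your system). When the load process is complete, I should have approximately: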

    250 forums * 5 million threads = 1 250 000 000 (1.25 billion rows)

I've deliberately made some of the forums contain more than 5 million threads; for example, forum 65 has 15 million threads:

    forum_id    title       next_thread_id
    ========    =====       ==============
    65          forum 65          15000000 -- ooh a big one

Query runtimes

    select sum(next_thread_id) from forums;

    sum(next_thread_id)
    ===================
    539,155,433 (500 million threads so far and still growing...)

Under InnoDB, summing the next_thread_ids to get a total thread count is much faster than the usual:

    select count(*) from threads;

How many threads does forum 65 have:

    select next_thread_id from forums where forum_id = 65

    next_thread_id
    ==============
    15,000,000 (15 million)

Again, this is faster than the usual:

    select count(*) from threads where forum_id = 65

OK, now we know we have about 500 million threads so far and that forum 65 has 15 million threads. Let's see how the schema performs :)

    select forum_id, thread_id from threads where forum_id = 65 and reply_count > 64 order by thread_id desc limit 32;

    runtime = 0.022 secs

    select forum_id, thread_id from threads where forum_id = 65 and reply_count > 1 order by thread_id desc limit 10000, 100;

    runtime = 0.027 secs

Looks pretty performant to me: that's a single table with 500+ million rows (and growing), with a query that covers 15 million rows in about 0.02 seconds (while under load!).

Further optimisations

These would include:

partitioning by range
sharding
throwing money and hardware at it

etc...
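As an illustration of the first item, range partitioning of the threads table from the example schema might look like the sketch below. The partition names and boundaries are made up for this example and would need to be chosen from the real size of each forum. MySQL requires the partitioning column to be part of every unique key on the table, which holds here because forum_id is in the primary key.

```sql
-- Sketch only: partition names and boundaries are hypothetical.
-- Splits the example `threads` table into four ranges of forum_id.
ALTER TABLE threads
PARTITION BY RANGE (forum_id) (
    PARTITION p_forums_001_063 VALUES LESS THAN (64),
    PARTITION p_forums_064_127 VALUES LESS THAN (128),
    PARTITION p_forums_128_191 VALUES LESS THAN (192),
    PARTITION p_forums_rest    VALUES LESS THAN MAXVALUE
);

-- A query that fixes forum_id only needs to touch one partition
-- (forum 65 falls into p_forums_064_127).
SELECT forum_id, thread_id
FROM threads
WHERE forum_id = 65
  AND reply_count > 1
ORDER BY thread_id DESC
LIMIT 100;
```

Whether this helps beyond what the clustered primary key already provides depends on the workload, so it is worth measuring before and after.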

Hope you find this answer helpful :)

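EDIT: Your single-column indexes are not enough. You need to cover, at least, the three columns involved.

More advanced solution: replace replycount > 1 with hasreplies = 1 by creating a new hasreplies field that equals 1 when replycount > 1. Once this is done, create an index on the three columns, in this order: INDEX(forumid, hasreplies, dateline). Make sure it is a BTREE index, so that it supports ordering.

You are selecting based on:

a given forumid
a given hasreplies
ordered by dateline

Once you do this, your query execution will involve:

Moving down the BTREE to find the subtree that matches forumid = X. This is a logarithmic operation (duration: log(number of forums)).
Moving further down the BTREE to find the subtree that matches hasreplies = 1 (while still matching forumid = X). This is a constant-time operation, because hasreplies is only 0 or 1.
Moving through the dateline-sorted subtree to get the required results, without having to read and re-sort the entire list of items in the forum.

My earlier suggestion to index on replycount was incorrect, because that would have been a range query and would therefore have prevented the use of dateline to sort the results (you would have selected the threads with replies very quickly, but the resulting million-row list would have had to be sorted completely before picking out the 100 elements you needed).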
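A minimal sketch of the hasreplies approach, assuming the question's thread table with columns forumid, replycount and dateline. The column type and the index name are illustrative choices, not something given in the question.

```sql
-- 1. Add the flag column and backfill it from replycount.
--    On a table this large the backfill would normally be run in batches.
ALTER TABLE thread
    ADD COLUMN hasreplies TINYINT UNSIGNED NOT NULL DEFAULT 0;

UPDATE thread
SET hasreplies = 1
WHERE replycount > 1;

-- 2. Composite BTREE index: the two equality columns first, the sort column last.
CREATE INDEX idx_forum_hasreplies_dateline
    ON thread (forumid, hasreplies, dateline);

-- 3. The query is now two equality tests followed by an index-ordered scan.
SELECT *
FROM thread
WHERE forumid = 100
  AND hasreplies = 1
ORDER BY dateline DESC
LIMIT 100;

-- 4. Check the plan: the new index should be chosen as the key, and
--    "Using filesort" should no longer appear in the Extra column.
EXPLAIN
SELECT *
FROM thread
WHERE forumid = 100
  AND hasreplies = 1
ORDER BY dateline DESC
LIMIT 100;
```

The application (or a trigger) has to keep hasreplies in step with replycount whenever a reply is added or removed; otherwise the flag drifts away from the real data.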

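IMPORTANT: while this improves performance in all cases, your huge OFFSET value (10000!) is going to hurt performance, because MySQL does not seem to be able to skip ahead even though it is reading straight through a BTREE. So the larger your OFFSET, the slower the request becomes.

I'm afraid the OFFSET problem is not automagically solved by spreading the computation over several machines (how do you skip an offset in parallel, anyway?) or by moving to NoSQL. All solutions (including NoSQL ones) boil down to simulating OFFSET based on dateline: basically saying dateline > Y LIMIT 100 instead of LIMIT Z, 100, where Y is the date of the item at offset Z (with a descending sort the comparison becomes dateline < Y). This works and eliminates any performance issue related to the offset, but it prevents jumping directly to page 100 out of 200.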
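The dateline-based replacement for OFFSET could be sketched as follows, using the question's thread table. The id tie-breaker is an addition of this sketch (dateline values are not guaranteed to be unique), and the values assigned to the user variables are placeholders that the application would take from the last row of the previous page.

```sql
-- First page: the newest 100 threads with replies in forum 100.
SELECT id, dateline, title
FROM thread
WHERE forumid = 100
  AND replycount > 1
ORDER BY dateline DESC, id DESC
LIMIT 100;

-- Next page: seek past the last row already shown instead of using OFFSET.
SET @last_dateline = 1288000000;  -- hypothetical: dateline of the last row shown
SET @last_id       = 123456;      -- hypothetical: id of the last row shown

SELECT id, dateline, title
FROM thread
WHERE forumid = 100
  AND replycount > 1
  AND dateline <= @last_dateline
  AND (dateline < @last_dateline OR id < @last_id)  -- tie-breaker on id
ORDER BY dateline DESC, id DESC
LIMIT 100;
```

Each page costs roughly the same regardless of how deep it is, at the price mentioned above: pages can only be reached one after another, not by number.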

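Part of the question concerns the choice between NoSQL and MySQL, and there is a fundamental point hidden here. SQL is easy for humans to write but takes extra work for a computer to read. For high-volume databases I would recommend avoiding an SQL backend, because it requires an extra step: command parsing. I have done extensive benchmarking, and there are cases where the SQL parser is the slowest point. There is nothing you can do about it. Well, you can possibly use pre-parsed (prepared) statements and access those.

By the way, it is not widely known, but MySQL grew out of a NoSQL database. The company where MySQL's authors David and Monty worked was a data warehousing company, and they often had to write custom solutions for uncommon tasks. This led to a large stack of homebrew C libraries used to hand-write database functions when Oracle and others performed poorly. SQL was added to this nearly 20-year-old zoo in 1996, for fun. You know what came after.

Actually, you can avoid the SQL overhead with MySQL. Usually SQL parsing is not the slowest part, but it is good to know. To test parser overhead, you can simply benchmark a trivial statement such as "SELECT 1".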
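The pre-parsed statements mentioned above correspond to MySQL's server-side prepared statements. A minimal sketch using the SQL-level syntax and the question's query follows; the statement name and variable names are made up for the example, and client libraries usually expose the same feature through their own prepare/execute calls.

```sql
-- Parse the statement once; ? marks the values supplied at execution time.
PREPARE threads_page FROM
    'SELECT * FROM thread
     WHERE forumid = ? AND replycount > 1
     ORDER BY dateline DESC
     LIMIT ?, ?';

-- Execute it for forum 100 ...
SET @forum = 100, @skip = 10000, @page_size = 100;
EXECUTE threads_page USING @forum, @skip, @page_size;

-- ... and again for forum 655 without re-parsing the SQL text.
SET @forum = 655;
EXECUTE threads_page USING @forum, @skip, @page_size;

-- Release the statement when it is no longer needed.
DEALLOCATE PREPARE threads_page;
```

A prepared statement lives only for the session that created it, so the saving matters mainly for connections that run the same query many times.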

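You should not be trying to fit a database architecture to the hardware you are planning to buy; instead, plan to buy hardware that fits your database architecture.

Once you have enough RAM to keep the working set of indexes in memory, all queries that can use indexes will be fast. Make sure your key buffer is large enough to hold the indexes (key_buffer_size applies to MyISAM tables; for InnoDB the corresponding setting is innodb_buffer_pool_size).

So if 12GB is not enough, don't use 10 servers with 12GB of RAM each; use fewer servers with 32GB or 64GB of RAM.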
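One way to compare index size against available memory is to ask information_schema for a rough estimate and then look at the configured cache sizes. This is a sketch; for InnoDB the reported sizes are approximations, and the clustered primary key is counted under data_length rather than index_length.

```sql
-- Approximate data and index footprint per table in the current schema, in GiB.
SELECT table_name,
       engine,
       ROUND(data_length  / POW(1024, 3), 2) AS data_gib,
       ROUND(index_length / POW(1024, 3), 2) AS index_gib
FROM information_schema.tables
WHERE table_schema = DATABASE()
ORDER BY index_length DESC;

-- Memory currently configured for caching.
SHOW VARIABLES LIKE 'key_buffer_size';          -- MyISAM index cache
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';  -- InnoDB data and index cache
```

If the indexes that the hot queries use do not fit in the relevant cache, that points toward fewer machines with more RAM, as suggested above.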

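Indexes are a must, but remember to choose the right type of index: BTREE is more suitable when your WHERE clauses use "<" or ">", while HASH is more suitable when a column has many distinct values and your WHERE clause uses "=" or "<=>". (Note that in MySQL explicit HASH indexes are available only with some storage engines, such as MEMORY; InnoDB and MyISAM tables use BTREE.)

Further reading: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html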
