ES(五)Why Skip List

Why Skip List ?

为了支持全文搜索,Lucene中使用了倒排索引,但是倒排索引也有很多分词后的词(Term),而且这些词的数目可能是非常大的,那如何对这些倒排索引的词进行快速的检索呢?很容易想到的就是在对这些词建个索引么,直接想到的就是,量大的化内存放不下,那就像Mysql一样,B+Tree数据结构搞个索引。但是在luence中,使用的是skip-list,why?

首先,这个问题,第一个让我联想到的是Java的CurrentSkipListMap,当然可以有CurrentTreeMap,但是Java8还没有。如果有,那有什么区别?其实本质都是数据结构的问题,skiplist与红黑树都支持范围的查询,而是时间复杂度也一样的。唯一的区别就是在频繁的插入或删除情况下,为了有更好的查找性能,树需要经常的旋转和调整,而调整的过程中,需要锁住很多的节点,这也是这个skiplist数据结构的特点,并发好。

引用一个回答:How does lucene index documents?

For those curious why Lucene uses Skip-Lists, while most databases use (B+)- and/or (B)-Trees, take a look at the right SO answer regarding this question (Skip-Lists vs. B-Trees). That answer gives a pretty good, deep explanation - essentially, not so much make concurrent updates of the index “more amenable” (because you can decide to not re-balance a B-Tree immediately, thereby gaining about the same concurrent performance as a Skip-List), but rather, Skip-Lists save you from having to work on the (delayed or not) balancing operation (ultimately) needed by B-Trees (In fact, as the answer shows/references, there is probably very little performance difference between B-Trees and [multi-level] Skip-Lists, if either are “done right.”)

分析下上面的答主关键点:

如果说RBTree在更新的时候是需要锁住很多的节点的,那Skip-Lists呢?锁住的很少只需要相邻的节点即可。

另外,在Skip List vs. Binary Tree中一个高票回答,也说明了,基于tree的实现在并发上能够达到Skip List的性能,但是,非常容易screw-up:

Gramoli’s conclusion is that’s much easier to screw-up a CAS-based concurrent tree implementation than it is to screw up a similar skip list.

而且回答中还带了论文的数据,不过没打开。

总结,why Lucene use Skip List not B-Tree ? 在达到同样性能的条件下,使用Skip List更容易实现。

以上,是目前查到资料得到的答案,后续待补。

参考

Lucene学习总结之三:Lucene的索引文件格式 (1)
Lucene’s RAM usage for searching
Apache Lucene - Index File Formats

Table of Contents