Document (Under Construction!)

Document / Word

"Document": a large collection of strings.

"Word": a single string.

tokenizing
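
A minimal sketch of tokenizing, assuming that a word is a maximal run of letters, digits, or underscores and that case does not matter. The function name `tokenize` is made up for illustration; real tokenizers handle punctuation, hyphens, and languages without spaces far more carefully.

```python
import re

def tokenize(text):
    """Split a document (one big string) into lowercase words."""
    # \w+ matches maximal runs of word characters; everything else is a separator.
    return re.findall(r"\w+", text.lower())

tokens = tokenize("Hello, world! Hello?")
```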

index / n-gram / k-mer
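
A small sketch of an n-gram inverted index, assuming character n-grams without padding and a tiny in-memory word list. The names `ngrams` and `build_index` are illustrative only. In bioinformatics the same substrings are called k-mers.

```python
from collections import defaultdict

def ngrams(s, n=2):
    """Return the overlapping character n-grams of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_index(words, n=2):
    """Map each n-gram to the sorted ids of the words that contain it."""
    index = defaultdict(set)
    for word_id, word in enumerate(words):
        for gram in ngrams(word, n):
            index[gram].add(word_id)
    return {gram: sorted(ids) for gram, ids in index.items()}

words = ["banana", "bandana", "cabana"]
index = build_index(words)
```

Looking up `index["nd"]` gives the ids of every word containing "nd", without scanning the words themselves.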

ScanCount
Efficient Merging and Filtering Algorithms for Approximate String Searches
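
A rough sketch of the counting idea behind ScanCount, not the paper's exact implementation: scan the inverted list of every n-gram of the query, count how often each word id appears, and keep the ids whose count reaches a threshold. The assumptions here are character bigrams without padding, inverted lists that store each word id once per gram, and the threshold `(len(query) - n + 1) - k * n` for edit distance `k`. All function names are illustrative.

```python
from collections import defaultdict

def ngrams(s, n=2):
    """Return the overlapping character n-grams of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_index(words, n=2):
    """Map each n-gram to the set of ids of the words that contain it."""
    index = defaultdict(set)
    for word_id, word in enumerate(words):
        for gram in ngrams(word, n):
            index[gram].add(word_id)
    return index

def scan_count(query, index, num_words, threshold, n=2):
    """Return the ids of words hit at least `threshold` times by the query's n-grams."""
    counts = [0] * num_words
    # Repeated query grams are scanned repeatedly. This can only overcount,
    # so it may add false candidates but never drops a true match.
    for gram in ngrams(query, n):
        for word_id in index.get(gram, ()):
            counts[word_id] += 1
    return [word_id for word_id, c in enumerate(counts) if c >= threshold]

words = ["banana", "bandana", "cabana", "cherry"]
index = build_index(words)

query, k, n = "banana", 1, 2
# One edit destroys at most n grams, so a word within edit distance k
# must still match at least this many of the query's grams.
threshold = len(ngrams(query, n)) - k * n
candidates = scan_count(query, index, len(words), threshold, n)
```

The result is only a candidate set. Each candidate still has to be verified with a real edit-distance computation, since the count is a necessary condition and not a sufficient one.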

frequency / correlation

topic model = collaborative filtering
TF-IDF
word2vec
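
A minimal sketch of TF-IDF, assuming one common variant: term frequency is the count divided by the document length, and inverse document frequency is `log(N / df)` with no smoothing. Other variants add smoothing or normalize differently. The function name `tf_idf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
weights = tf_idf(docs)
```

A term that occurs in every document ("the") gets weight zero, while a term unique to one document ("dog") gets the largest weight in that document.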

top-k / queryselector
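
A small sketch of a top-k query, assuming the scores (for example, relevance of each document to a query) are already computed and fit in memory. The name `top_k` and the sample scores are illustrative only. `heapq.nlargest` avoids sorting the whole collection when k is small.

```python
import heapq

def top_k(scores, k):
    """Return the k (item, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])

scores = {"doc_a": 0.12, "doc_b": 0.87, "doc_c": 0.45, "doc_d": 0.03}
best = top_k(scores, 2)
```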

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5445192/
wavelet tree