String Similarity Joins: An Experimental Evaluation

Yu Jiang

Guoliang Li

Jianhua Feng

String similarity join is an important operation in data integration and cleansing that finds similar string pairs from two collections of strings. More than ten algorithms have been proposed to address this problem over the past two decades. However, existing algorithms have not been thoroughly compared under the same experimental framework; for example, some algorithms are tested only on specific datasets. This makes it rather difficult for practitioners to decide which algorithms should be used in various scenarios. To address this problem, in this paper we provide a comprehensive survey of a wide spectrum of existing string similarity join algorithms, classify them into categories based on their main techniques, and compare them through extensive experiments on a variety of real-world datasets with different characteristics. We also report comprehensive findings obtained from the experiments and provide new insights into the strengths and weaknesses of existing similarity join algorithms, which can guide practitioners in selecting appropriate algorithms for various scenarios.
  • Our publication in VLDB 2014: PDF version.
  • ELF executables (64-bit, statically linked, built by gcc 4.8.2 on Ubuntu 14.04): elfs.zip.
  • Source code
  • Datasets (in .tar.bz2 format): dblp (37MB), enron (59MB), querylog (7MB), trec (70MB), word (1MB).
  • Please cite the following paper if you use the code and datasets.
    Yu Jiang, Guoliang Li, Jianhua Feng, Wen-Syan Li: String Similarity Joins: An Experimental Evaluation. PVLDB 7(8): 625-636 (2014) Bib File

Edit Distance

The input file should be a plain text file, where each line represents a record string. The above datasets can be used directly for edit distance.
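For reference, the edit distance used throughout is the standard Levenshtein distance (the minimum number of single-character insertions, deletions, and substitutions); a pair of records is a join result when its edit distance is at most the threshold tau. A minimal sketch of the metric itself (not of the join algorithms, which avoid computing it for most pairs):

```python
def edit_distance(s, t):
    # Classic dynamic-programming Levenshtein distance. prev holds the
    # previous row of the DP matrix, cur the row currently being filled.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]
```

For example, edit_distance("dblp", "dbpl") is 2, so with tau = 2 that pair of records would be reported as similar.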

Token-based Metrics

The input file should be a binary file, where each record is a tuple of three fields: (id, length, tokens), encoded as 2 + length 32-bit integers. Here is an example input file with 3 records:
  • id = 0, length = 3, tokens = {1,2,4}
  • id = 1, length = 3, tokens = {0,3,4}
  • id = 2, length = 4, tokens = {1,2,3,4}
The file content, dumped with od -x:
              0000000 0000 0000 0003 0000 0001 0000 0002 0000
              0000020 0004 0000 0001 0000 0003 0000 0000 0000
              0000040 0003 0000 0004 0000 0002 0000 0004 0000
              0000060 0001 0000 0002 0000 0003 0000 0004 0000
              0000100 
The input file should also satisfy the following restrictions:
  1. A record should not contain duplicate tokens.
  2. The tokens should be sorted in ascending order.
  3. All the records should be sorted by length in ascending order.
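For illustration, the three-record example above can be generated with a short script (the output file name records.bin is arbitrary; the byte order is little-endian, matching the od -x dump above):

```python
import struct

# The three example records as (id, tokens). They already satisfy the
# restrictions: no duplicate tokens, tokens ascending, records sorted by length.
records = [
    (0, [1, 2, 4]),
    (1, [0, 3, 4]),
    (2, [1, 2, 3, 4]),
]

with open("records.bin", "wb") as f:
    for rid, tokens in records:
        # (id, length, tokens): 2 + length little-endian 32-bit integers
        f.write(struct.pack("<%dI" % (2 + len(tokens)),
                            rid, len(tokens), *tokens))
```

The resulting file is 64 bytes (0000100 in octal), exactly as shown in the dump.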
Here are some notations:
  • [q] stands for the q-gram length, typically in [3,8] for most algorithms.
  • [method] stands for different methods for token-based metric and should be one of "jaccard", "cosine" and "dice".
  • [tau] stands for the threshold for edit distance.
  • [delta] stands for the threshold for token-based metrics.
  • [file] stands for the data file.
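For reference, the three token-based metrics are defined on token sets in the standard way, and a pair of records is a join result when its similarity is at least delta. A minimal sketch of the metrics themselves:

```python
import math

def similarity(r, s, method):
    # r, s: token collections; returns a similarity value in [0, 1].
    r, s = set(r), set(s)
    overlap = len(r & s)
    if method == "jaccard":                       # |r ∩ s| / |r ∪ s|
        return overlap / len(r | s)
    if method == "cosine":                        # |r ∩ s| / sqrt(|r| * |s|)
        return overlap / math.sqrt(len(r) * len(s))
    if method == "dice":                          # 2|r ∩ s| / (|r| + |s|)
        return 2 * overlap / (len(r) + len(s))
    raise ValueError("unknown method: " + method)
```

For the example records above, similarity({1,2,4}, {1,2,3,4}, "jaccard") = 3/4 = 0.75, so the pair (0, 2) is a result for any delta <= 0.75.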

Edit Distance

Method      Executable Name        Command Line
BiTrieJoin  bitriejoin-ed          ./bitriejoin-ed [file] [tau]
TrieJoin    triejoin-ed            ./triejoin-ed [file] [tau]
PassJoin    passjoin-ed            ./passjoin-ed [file] [tau]
FastSS      fastss-ed              ./fastss-ed [file] [tau]
M-Tree      mtree-ed               ./mtree-ed [file] [tau]
Allpair     allpair-ed             ./allpair-ed [file] [tau] [q]
EDJoin      edjoin-ed              ./edjoin-ed [file] [tau] [q]
QChunk      qchunk-indexgram-ed    ./qchunk-indexgram-ed [file] [tau] [q]
            qchunk-indexchunk-ed   ./qchunk-indexchunk-ed [file] [tau] [q]
PartEnum    partenum-ed            ./partenum-ed [file] [tau] [N1] [N2][1]
ListMerger  listmerger-ed          ./listmerger-ed [file] [tau] [q] [type][2]
AdaptJoin   adaptjoin              ./adaptjoin ed [gram|indexgram|indexchunk][3] [file] [tau] [q]
[1]. For details of N1 and N2, please see reference [1] of our paper.
[2]. type = 0: HeapMerger, type = 1: MergeOpt, type = 2: ScanCount, type = 3: MergeSkip, type = 4: DivideSkip. For more details, please see the Flamingo project and references [13] and [18] of our paper.
[3]. Should be one of "gram", "indexgram" and "indexchunk". For more details, please read our paper or reference [23].
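To clarify the role of [q]: the q-gram-based methods above decompose each string into its overlapping substrings of length q and prune pairs that share too few of them before verifying the edit distance. A minimal sketch of q-gram extraction and the commonly used count filter (if ed(s, t) <= tau, the two strings must share at least max(|s|,|t|) - q + 1 - q*tau q-grams, since one edit operation destroys at most q q-grams; the exact bounds used by each algorithm are given in our paper):

```python
from collections import Counter

def qgrams(s, q):
    # All overlapping substrings of length q (no padding).
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def count_filter_passes(s, t, q, tau):
    # Count filter: a candidate pair survives only if the number of
    # shared q-grams reaches the lower bound implied by threshold tau.
    common = sum((Counter(qgrams(s, q)) & Counter(qgrams(t, q))).values())
    lower_bound = max(len(s), len(t)) - q + 1 - q * tau
    return common >= lower_bound
```

Pairs failing the filter can be discarded without computing the (more expensive) edit distance.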

Token-based Metrics

Method      Executable Name     Command Line
Allpair     allpair-token       ./allpair-token [method] [file] [delta]
PPJoin+     ppjoinplus-token    ./ppjoinplus-token [method] [file] [delta]
PPJoin      ppjoin-token        ./ppjoin-token [method] [file] [delta]
AdaptJoin   adaptjoin           ./adaptjoin [method] [file] [delta]
ListMerger  listmerger-token    ./listmerger-token [method] [file] [delta] [type][1]
PartEnum    partenum-token      ./partenum-token [method] [file] [delta]
PassJoin    passjoin-token      ./passjoin-token [method] [file] [delta]
[1]. type = 0: HeapMerger, type = 1: MergeOpt, type = 2: ScanCount, type = 3: MergeSkip, type = 4: DivideSkip. For more details, please see the Flamingo project and references [13] and [18] of our paper.
The output usually contains 4 lines: the total number of results, the total number of candidates, the overall running time, and the peak memory usage.
Notice that the executables do not currently output the join results themselves, but this is easy to enable by modifying the source code if necessary.

Contact

For any questions about this study, please feel free to contact Guoliang Li.