mardi 30 juin 2015

How to full text search in SQLite without tokenizing, but rather looking for substrings

I would like to utilize the full-text search features (FTS3 or FTS4 extensions) of SQLite to search through biological (DNA, protein) sequences. Unlike what seems to be the designer use-case for FTS, I'd like my search to match any substring of a specific column, with no regard to delimiters or tokens.

I saw I can limit the columns being indexed by using the notindexed= option, and that I can use prefix matching by using the prefix= option. Since my sequences don't include any delimiters (they only include latin alphabet characters), I can regard each of the sequences as a single token, and set up a prefix search. This has two problems, though. The first is that I'd like to search through substrings, not only prefixes. The second is that each sequence will be stored twice, if I'm not mistaken - once as the contents, and the second time as a token. I could create a contentless table to overcome the second caveat, but I don't know how to address the first.

To conclude, my question is: is there a way to full-text search (using an index) on substrings which aren't prefixes in SQLite?

Thanks!

1 commentaire:

  1. While it is about 6 years late, they now have a trigram tokenizer capable of this in FTS5.

    RépondreSupprimer