My SQLite table MyTable is scraped from the internet (it is created as a virtual FTS4 table). It has a column "author", and many rows whose author field is "W. Lück", "W. Lueck" or "W. Luck".
The point is that these entries all refer to the same person, i.e. some Unicode characters get ASCII-fied in many different ways "in the wild". When I do a full-text search
SELECT * FROM MyTable WHERE author MATCH 'Lück'
obviously I won't pick up the rows whose "author" column is "W. Luck" or "W. Lueck". My understanding is that even with the unicode61 tokenizer I would still not pick up "W. Lueck": that tokenizer only strips diacritics, so "Lück" folds to "Luck" but not to "Lueck".
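For what it's worth, the unicode61 behaviour is easy to check from Python's sqlite3 module, assuming an SQLite build that has FTS4 and the unicode61 tokenizer compiled in (the table and data here just mirror the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# unicode61 with its default remove_diacritics=1 strips accents at
# tokenization time, so 'Lück' and 'Luck' both index to the token 'luck',
# while 'Lueck' indexes to the distinct token 'lueck'.
con.execute("CREATE VIRTUAL TABLE MyTable USING fts4(author, tokenize=unicode61)")
con.executemany("INSERT INTO MyTable(author) VALUES (?)",
                [("W. Lück",), ("W. Lueck",), ("W. Luck",)])

def hits(query):
    rows = con.execute("SELECT author FROM MyTable WHERE author MATCH ?", (query,))
    return [author for (author,) in rows]

print(hits("Lück"))   # finds 'W. Lück' and 'W. Luck', but not 'W. Lueck'
print(hits("Lueck"))  # finds only 'W. Lueck'
```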
So what I do right now is keep an extra column "author_search_terms" in my table, which encodes the potential ASCII-fications of umlauts. For example:
if the author field is "W. Lück", then author_search_terms is "lück lueck luck";
if the author field is "W. Lueck", then author_search_terms is "lueck luck" (more generally, if the author field were "Gaeluepoe", then author_search_terms would be "Gaeluepoe Galupo");
if the author field is "W. Luck", then author_search_terms is "luck".
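The per-row variant generation above can be centralized in a single table-driven helper, so that adding a new ASCII-fication rule is a one-line change. Here is a sketch of that idea (the rule table, function name, and the fixed-point loop are my own, not the questioner's code; it also emits intermediate variants such as "galuepoe", which only makes matching more forgiving):

```python
# Each entry maps a character or digraph to its common ASCII-fications.
# Extending the scheme (e.g. for Danish ø) is just another dictionary entry.
ASCIFICATIONS = {
    "ü": ["ue", "u"],
    "ö": ["oe", "o"],
    "ä": ["ae", "a"],
    "ø": ["oe", "o"],
    # digraph forms found "in the wild" collapse further:
    "ue": ["u"],
    "oe": ["o"],
    "ae": ["a"],
}

def search_terms(name: str) -> str:
    """Return the space-separated variants to store in author_search_terms."""
    variants = {name.lower()}
    changed = True
    while changed:  # apply the rules until no new variant appears
        changed = False
        for v in list(variants):
            for src, repls in ASCIFICATIONS.items():
                if src in v:
                    for r in repls:
                        w = v.replace(src, r)
                        if w not in variants:
                            variants.add(w)
                            changed = True
    return " ".join(sorted(variants))

print(search_terms("Lück"))  # the stored terms include 'lueck' and 'luck'
```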
When I form a search query, I query against "author_search_terms" using the OR operator. For example, if the user wants to search for rows whose author field contains "Lueck", my application submits the query SELECT * FROM MyTable WHERE author_search_terms MATCH 'Lueck OR Luck'
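The query-side expansion can be driven by the same kind of rule table: collapse the user's input until a fixed point and join the variants with OR. A self-contained sketch (names and rules are illustrative, not the questioner's code):

```python
# Collapse rules applied to the user's query: each digraph or accented
# letter folds to its plain-ASCII form, mirroring what author_search_terms
# stores in the scheme described above.
COLLAPSE = {"ue": "u", "oe": "o", "ae": "a", "ü": "u", "ö": "o", "ä": "a", "ø": "o"}

def expand_query(term: str) -> str:
    """Turn e.g. 'Lueck' into the FTS query string 'lueck OR luck'."""
    variants = {term.lower()}
    changed = True
    while changed:
        changed = False
        for v in list(variants):
            for src, dst in COLLAPSE.items():
                w = v.replace(src, dst)
                if w not in variants:
                    variants.add(w)
                    changed = True
    # longest (least collapsed) variant first, purely for readability
    return " OR ".join(sorted(variants, key=len, reverse=True))

print(expand_query("Lueck"))    # lueck OR luck
print(expand_query("Foelner"))  # foelner OR folner
```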
For the moment this seems to work fine, at least for the cases I have come across, but of course it's a mess. Any suggestions for a better solution? It should be "scalable", i.e. it should be easy to add more non-unique ASCII-fications: to give a non-umlaut example, the Danish surname Følner is sometimes written as Folner or Foelner.
If you have experience with FTS4: would it be possible (and better) to write a custom tokenizer?