I have an SQLite DB with a single table for storing Apache HTTP logs.
I import the Apache log text files with a Python script that converts the timestamp to a unix-epoch integer before inserting.
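
For reference, the import loop is essentially this (a simplified sketch; the real script handles more fields, and the column names other than unix_epoch_timestamp are just toy names):

    import re
    import sqlite3
    from datetime import datetime

    # Matches Apache's common log format, e.g.:
    # 127.0.0.1 - frank [10/Oct/2014:13:55:36 +0000] "GET / HTTP/1.0" 200 2326
    LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

    def parse(line):
        m = LINE_RE.match(line)
        if m is None:
            return None  # skip malformed lines
        host, ts, request, status, size = m.groups()
        # Convert the Apache timestamp to a unix-epoch integer
        epoch = int(datetime.strptime(ts, '%d/%b/%Y:%H:%M:%S %z').timestamp())
        return (epoch, host, request, int(status), 0 if size == '-' else int(size))

    conn = sqlite3.connect('logs.db')
    conn.execute('CREATE TABLE IF NOT EXISTS logs ('
                 'unix_epoch_timestamp INTEGER, host TEXT, request TEXT, '
                 'status INTEGER, size INTEGER)')
    with open('access.log') as f:
        conn.executemany('INSERT INTO logs VALUES (?, ?, ?, ?, ?)',
                         (row for row in map(parse, f) if row is not None))
    conn.commit()
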
To prevent duplicate lines being stored, I could either create a unique constraint across all columns, or I could add a hash column storing an MD5 of the whole line.
I've tried the first idea, a unique constraint across all columns, and it works, but it makes the DB file rather large.
My preferred solution is an MD5 hash column. But when I do this, inserts appear to get increasingly slow as the table fills.
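
Concretely, the hash approach looks like this (same toy schema as above; only md5_hash and the constraint matter here):

    import hashlib

    def md5_of(line):
        # Hash the raw log line so identical lines map to identical keys
        return hashlib.md5(line.encode('utf-8')).hexdigest()

    # conn is the sqlite3 connection from the sketch above
    conn.execute('''CREATE TABLE logs (
        unix_epoch_timestamp INTEGER,
        host TEXT,
        request TEXT,
        status INTEGER,
        size INTEGER,
        md5_hash TEXT,
        UNIQUE (md5_hash) ON CONFLICT FAIL
    )''')
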
FYI the dataset I'm testing with is a little over 8 million lines.
The really weird thing (weird to me anyway) is that if I create a compound unique constraint on unix_epoch_timestamp and md5_hash, the inserts don't noticeably slow down.
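
So the only change from the slow schema is the constraint (again with the toy columns):

    conn.execute('''CREATE TABLE logs (
        unix_epoch_timestamp INTEGER,
        host TEXT,
        request TEXT,
        status INTEGER,
        size INTEGER,
        md5_hash TEXT,
        UNIQUE (unix_epoch_timestamp, md5_hash) ON CONFLICT FAIL
    )''')
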
So I've got a work-around for my problem! But I'm intrigued as to what's causing this.
I'm using SQLite 3.8.2 on Ubuntu 14.04 with an old Core-i5 with 4GB RAM.
With 8-millionish rows:
UNIQUE (unix_epoch_timestamp, md5_hash) ON CONFLICT FAIL takes approx. 15 minutes to import.
UNIQUE (md5_hash) ON CONFLICT FAIL takes a very long time. I've never run it to the end, but I can't see it taking less than an hour.