I have an SQLite DB with a single table for storing Apache HTTP logs.
I import the Apache log text files with a Python script that converts the timestamp to a unix-epoch integer before inserting.
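
For reference, the import loop is essentially this (a simplified sketch; the real script handles more fields, and the column names other than unix_epoch_timestamp are just toy names):

    import re
    import sqlite3
    from datetime import datetime

    # Matches Apache's common log format, e.g.:
    # 127.0.0.1 - frank [10/Oct/2014:13:55:36 +0000] "GET / HTTP/1.0" 200 2326
    LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)')

    def parse(line):
        m = LINE_RE.match(line)
        if m is None:
            return None  # skip malformed lines
        host, ts, request, status, size = m.groups()
        # Convert the Apache timestamp to a unix-epoch integer
        epoch = int(datetime.strptime(ts, '%d/%b/%Y:%H:%M:%S %z').timestamp())
        return (epoch, host, request, int(status), 0 if size == '-' else int(size))

    conn = sqlite3.connect('logs.db')
    conn.execute('CREATE TABLE IF NOT EXISTS logs ('
                 'unix_epoch_timestamp INTEGER, host TEXT, request TEXT, '
                 'status INTEGER, size INTEGER)')
    with open('access.log') as f:
        conn.executemany('INSERT INTO logs VALUES (?, ?, ?, ?, ?)',
                         (row for row in map(parse, f) if row is not None))
    conn.commit()
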
To prevent duplicate lines being stored, I could either create a unique constraint across all columns, or I could add a hash column storing an MD5 of the whole line.
I've tried the first idea, a unique constraint across all columns, and it works, but it makes the DB file rather large.
My preferred solution is an MD5 hash column. But when I do this, inserts appear to get increasingly slow as the table fills.
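
Concretely, the hash approach looks like this (same toy schema as above; only md5_hash and the constraint matter here):

    import hashlib

    def md5_of(line):
        # Hash the raw log line so identical lines map to identical keys
        return hashlib.md5(line.encode('utf-8')).hexdigest()

    # conn is the sqlite3 connection from the sketch above
    conn.execute('''CREATE TABLE logs (
        unix_epoch_timestamp INTEGER,
        host TEXT,
        request TEXT,
        status INTEGER,
        size INTEGER,
        md5_hash TEXT,
        UNIQUE (md5_hash) ON CONFLICT FAIL
    )''')
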
FYI the dataset I'm testing with is a little over 8 million lines.
The really weird thing (weird to me anyway) is that if I create a compound unique constraint on unix_epoch_timestamp and md5_hash, the inserts don't noticeably slow down.
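
So the only change from the slow schema is the constraint (again with the toy columns):

    conn.execute('''CREATE TABLE logs (
        unix_epoch_timestamp INTEGER,
        host TEXT,
        request TEXT,
        status INTEGER,
        size INTEGER,
        md5_hash TEXT,
        UNIQUE (unix_epoch_timestamp, md5_hash) ON CONFLICT FAIL
    )''')
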
So I've got a work-around for my problem! But I'm intrigued as to what's causing this.
I'm using SQLite 3.8.2 on Ubuntu 14.04 with an old Core-i5 with 4GB RAM.
With 8-millionish rows:
UNIQUE (unix_epoch_timestamp, md5_hash) ON CONFLICT FAIL takes approx. 15 minutes to import.
UNIQUE (md5_hash) ON CONFLICT FAIL takes a very long time. I've never run it to the end, but I can't see it taking less than an hour.