Tuesday, November 3, 2015

retrieve SQLite rows by rowid efficiently

I am using the C interface to SQLite and have some basic questions about the rowid field and how to efficiently retrieve data from an arbitrary set of rows with known rowids. I actually have several related questions, so I will call them out in bold as I go. But the main questions I have are at the end.

I have a table:

sqlite3_exec( db, "create table mytable ( value BLOB, value2 TEXT )", NULL, NULL, NULL );

and I fill it with 2.3 million rows. I also create two indexes on the table:

sqlite3_exec( db, "CREATE INDEX r_index ON mytable (rowid)", NULL, NULL, &errorMessage );

sqlite3_exec( db, "CREATE INDEX v_index ON mytable (value)", NULL, NULL, &errorMessage );

I am aware that the rowid index is unnecessary. I see that SQLite takes 0 sec to "create" the rowid index, and I believe this is because rowid is always an implicit existing "index" on a table, since the table is (usually?) stored in rowid order.
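One quick way to confirm that a separate rowid index buys nothing is to ask the planner directly with EXPLAIN QUERY PLAN. The sketch below uses Python's stdlib sqlite3 module (which wraps the same engine as the C interface) purely because it is self-contained; the same statement can be prepared through sqlite3_prepare_v2() in C.

```python
import sqlite3

# Check that a rowid lookup never needs a user-created index: the
# table's own b-tree is already keyed on rowid.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (value BLOB, value2 TEXT)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT value FROM mytable WHERE rowid = 1"
).fetchall()
# The detail column (index 3) reports something like:
#   SEARCH mytable USING INTEGER PRIMARY KEY (rowid=?)
print(plan[0][3])
```

The "USING INTEGER PRIMARY KEY" detail means the lookup descends the table b-tree itself, which is why creating r_index appears to cost nothing and adds nothing.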

In any case, what I want to be able to do is retrieve arbitrary sets of rows quickly from this table, by rowid. What I do is create an in-memory list of records:

class MyInMemoryIndexElement
{
public:
    sqlite3_int64 _rowId;
    MyKeyType _key;
};

vector<MyInMemoryIndexElement> inMemoryIndex;

rc = sqlite3_prepare_v2( db, "select rowid, value from mytable", -1, &stmt, NULL );

while ( sqlite3_step( stmt ) == SQLITE_ROW )
{
    MyInMemoryIndexElement e;
    e._rowId = sqlite3_column_int64( stmt, 0 );
    e._key = GetMyKeyFromValueBlob( sqlite3_column_blob( stmt, 1 ) );
    inMemoryIndex.push_back( e );
}

The loop above, which reads through all 2.3 million records and builds this in-memory vector, takes only 1.5 seconds (and could probably be made faster by preallocating space for the vector). In fact, when I skip the part about actually adding the record to the vector, the query alone takes only 0.95 sec. More surprising still, when I use sqlite3_exec() with a callback function instead of the prepare/step method, I can read all of the "value" blobs in the database in 0.55 sec.

I also found that without the index on the "value" column, these select statements take about 5 seconds longer. (Not my main question, but I don't understand why indexing the "value" column would make it faster to read the "value" from every row. Maybe the engine can read the values straight out of the index instead of having to read them from the table itself?)
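The "reads the values from the index" guess can be checked with EXPLAIN QUERY PLAN. A sketch, again via Python's stdlib sqlite3 for self-containment: v_index stores (value, rowid) pairs, so a query touching only those columns can be answered from the index alone (a covering index), while asking for value2 as well forces a scan of the wider table b-tree.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (value BLOB, value2 TEXT)")
con.execute("CREATE INDEX v_index ON mytable (value)")

# Covered: every requested column lives in v_index.
covered = con.execute(
    "EXPLAIN QUERY PLAN SELECT rowid, value FROM mytable"
).fetchall()[0][3]
# Not covered: value2 is only in the table, so v_index is useless here.
not_covered = con.execute(
    "EXPLAIN QUERY PLAN SELECT rowid, value, value2 FROM mytable"
).fetchall()[0][3]
print(covered)      # e.g. "SCAN mytable USING COVERING INDEX v_index"
print(not_covered)
```

Index entries are narrower than full table rows, so a covering-index scan touches fewer pages, which would account for the roughly 5-second difference.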

Another important comment is that when I step through that loop in the debugger, I see that the rows are processed in an unexpected order. I was thinking that I would get rowid 1 first, then rowid 2, and so on, since I'm not specifying anything about sorting, and I'm just asking it to give me all the rows one at a time. However, what I find is that the first rowid I get is somewhere in the 600,000's, and then the rowids jump around from there. So maybe that's because it's returning the rows in the order of the "value" index, which is some b-tree order that has nothing to do with the physical record / rowid order?
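That guess matches SQLite's documented contract: without an ORDER BY, rows come back in whatever order the chosen scan produces (here, plausibly v_index order), and that order is not guaranteed. A small sketch in Python's stdlib sqlite3, with values deliberately inserted out of order, shows that ORDER BY rowid is what pins the traversal down:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (value BLOB, value2 TEXT)")
# Insert values in descending order so that index order != rowid order.
con.executemany(
    "INSERT INTO mytable (value) VALUES (?)",
    [(bytes([b]),) for b in (9, 7, 5, 3, 1)],
)
con.execute("CREATE INDEX v_index ON mytable (value)")

# ORDER BY rowid makes the row order explicit instead of planner-dependent.
rowids = [r[0] for r in con.execute("SELECT rowid FROM mytable ORDER BY rowid")]
print(rowids)  # [1, 2, 3, 4, 5]
```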

Anyway, so now I have this index in memory, and at various times in the program I want to walk through that in-memory index, check the _key of each entry, and if that _key has certain properties, fetch the "value" for that row. So I have a loop:

sqlite3_stmt *stmt;
rc = sqlite3_prepare_v2( db, "select value from mytable where rowid = ?", -1, &stmt, NULL );

for ( size_t i = 0 ; i < inMemoryIndex.size() ; i++ )
{
    if ( MySpecialFunction( inMemoryIndex[ i ]._key ) )
    {
        sqlite3_reset( stmt );
        sqlite3_clear_bindings( stmt );
        sqlite3_bind_int64( stmt, 1, inMemoryIndex[ i ]._rowId );

        if ( sqlite3_step( stmt ) == SQLITE_ROW )
        {
            const void *v = sqlite3_column_blob( stmt, 0 );
            DoWhatIWantWithV( v );
        }
    }
}
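One common way to cut the per-row reset/bind/step overhead in a loop like this is to look rowids up in batches with an IN (...) list, sorted ascending so the b-tree is walked near-sequentially. A sketch of the idea in Python's stdlib sqlite3 (the rowid values and batch size are made up for the demo); in the C interface the same approach means composing the IN list per batch, or binding a whole array at once via SQLite's optional carray extension:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (value BLOB, value2 TEXT)")
con.executemany(
    "INSERT INTO mytable (value) VALUES (?)",
    [(i.to_bytes(4, "little"),) for i in range(100)],
)

# Hypothetical rowids that passed the filter (rowids start at 1).
wanted = [97, 5, 42, 17, 63]
wanted.sort()  # ascending rowids -> near-sequential b-tree access

results = {}
BATCH = 3  # tiny for the demo; a few hundred is more typical
for i in range(0, len(wanted), BATCH):
    chunk = wanted[i : i + BATCH]
    placeholders = ",".join("?" * len(chunk))
    # One statement fetches a whole batch instead of one row per step.
    for rowid, value in con.execute(
        f"SELECT rowid, value FROM mytable WHERE rowid IN ({placeholders})",
        chunk,
    ):
        results[rowid] = value
```

Selecting rowid alongside value matters because IN gives no guarantee that results come back in the order the rowids were bound.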

Unfortunately (and here we get to my main question), that loop takes about 1.6 seconds to run in the case that about 14,000 out of the 2.3 million records pass the MySpecialFunction() test. That is, it takes about 1.6 seconds to read 14,000 records, whereas it took only 0.55 seconds to read all 2.3 million records.

Because of the strange rowid ordering mentioned above, I did try sorting inMemoryIndex by rowid. This made it run in about 1.3 seconds instead of 1.6.

So my main question is:

I am able to use the statement/step to select every "value" blob in the 2.3 million row database in 0.95 sec (and in fact if I use the sqlite3_exec() method with a callback I can do it in 0.55 sec).

I went to the trouble of creating my inMemoryIndex vector because in most cases at any given time I only want records for a small subset of the 2.3 million rows, for instance 14,000 of them. So I thought if I knew these 14,000 rowid's I could "just read those rows". But when I do that with the

"select value from mytable where rowid = ?"

statement iteratively binding to each of the known rowid's, it takes 1.6 seconds, significantly longer than reading every row in the database.

So:

(1) Is there a small change I could make to this approach (e.g., some other index, order of operations, etc.) that could speed it up?

(2) Is there something fundamentally flawed about this way of doing things?

(I should comment that I do realize that creating my own in-memory index like this goes against the idea that I should leave query planning up to the SQL engine itself. I'm doing it this way because in general my logic for deciding which records I'm interested in at a given time -- as expressed by MySpecialFunction() in the code above -- is more complex than I think I can express in SQL. I'm open to the idea that I need to reconsider that. But for now my question is just about the fact that it seems surprising that it takes so much longer to read 14k records with known rowid's than it takes to read all 2.3 million records.)
