I am working on a research application that has to store information about a large number of objects. The way the data is structured, every object has a (non-unique) name as well as an arbitrary number of properties. Each property is either a string or an integer, together with an ID that defines the type of property. For example, most objects have a number of related objects that are referenced by their integer ID plus a property ID that defines the type of reference, such as Wikipedia:ArticleReference or User:ManualReference. As you can see, each type of property is defined by a data-providing module in the application, so the database itself has to be flexible and can't just use a fixed set of relational tables.
The smaller dataset I work on includes 8,802,271 objects and 29,534,326 properties, but most of the commonly occurring properties are only present on a small share of the objects, so I expect the actual number of properties in the full set to be closer to 2-3 billion. Over time the number of objects will grow to several hundred million, possibly over a billion, and as more data sources are added, the ratio of objects to properties will move from 1:200 - 1:300 to 1:500 - 1:1500 or more.
In the beginning I stored the data in a simple custom format, which was fast to insert into and easy to manage but became hard to search pretty early on, so I moved to an SQLite database. The current set of tables looks like this:
create table objects (id integer primary key, type integer, name varchar(255))
create table intdata (objectid unsigned bigint, propertytype bigint, value unsigned bigint)
create table chardata (objectid unsigned bigint, propertytype bigint, value varchar(255))
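To make the model concrete, a single object and its properties end up spread across these tables roughly like this (the IDs and values below are made up purely for illustration):

insert into objects values (42, 1, 'Example Object');
-- an integer property: a reference to object 1337, where property type 10
-- could stand for something like Wikipedia:ArticleReference
insert into intdata values (42, 10, 1337);
-- a string property, e.g. a label or an external identifier
insert into chardata values (42, 20, 'some string value');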
There is an index on most fields, which is added once I switch from inserting new data to serving the frontend user interface. SQLite is configured to not keep a journal, to use multiple threads, and to allocate most of the available memory as page cache.
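In SQL terms the setup is roughly the following; the index names and the cache size are just what I happen to use on my machine, so treat them as placeholders:

-- indexes created after the bulk load finishes
create index idx_objects_name on objects(name);
create index idx_intdata_objectid on intdata(objectid);
create index idx_intdata_value on intdata(value);
create index idx_chardata_objectid on chardata(objectid);

-- per-connection configuration
pragma journal_mode = off;    -- no rollback journal
pragma threads = 4;           -- allow worker threads for sorting
pragma cache_size = -4000000; -- negative value = KiB, i.e. roughly 4 GB of page cache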
The primary use case will be searching for an object by name and then browsing its properties and connected objects, which is all done via the primary key. My main concern is the speed of the initial search. With the comparatively small amount of data I work on and an index on the name column, it already takes about five seconds to do a simple select by name on my development machine (3rd gen i7, 6 GB of RAM, system and DB on separate SSDs, DB already cached in memory). Loading the actual data for an ID selected from the results happens almost instantly. Insert performance is not as important, since the data only arrives at a few thousand items per second, which any reasonable database can handle.
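For reference, the slow step is essentially a query of this shape (the name literal is only an example), and this is how I would check whether the index is actually used:

-- the initial lookup that takes around five seconds
select id, type, name from objects where name = 'SomeName';

-- inspecting the plan to confirm the name index is picked up
explain query plan select id from objects where name = 'SomeName';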
The current file is about 2.6 GB in size, which is around 50% larger than the custom solution. I'd like to keep it relatively small so that I can keep it on an SSD on my machine while still working on the backend. If needed I can also set up a dedicated server with at least an i5, 8 GB of RAM and a few TB of HDD just from parts I have lying around. I can also get Windows and SQL Server licenses for any version through Dreamspark.
These are a few ideas I have had so far:
- Add a custom tree-based index for the names
- Abandon relational DBs and use a system like MongoDB plus a Name-to-ID index, since all references are by ID and the data is per-object (see the sketch after this list for roughly what I mean by the Name-to-ID part)
- Am I doing text indexes wrong?
- I have read a wide variety of things about SQLite hitting a performance wall on large DBs. How much of that applies to my use case?
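To spell out the Name-to-ID part of the second idea, I am picturing a separate lookup structure along these lines (table and column names are purely illustrative):

-- a standalone name -> object ID lookup, kept next to the per-object store
create table name_index (name varchar(255), objectid unsigned bigint);
create index idx_name_index_name on name_index(name);

-- the only query it has to answer before the per-object data is fetched by ID
select objectid from name_index where name = 'SomeName';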
Please don't restrict your answers to SQLite or the ideas above. I am willing to learn and use any other system. I actually only used SQLite because I wanted to learn more about it and it is fairly fast for building smaller datasets. The only restriction is that I need some non-hacky way to connect it to a .NET application.
Thanks for your help.