Friday, April 3, 2015

Storing a large table on disk, with fast retrieval of a specified subset into a np.ndarray

I need to store a table on disk and be able to retrieve a subset of that table into a numpy.ndarray very fast. What's the best way to do that? I don't mind spending time preprocessing the dataset before storing it on disk, since it won't change after it's stored.


I am considering HDF5, sqlite, numpy memmap, a custom binary format, or anything else really (even a commercial app). I'd prefer not to write any C code and instead rely on existing Python libraries.
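As a minimal sketch of the HDF5 route via pandas.HDFStore (file name, column names, and sizes here are hypothetical, with randomly generated data for illustration), the identifier column can be declared a data column so that subsets are filtered on disk rather than by loading the whole table:

    import numpy as np
    import pandas as pd

    # Hypothetical toy data: 'ident' plays the role of the identifier column.
    n = 100000
    df = pd.DataFrame({
        "ident": np.repeat(np.arange(100), 1000).astype(str),
        "x": np.random.rand(n),
        "y": np.random.rand(n),
    })

    # One-time preprocessing: store in queryable 'table' format and make
    # 'ident' a data column so the where clause can filter on it.
    with pd.HDFStore("table.h5", mode="w") as store:
        store.put("data", df, format="table", data_columns=["ident"])

    # Retrieval: '=' against a list is a membership test in HDFStore queries.
    wanted = ["3", "17", "42"]
    with pd.HDFStore("table.h5", mode="r") as store:
        subset = store.select("data", where="ident=wanted")

    arr = subset[["x", "y"]].values  # numeric columns as a numpy.ndarray

This route requires the PyTables package under the hood; whether it is fast enough will depend on how selective the where clause is, but it avoids reading the full table into memory either way.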


Details:


~100 million rows, ~5 columns of float and str data. One of the columns contains about 100,000 distinct identifiers (so there are roughly 1,000 rows per identifier; see the sketch below for one way to exploit this grouping). The subset to be retrieved is always specified by a set of identifiers: usually I need to retrieve ~2,000 identifiers, i.e. ~2% of the entire dataset.


Python 3.4, Linux, SSD drive (so random access is roughly as fast as sequential access).
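Given that the rows group naturally by identifier, the data never changes, and random access is cheap on an SSD, one preprocessing option (a sketch with hypothetical names and randomly generated toy data) is to sort the rows by identifier once, write the numeric columns to a raw binary file, and keep a small side index mapping each identifier to its contiguous row range; retrieval then reduces to slicing a numpy memmap:

    import numpy as np

    # --- one-time preprocessing (hypothetical toy data) ---
    n = 1000000
    ids = np.random.randint(0, 10000, size=n)   # identifier column
    values = np.random.rand(n, 4)               # four float columns

    order = np.argsort(ids, kind="mergesort")   # stable sort by identifier
    ids_sorted = ids[order]
    values[order].tofile("values.bin")          # rows now contiguous per id

    # Side index: identifier -> (start, stop) row range in the sorted file.
    uniq, starts = np.unique(ids_sorted, return_index=True)
    stops = np.append(starts[1:], n)
    index = dict(zip(uniq.tolist(), zip(starts.tolist(), stops.tolist())))

    # --- retrieval: slice the memmap for the requested identifiers ---
    mm = np.memmap("values.bin", dtype=np.float64, mode="r").reshape(-1, 4)
    wanted = [3, 17, 42]
    subset = np.concatenate([mm[index[i][0]:index[i][1]] for i in wanted])

On an SSD this touches only the ~2% of the file actually requested. The str column would need its own fixed-size encoding (e.g. a fixed-width bytes dtype or an integer lookup table), since memmap only handles fixed-size dtypes.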

