I'm modifying a prototype of a system that generates heuristics from various data points. Right now I'm working on the part of the code that reads in CSV files containing a list of entities and counters for those entities. Each file will contain an entity only once; however, we will be reading in hundreds of files, and duplicates can exist across files. The program should sum all counters for an entity and save the entity and its total count to SQL. Currently they do this with an in-memory hash, incrementing the counter for an entity each time it's seen, until they have a final sum once all files have been read.
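For concreteness, my mental model of the current accumulation step is roughly the sketch below. It's only a sketch: the CSV layout, the hash name, and the file handling are my assumptions, not the real code.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch of the existing approach: a single in-memory hash that sums
    # the counter every time an entity is seen, across all input files.
    my %total_for;

    for my $csv_file (@ARGV) {
        open my $fh, '<', $csv_file or die "Cannot open $csv_file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my ($entity, $count) = split /,/, $line;   # assumed "entity,count" rows
            $total_for{$entity} += $count;             # cross-file duplicates summed here
        }
        close $fh;
    }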
Once the counts are generated, it writes out files containing SQL upsert statements to update the database. It then calls the sqlite3 program with the output file to generate/update the database, and we create a new database for every single day. Shockingly, they have scaling problems trying to collect massive amounts of data with this single, poorly optimized Perl application and SQLite! However, I'm not authorized to spend time tweaking this part of the code yet, as much as it bugs me.
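The output stage amounts to something like the following. The table name, column names, and the per-day database name are placeholders I've invented; the only parts taken from the real setup are "one upsert per entity" and "feed the file to sqlite3".

    use strict;
    use warnings;

    my %total_for = ( 'example-entity' => 1234 );   # stand-in for the hash built above
    my $sql_file  = 'upserts.sql';
    my $db_file   = 'counts_YYYY-MM-DD.db';          # a fresh database per day

    # Write one upsert per entity, carrying its final summed count.
    open my $out, '>', $sql_file or die "Cannot write $sql_file: $!";
    print {$out} "CREATE TABLE IF NOT EXISTS entity_counts (entity TEXT PRIMARY KEY, total INTEGER);\n";
    for my $entity (sort keys %total_for) {
        print {$out} "INSERT OR REPLACE INTO entity_counts (entity, total) "
                   . "VALUES ('$entity', $total_for{$entity});\n";
    }
    close $out;

    # Hand the statement file to the sqlite3 command-line tool.
    system("sqlite3 $db_file < $sql_file") == 0
        or die "sqlite3 failed: $?";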
I have to make significant changes to the CSV-reading code and to the format of the SQL files it generates. As part of this change I'm considering doing away with the current in-memory approach and replacing it with the 'dumb' solution of writing an upsert statement to the SQL file every time an entity is seen, even if this results in multiple upserts being run against a single entity. I want to do this partly to remove some really ugly Perl code (I feel dirty every time I see it, and if they won't let me refactor it properly I'd rather find an excuse to toss it entirely), but also because it may make it easier to parallelize the code later, when/if I'm authorized to do so. Basically, I think the code will be more maintainable with the dumb solution, and thus I would like to be dumb!
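The dumb variant I'm picturing would drop the hash entirely and emit an accumulating upsert for every row as it's read, letting SQLite do the summing. Again just a sketch, reusing the same invented table and column names as above:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # 'Dumb' variant: no in-memory state; every CSV row becomes its own upsert.
    my $sql_file = 'upserts.sql';

    open my $out, '>', $sql_file or die "Cannot write $sql_file: $!";
    print {$out} "CREATE TABLE IF NOT EXISTS entity_counts (entity TEXT PRIMARY KEY, total INTEGER);\n";

    for my $csv_file (@ARGV) {
        open my $fh, '<', $csv_file or die "Cannot open $csv_file: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my ($entity, $count) = split /,/, $line;
            # Accumulate inside SQLite instead of in Perl: add this row's count
            # to the stored total (0 if the entity has not been seen yet).
            print {$out} "INSERT OR REPLACE INTO entity_counts (entity, total) VALUES ('$entity', "
                       . "COALESCE((SELECT total FROM entity_counts WHERE entity = '$entity'), 0) + $count);\n";
        }
        close $fh;
    }
    close $out;

The COALESCE trick is one common way to get an "add to the existing count" upsert in SQLite without keeping any state on the Perl side, which is what would make splitting the input across multiple independent processes look feasible later.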
For now I'm trying to get a feel for what performance differences, if any, I may introduce by going dumb. I know the standard answer here is to try it out and see, but the problem is that I don't have access to much of the data yet, so I can't run a realistic test. Worse, I'm having a hard time getting the characteristics of the data we will be parsing, particularly how common duplicate entities across files are, because we are going to be running on different/more data points soon. Since each SQLite database covers just one day, it will, somewhat ironically, likely contain fewer entities than the in-memory hash would. Our overall heuristics run (this being just a small part of it) is taking long enough that they are a little worried about scaling: not enough to give me free rein to properly parallelize the prototype, but enough that I can't de-optimize it unless I can show it won't do much harm.
Lacking the ability to do a real performance test, I'm wondering if someone can give me a general ball-park idea of the performance difference between the hash and multiple-upsert options, to help me decide whether it's worth pushing for. My off-the-cuff feeling is that the CPU expense will be about the same, maybe slightly worse for multiple upserts simply because of the limitations of SQLite compared to a real database. I'm assuming the real expense is the time lost to IO from writing and reading an extra SQL line per duplicate in the SQL files? If IO is the main expense, then I'm inclined to go with my dumb solution unless duplicates prove to be absurdly common: I don't think the IO expense will be high enough to matter, and it can easily be offset by simple parallelizing, i.e. running multiple instances of the application on different inputs. However, I don't know whether there are reasons this would be noticeably slower for SQLite than I expect. Are there any performance tradeoffs I'm unaware of that are large enough to compete with the IO cost?