Wednesday, September 16, 2015

More Efficient SQLite Inner Join and Multithreading

I am having efficiency problems with my SQLite inner join on 10 GB of data per dataset.

I am running the following:

CREATE TABLE merged AS 
  SELECT * FROM xrootd 
  INNER JOIN condor ON 
     SUBSTR(xrootd.[U.mAppInfo], 0, INSTR(xrootd.[U.mAppInfo], '_')) == 
        SUBSTR(condor.User, 6, INSTR(condor.User, '@'))
    AND abs(xrootd.[F.mOpenTime] - condor.JobStartDate) <= 60;
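
To see where the time goes, I think the query plan can be inspected without running the full join; my understanding (which may be wrong) is that SQLite cannot use any index for the SUBSTR/INSTR expressions, so it ends up rescanning condor for every xrootd row:

EXPLAIN QUERY PLAN
  SELECT * FROM xrootd 
  INNER JOIN condor ON 
     SUBSTR(xrootd.[U.mAppInfo], 0, INSTR(xrootd.[U.mAppInfo], '_')) == 
        SUBSTR(condor.User, 6, INSTR(condor.User, '@'))
    AND abs(xrootd.[F.mOpenTime] - condor.JobStartDate) <= 60;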

I left it running over the weekend without the time constraint on the join, but it didn't finish. I stopped it and added the time constraint because I was concerned that the join might be generating a huge amount of output, which would explain the long run time.

The SQL columns look like this:

sqlite> SELECT [U.mAppInfo], [F.mOpenTime] FROM xrootd LIMIT 5;
U.mAppInfo|F.mOpenTime
drberry_ST_t-channel_4f_leptonDecays_13TeV_1fv53a_156_1_vrm|1439703600.0
drberry_QCD_Pt_600to800_TuneCUETP8M1_13TeV_7ys53a_79_1_oLf|1439705100.0
drberry_ST_t-channel_4f_leptonDecays_13TeV_1fv53a_176_1_uwo|1439703660.0
drberry_QCD_Pt_800to1000_TuneCUETP8M1_13TeV_r5d9t3_72_1_6S1|1439705050.0

sqlite> SELECT User, JobStartDate FROM condor LIMIT 5;
uscms5616@cms|1439764892.0
uscms5050@cms|1439615574.0
uscms5111@cms|1439965560.0
uscms3850@cms|1440081782.0
uscms3850@cms|1440081527.0
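
As a sanity check on the join keys, this is roughly how I would inspect what the SUBSTR/INSTR expressions actually extract (same columns as above; I have not verified the results here):

sqlite> SELECT SUBSTR([U.mAppInfo], 0, INSTR([U.mAppInfo], '_')) FROM xrootd LIMIT 5;
sqlite> SELECT SUBSTR(User, 6, INSTR(User, '@')) FROM condor LIMIT 5;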

Is there a better way to do this? I have tried plain Bash (abandoned because it turned out not to be so simple) and Python (memory overflow). I can also get at the data in CERN's ROOT TTree format, but SQLite seemed better suited for this.

My ideas

  • Make new tables with the pre-split strings, i.e. all the SUBSTR(..., INSTR(...)) expressions (a rough sketch follows after this list)
  • Get a unique list of Users that exist in both tables and run two inner joins
  • Same as above but for times
  • SQLite Multithreading if there is a simple way to do this
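
For the first idea, here is a rough, untested sketch of what I have in mind (the *_keys table names and the index name are made up; everything else reuses the columns above). The point is that the extracted key and the time become plain columns, so an ordinary index can drive the join:

-- Extract the join keys once per row.
-- rowid assumes xrootd and condor are ordinary rowid tables.
CREATE TABLE xrootd_keys AS
  SELECT rowid AS xrootd_id,
         SUBSTR([U.mAppInfo], 0, INSTR([U.mAppInfo], '_')) AS user_key,
         [F.mOpenTime] AS open_time
  FROM xrootd;

CREATE TABLE condor_keys AS
  SELECT rowid AS condor_id,
         SUBSTR(User, 6, INSTR(User, '@')) AS user_key,
         JobStartDate AS start_time
  FROM condor;

-- Index so each xrootd row only touches the matching user/time range.
CREATE INDEX condor_keys_idx ON condor_keys (user_key, start_time);

CREATE TABLE merged AS
  SELECT x.xrootd_id, c.condor_id, x.user_key, x.open_time, c.start_time
  FROM xrootd_keys AS x
  INNER JOIN condor_keys AS c
    ON c.user_key = x.user_key
   AND c.start_time BETWEEN x.open_time - 60 AND x.open_time + 60;

The xrootd_id/condor_id columns could then be joined back to the original tables to pull in the remaining columns.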

I would prefer not to run these for days without knowing which might work better, but I am leaving the current query running in the meantime.
