jeudi 5 mars 2015

SQLite query (and subqueries) for lagging a column by group

I have a SQLite table myTable with some 3.2 billion records. It has records of workers's status (0 or 1) in companies over time. The same worker can be in different companies. The table looks like this:



companyID workerID timeVar workerStatus
-------------------------------------
1 1 1 0
1 1 2 0
1 1 3 0
1 1 4 1
-------------------------------------
1 2 3 1
1 2 4 1
1 2 5 0
1 2 6 0
-------------------------------------
2 1 1 1
2 1 2 1
2 1 3 1
2 1 4 0
-------------------------------------
2 2 1 1
2 2 2 1
2 2 3 0
2 2 4 0
2 2 5 0
2 2 6 1
-------------------------------------


Note that companyID, workerID, timeVar are keys.


I want to create two columns:




  1. workerStatusMaxLag1 will say whether workerStatus in the immediate previous timeVar was equal to 1.




  2. workerStatusMaxLag2 will say whether workerStatus in the two immediate previous timeVar was equal to 1.




To be clear, this is what I am trying to achieve:



companyID workerID timeVar workerStatus workerStatusMaxLag1 workerStatusMaxLag2
1 1 1 0 NULL NULL
1 1 2 0 1 1
1 1 3 0 1 1
1 1 4 1 1 1
-----------------------------------------------------------------------------
1 2 3 1 1 1
1 2 4 1 1 1
1 2 5 0 1 1
1 2 6 0 0 1
-----------------------------------------------------------------------------
2 1 1 1 NULL NULL
2 1 2 1 1 1
2 1 3 1 1 1
2 1 4 0 1 1
-----------------------------------------------------------------------------
2 2 1 1 NULL NULL
2 2 2 1 1 1
2 2 3 0 1 1
2 2 4 0 1 1
2 2 5 0 1 1
2 2 6 1 0 1
-----------------------------------------------------------------------------


Note


Workers can move across companies, and that the two new columns in this table vary only within workerID and timeVar, but not across companyID, save for the fact that timeVar can be different for each combination of company x worker.


I am working in R, but given that my data is so large, I think I am safer if I do all of this with SQL statements alone, and not load (slices of) the data into R.


What have I tried


I am new to SQL and I have tried to give some baby steps, but have not gone too far. I think the first step in building my query is to



SELECT companyID, workerID, timeVar, MAX(workerStatus)
FROM myTable
GROUP BY companyID, workerID, timeVAR


So that I know workerStatus in each period. Then I thought of joining this back to the table, but first I need to calculate the lags.


I have looked but didn't go very far. Also, there might be different approaches and performance may vary considerably. While I don't need top performance (I will do this one time only), I do have a pretty heavy table (3.2 billion rows, or a 100GB+ table worth of data), and I just don't want my machine to crash (I'm on Windows 7, 16GB RAM, Intel i7-2600).


Aucun commentaire:

Enregistrer un commentaire