sqlite: SQLite query (and subqueries) for lagging a column by group

jeudi 5 mars 2015

SQLite query (and subqueries) for lagging a column by group

I have a SQLite table myTable with some 3.2 billion records. It has records of workers's status (0 or 1) in companies over time. The same worker can be in different companies. The table looks like this:


companyID   workerID    timeVar     workerStatus
-------------------------------------
1           1           1           0
1           1           2           0
1           1           3           0
1           1           4           1
-------------------------------------
1           2           3           1
1           2           4           1
1           2           5           0
1           2           6           0
-------------------------------------
2           1           1           1
2           1           2           1
2           1           3           1
2           1           4           0
-------------------------------------
2           2           1           1
2           2           2           1
2           2           3           0
2           2           4           0
2           2           5           0
2           2           6           1
-------------------------------------

Note that companyID, workerID, timeVar are keys.

I want to create two columns:

workerStatusMaxLag1 will say whether workerStatus in the immediate previous timeVar was equal to 1.

workerStatusMaxLag2 will say whether workerStatus in the two immediate previous timeVar was equal to 1.

To be clear, this is what I am trying to achieve:


companyID   workerID    timeVar     workerStatus    workerStatusMaxLag1     workerStatusMaxLag2
1           1           1           0               NULL                    NULL
1           1           2           0               1                       1
1           1           3           0               1                       1
1           1           4           1               1                       1
-----------------------------------------------------------------------------
1           2           3           1               1                       1
1           2           4           1               1                       1
1           2           5           0               1                       1
1           2           6           0               0                       1
-----------------------------------------------------------------------------
2           1           1           1               NULL                    NULL
2           1           2           1               1                       1
2           1           3           1               1                       1
2           1           4           0               1                       1
-----------------------------------------------------------------------------
2           2           1           1               NULL                    NULL
2           2           2           1               1                       1   
2           2           3           0               1                       1
2           2           4           0               1                       1
2           2           5           0               1                       1
2           2           6           1               0                       1
-----------------------------------------------------------------------------

Note

Workers can move across companies, and that the two new columns in this table vary only within workerID and timeVar, but not across companyID, save for the fact that timeVar can be different for each combination of company x worker.

I am working in R, but given that my data is so large, I think I am safer if I do all of this with SQL statements alone, and not load (slices of) the data into R.

What have I tried

I am new to SQL and I have tried to give some baby steps, but have not gone too far. I think the first step in building my query is to


SELECT companyID, workerID, timeVar, MAX(workerStatus)
FROM myTable
GROUP BY companyID, workerID, timeVAR

So that I know workerStatus in each period. Then I thought of joining this back to the table, but first I need to calculate the lags.

I have looked but didn't go very far. Also, there might be different approaches and performance may vary considerably. While I don't need top performance (I will do this one time only), I do have a pretty heavy table (3.2 billion rows, or a 100GB+ table worth of data), and I just don't want my machine to crash (I'm on Windows 7, 16GB RAM, Intel i7-2600).

sqlite

jeudi 5 mars 2015

SQLite query (and subqueries) for lagging a column by group

Note

What have I tried

Aucun commentaire:

Enregistrer un commentaire