jeudi 21 mai 2015

How to get Employee --> Manager without separate tables (pandas or SQL)?

Is there a code to do this in pandas simply? Or perhaps I should try SQL (I am not very familiar with SQL). Here is what I have so far (fake example, the real one has about 20,000 individuals)

employee_df =

    EmpID   Name    Title   ManagerID
    abc123   John   Head    pqr456
    pqr456   Jake   VP      bs92999
    zyx987   Jill   Lead    abc123
    bs92999  Bob    SVP     NaN

Duplicate the dataframe: manager_df = employee_df

roster = pd.merge(manager_df, employee_df, how='outer', left_on ='ManagerID', right_on = 'EmpID')

My output is messy, although seemingly correct (it is telling me who the manager is without having separate tables by referring to itself).

    EmpID_x Name_x  Title_x ManagerID_x  EmpID_y Name_y Title_y ManagerID_y
    abc123  John    Head      pqr456    pqr456   Jake       VP          bs92999
    pqr456  Jake    VP        bs92999   bs92999  Bob        SVP         NaN
    zyx987  Jill    Lead      abc123    abc123   John       Head        pqr456
    bs92999 Bob     SVP       NaN       NaN      NaN        NaN        NaN
    NaN     NaN     NaN       NaN       zyx987   Jill       Lead        abc123

The most common desired output would be:

EmpID | Name | Title | Manager_Name

But sometimes I would need another level too (the boss of the boss), with a max potential of about 5 layers which seems crazy but there is a lot of hierarchy - such high layers is not necessary all of the time but I would like to be able to roll this data up to a higher level if needed:

EmpID | Name | Title | Manager_Name_1 | Manager_Name_2

The third dataframe is reporting_df:

EmpID | ManagerID | StartDate | EndDate

Sometimes there are management changes that happen even mid-month and the results should reflect on the manager who 'owned' that employee on that date.

file = any file or report that has the EmpID where I might want to find out who the manager (or their manager) was on the date which is also included on the file. Is this the right approach to solve for this?

for i in range(len(file)):  
        file.ix[i,'Manager'] = reporting_df[(reporting_df.StartDate.shift(-1) > file.StartDate[i]) &(reporting_df.StartDate <= file.Date[i])]

Aucun commentaire:

Enregistrer un commentaire