Saturday, April 2, 2016

Finding duplicate patterns in many-to-many relations

I use Pandas to pre-process a CSV dataset and convert it to a SQLite database.

I have a many-to-many relation between two entities A and B, represented by a junction DataFrame A2B with A2B.columns == ['AId', 'BId']. The uniqueness constraint on As is that each A must relate to a different collection of Bs.

I want to efficiently remove duplicate As based on this constraint. So far, I do it with Pandas like this:

AId_dedup = A2B.groupby('AId').BId.apply(tuple).drop_duplicates().index

The conversion to tuples allows hashing and comparing the collection of BIds related to each AId. It is already a bit slow on one hundredth of our dataset.
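As a self-contained illustration of the one-liner above, here is a toy junction table (the AId/BId values are made up for the example). Note that tuples compare element-wise in order, so this assumes the BId rows appear in a consistent order per AId; sorting within each group would make it order-independent.

```python
import pandas as pd

# Toy junction table: A1 and A3 relate to the same set of Bs {1, 2},
# so one of them is a duplicate under the constraint.
A2B = pd.DataFrame({
    'AId': ['A1', 'A1', 'A2', 'A3', 'A3'],
    'BId': [1, 2, 2, 1, 2],
})

# Collapse each A's related BIds into a hashable tuple, then keep the
# first AId for each distinct tuple.
AId_dedup = A2B.groupby('AId').BId.apply(tuple).drop_duplicates().index

print(list(AId_dedup))  # → ['A1', 'A2']
```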

  • What am I trying to do, in terms of relational algebra?
  • Can it be done more efficiently in Pandas, or in SQL with the SQLite engine?
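For the SQL side of the question, a possible sketch of the same deduplication in SQLite uses GROUP_CONCAT to build a per-AId signature. This is an assumption on my part, not a benchmarked answer; in particular, SQLite does not guarantee the concatenation order of GROUP_CONCAT, so the ORDER BY in the subquery below is a common workaround rather than a documented guarantee.

```python
import sqlite3

# In-memory database with the same toy junction table as before
# (A1 and A3 share the B-set {1, 2}).
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE A2B (AId TEXT, BId INTEGER)")
con.executemany(
    "INSERT INTO A2B VALUES (?, ?)",
    [('A1', 1), ('A1', 2), ('A2', 2), ('A3', 1), ('A3', 2)],
)

# Build one textual signature per AId, then keep one AId per distinct
# signature. The inner ORDER BY tries to fix the GROUP_CONCAT order.
rows = con.execute("""
    WITH sig AS (
        SELECT AId, GROUP_CONCAT(BId) AS bids
        FROM (SELECT AId, BId FROM A2B ORDER BY AId, BId)
        GROUP BY AId
    )
    SELECT MIN(AId) FROM sig GROUP BY bids ORDER BY 1
""").fetchall()

print([r[0] for r in rows])
```

Whether this beats the Pandas version would depend on indexes and dataset size; it is only meant to show what the grouping looks like in SQL.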
