I use Pandas to pre-process a CSV dataset and convert it to a SQLite database.
I have a many-to-many relation between two entities A and B, represented by a junction DataFrame `A2B` with `A2B.columns == ['AId', 'BId']`. The uniqueness constraint on As is that each A must relate to a different set of Bs.
I want to efficiently remove duplicate As based on this constraint. So far, I do it with Pandas like this:
AId_dedup = A2B.groupby('AId').BId.apply(tuple).drop_duplicates().index
The conversion to tuples allows hashing and comparing the collection of BIds related to each AId. It's already a bit slow on a hundredth of our dataset.
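For reference, here is a self-contained version of this approach on toy data (the `A2B` values are made up for illustration). Note that I sort by `BId` before grouping so the tuple key does not depend on row order, which the one-liner above assumes implicitly:

```python
import pandas as pd

# Toy junction table (made-up values): A1 and A3 relate to the same
# set of Bs {1, 2}, so one of them should be dropped.
A2B = pd.DataFrame({
    'AId': [1, 1, 2, 3, 3],
    'BId': [1, 2, 1, 2, 1],
})

# Sorting by BId first makes the tuple key independent of row order.
keys = A2B.sort_values('BId').groupby('AId')['BId'].apply(tuple)
AId_dedup = keys.drop_duplicates().index
print(list(AId_dedup))  # [1, 2]
```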
- What am I trying to do, in terms of relational algebra?
- Can it be done more efficiently in Pandas, or in SQL with the SQLite engine?
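For what it's worth, one vectorized alternative I have considered (a sketch on made-up data, not validated on the full dataset) is to hash every BId once and sum the hashes per AId: the sum is order-insensitive, so two As with the same set of Bs get the same key without building a Python tuple per group, at the cost of a theoretical hash-collision risk:

```python
import pandas as pd

# Same toy junction table as above (made-up values).
A2B = pd.DataFrame({
    'AId': [1, 1, 2, 3, 3],
    'BId': [1, 2, 1, 2, 1],
})

# Hash every BId once (vectorized), then sum the hashes per AId.
# uint64 overflow wraps, which is fine for a hash key; equal B-sets
# still produce equal sums regardless of row order.
h = pd.util.hash_pandas_object(A2B['BId'], index=False)
group_key = h.groupby(A2B['AId'].values).sum()
AId_dedup = group_key.drop_duplicates().index
print(list(AId_dedup))  # [1, 2]
```

This replaces the per-group `apply(tuple)` call with a single vectorized hash plus a `groupby(...).sum()`, both of which stay in compiled code.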