I use Pandas to pre-process a CSV dataset and convert it to a SQLite database.
I have a many-to-many relation between two entities A and B, represented by a junction DataFrame A2B with A2B.columns == ['AId', 'BId']. The uniqueness constraint on As is that each A has a distinct set of relations to Bs.
I want to efficiently remove duplicate As based on this constraint. So far, I do it with Pandas like this:
AId_dedup = A2B.groupby('AId').BId.apply(tuple).drop_duplicates().index
The conversion to tuples allows hashing and comparing the collection of BIds related to each AId. It is already a bit slow on a hundredth of our dataset.
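A minimal runnable sketch of this approach on toy data (the sample values are made up for illustration; note that tuple comparison is order-sensitive, so if row order within each AId group is not deterministic, the BIds may need sorting first):

```python
import pandas as pd

# Toy junction table: AId 1 and AId 3 relate to the same set of Bs,
# so AId 3 is a duplicate under the constraint.
A2B = pd.DataFrame({
    'AId': [1, 1, 2, 3, 3],
    'BId': [10, 20, 10, 10, 20],
})

# Collapse each AId's BIds into a hashable tuple, then keep only the
# first AId for each distinct tuple. Tuples compare element-by-element
# in order, so identical sets in different row orders would NOT match.
AId_dedup = A2B.groupby('AId').BId.apply(tuple).drop_duplicates().index

print(list(AId_dedup))  # → [1, 2]  (AId 3 duplicates AId 1)
```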
- What am I trying to do, in terms of relational algebra?
- Can it be done more efficiently in Pandas or in SQL with the SQLite engine?
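For context, here is one possible shape the SQLite side might take, run on the same toy data (the table name A2B and the GROUP_CONCAT trick are my assumptions; SQLite does not formally guarantee GROUP_CONCAT ordering, so the ORDER BY in the inner subquery is a common but non-contractual way to make the concatenated strings comparable):

```python
import sqlite3
import pandas as pd

A2B = pd.DataFrame({'AId': [1, 1, 2, 3, 3], 'BId': [10, 20, 10, 10, 20]})

con = sqlite3.connect(':memory:')
A2B.to_sql('A2B', con, index=False)

# Concatenate each AId's BIds into one string (in a fixed order),
# then keep one representative AId per distinct string.
rows = con.execute("""
    SELECT MIN(AId) AS AId
    FROM (
        SELECT AId, GROUP_CONCAT(BId) AS BIds
        FROM (SELECT AId, BId FROM A2B ORDER BY AId, BId)
        GROUP BY AId
    )
    GROUP BY BIds
    ORDER BY AId
""").fetchall()

print([r[0] for r in rows])
```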