Monday, October 5, 2015

How to handle scraping duplicated data?

I'm scraping data from another site, and I frequently deal with a situation as below:

EntityA
    IdEntityB
    IdEntityC

EntityB
    IdEntityD
    IdEntityE

Each of these entities has its own page, and I would like to insert them into a SQL database. However, the order in which I scrape items is not the optimal one. My solution so far (which doesn't deal with foreign keys or any kind of mapping) has been to scrape EntityA's page, look for the link to its corresponding EntityB page, and schedule that page to be scraped too. Meanwhile, all the scraped entities get thrown together in a bin, and I batch them to be inserted into the database. For performance reasons, I wait until I have about 2000 entities scraped before flushing all of them to the database.

The naive approach is to just insert each entity without a unique identifier, but that would mean I would have to use some other (non-numeric), lower-quality piece of information to reference each entity in the system. How can I guarantee I have clean data in the DB when I can't scrape all of the entities together?

This is in Python, using the Scrapy framework.
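To make the setup concrete, here is a minimal sketch (not my actual code) of the buffer-and-flush approach described above, using an in-memory SQLite database for illustration. The key assumption is that each scraped page exposes the source site's own identifier, which is stored as the primary key so that rows can be upserted in any order, and references between entities are stored as those source IDs rather than database-generated ones. The `entity` table, `enqueue`/`flush` helpers, and batch size are all hypothetical names.

```python
# Sketch of the described approach: buffer scraped entities, flush in batches,
# and key every row by the source site's own ID so scrape order doesn't matter.
import sqlite3

BATCH_SIZE = 2000  # the ~2000-entity flush threshold mentioned above

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entity (
        source_id TEXT PRIMARY KEY,  -- the scraped site's own identifier
        kind      TEXT NOT NULL,     -- e.g. 'A', 'B', ...
        parent_id TEXT               -- source_id of a linked entity; may
                                     -- refer to a row not yet inserted
    )
""")

buffer = []

def enqueue(source_id, kind, parent_id=None):
    """Collect one scraped entity; flush once the batch is full."""
    buffer.append((source_id, kind, parent_id))
    if len(buffer) >= BATCH_SIZE:
        flush()

def flush():
    """Upsert the whole batch in a single transaction; re-scraped
    duplicates are merged instead of producing extra rows."""
    with conn:
        conn.executemany(
            "INSERT INTO entity (source_id, kind, parent_id) VALUES (?, ?, ?) "
            "ON CONFLICT(source_id) DO UPDATE SET "
            "kind = excluded.kind, "
            "parent_id = COALESCE(excluded.parent_id, entity.parent_id)",
            buffer,
        )
    buffer.clear()

# EntityA arrives first and references an EntityB not yet scraped:
enqueue("A1", "A", parent_id="B1")
enqueue("B1", "B")  # scraped later; the stored reference already resolves
flush()
```

Because the reference column holds the source site's ID rather than an auto-increment key, no foreign-key constraint is violated by out-of-order arrival (note that `ON CONFLICT ... DO UPDATE` requires SQLite 3.24 or later).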
