Wednesday, May 20, 2015

Detecting duplicates in a file

I am attempting to develop an application that parses a file (e.g. comma- or tab-delimited) and finds duplicate entries. The duplicate entries need to be written to a separate file, with the original input file remaining as is. The problem I am having is that I can't decide how I should actually find those matches.

Let's assume the data is as follows:

id,Firstname,Lastname,Address,Country
1,James,Michael,123 St,USA
2,James,Michae l,123 St,AU
3,Steve,Smith,12445,UK

*The rule is that two records are considered duplicates only if firstname, lastname, and address all match (keeping in mind that spaces must be ignored by the comparison).
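The spaces rule above could be captured by a small normalization helper, for example (a sketch in Python; the function and variable names are illustrative, not from the question):

```python
def normalize(field: str) -> str:
    # Strip all spaces so "Michae l" and "Michael" compare equal.
    return field.replace(" ", "")

def dedup_key(row):
    # row = [id, firstname, lastname, address, country];
    # only columns 1-3 participate in the duplicate check.
    return (normalize(row[1]), normalize(row[2]), normalize(row[3]))
```

With this, rows 1 and 2 of the sample data produce the same key, so row 2 would be flagged as a duplicate of row 1.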

Here are the questions I am struggling with:

  1. This application can't use a centralized, server-based database; it can only use something local that runs each time a new data file is loaded on a client's machine (leaving something like SQLite). Or would it be better not to use a database at all and do this in code?
  2. If there are 5,000,000 records in a given data file, what are some ways I can reduce the time it takes to find the duplicates?
  3. Does the language I use to develop this factor in at all (excluding development time)?

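For a sense of what the in-code option might look like: a single streaming pass with an in-memory set of normalized keys is O(n) and avoids a database entirely. This is only a sketch under my own assumptions (first line is a header, columns are in the order shown above, all function names are mine):

```python
import csv

def find_duplicates(in_path: str, dup_path: str) -> int:
    """Stream the input once; copy to dup_path every row whose
    normalized (firstname, lastname, address) key was seen before.
    The input file is never modified. Returns the duplicate count."""
    seen = set()
    dup_count = 0
    with open(in_path, newline="") as src, \
         open(dup_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)          # assume a header row
        writer.writerow(header)
        for row in reader:
            # Columns 1-3 (firstname, lastname, address), spaces removed.
            key = tuple(field.replace(" ", "") for field in row[1:4])
            if key in seen:
                writer.writerow(row)   # duplicate goes to the separate file
                dup_count += 1
            else:
                seen.add(key)
    return dup_count
```

On the sample data this writes the row with id 2 to the duplicates file. At 5,000,000 rows the set holds only the short key tuples, which usually fits in memory; if it didn't, the same idea works with SQLite by making the normalized key a UNIQUE column and catching constraint violations, or by sorting the file on the key first so duplicates become adjacent.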
Thanks for any advice
