I am attempting to develop an application that parses a delimited file (e.g. comma- or tab-delimited) and finds duplicate entries. The duplicates need to be written to a separate file, with the original input file remaining as is. The problem I am having is that I can't decide how I should actually find those matches.
Let's assume the data is as follows:
id,Firstname,Lastname,Address,Country
1,James,Michael,123 St,USA
2,James,Michae l,123 St,AU
3,Steve,Smith,12445,UK
*The rule is that two records are considered duplicates only if firstname, lastname, and address match (keeping in mind that whitespace must be ignored by the comparison, so "Michael" and "Michae l" are the same value).
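To make the rule concrete, here is a minimal sketch of the in-code approach I have in mind (Python; the column names are taken from the sample header above, and `write_duplicates` is just an illustrative name):

```python
import csv

def normalize_key(row):
    """Build a comparison key from firstname, lastname, and address,
    ignoring all whitespace and letter case."""
    return tuple("".join(row[field].split()).lower()
                 for field in ("Firstname", "Lastname", "Address"))

def write_duplicates(in_path, out_path, delimiter=","):
    """Copy every row whose key has already been seen into out_path;
    the input file is never modified."""
    seen = set()
    with open(in_path, newline="") as src, \
         open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter=delimiter)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                                delimiter=delimiter)
        writer.writeheader()
        for row in reader:
            key = normalize_key(row)
            if key in seen:
                writer.writerow(row)  # duplicate: record it separately
            else:
                seen.add(key)
```

With the sample data, record 2 would land in the duplicates file because its key matches record 1 once the space in "Michae l" is stripped.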
Here are the questions I am struggling with:
- This application can't use a centralized server-based database, only something local that runs each time a new data file is loaded on a client's machine (which leaves something like SQLite). Or would it be best not to use a database at all and do it in code?
- If there are 5,000,000 records in a given data file, what are some ways I can reduce the time it takes to find the duplicates?
- Does the language I use to develop this factor in at all (excluding development time)?
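For the SQLite option in the first question, this is roughly what I picture: stream the rows into a throwaway table with a `PRIMARY KEY` on the normalized key, and let the constraint detect duplicates. A sketch, assuming Python's built-in `sqlite3` and the sample header above:

```python
import csv
import sqlite3

def find_duplicates_sqlite(in_path, out_path):
    """Insert each row's normalized key into a throwaway SQLite table;
    rows whose key already exists are written to out_path unchanged."""
    con = sqlite3.connect(":memory:")  # use a temp file instead for very large inputs
    con.execute("CREATE TABLE seen (k TEXT PRIMARY KEY)")
    with open(in_path, newline="") as src, \
         open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Same normalization as the rule: drop whitespace, ignore case.
            key = "|".join("".join(row[f].split()).lower()
                           for f in ("Firstname", "Lastname", "Address"))
            try:
                con.execute("INSERT INTO seen VALUES (?)", (key,))
            except sqlite3.IntegrityError:
                writer.writerow(row)  # key already present: duplicate
    con.close()
```

An in-memory hash set and an on-disk SQLite index trade against each other here: the set is faster but holds every key in RAM, while SQLite lets the 5,000,000-record case spill to disk at the cost of per-row insert overhead.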
Thanks for any advice