I receive a daily XML file that contains thousands of records, each being a business transaction that I need to store in an internal database for use in reporting and billing. I was under the impression that each day's file contained only unique records, but have discovered that my definition of unique is not exactly the same as the provider's.
The current application that imports this data is a C#.Net 3.5 console application, it does so using SqlBulkCopy into a MS SQL Server 2008 database table where the columns exactly match the structure of the XML records. Each record has just over 100 fields, and there is no natural key in the data, or rather the fields I can come up with making sense as a composite key end up also having to allow nulls. Currently the table has several indexes, but no primary key.
Basically the entire row needs to be unique. If one field is different, it is valid enough to be inserted. I looked at creating an MD5 hash of the entire row, inserting that into the database and using a constraint to prevent SqlBulkCopy from inserting the row,but I don't see how to get the MD5 Hash into the BulkCopy operation and I'm not sure if the whole operation would fail and roll back if any one record failed, or if it would continue.
The file contains a very large number of records, going row by row in the XML, querying the database for a record that matches all fields, and then deciding to insert is really the only way I can see being able to do this. I was just hoping not to have to rewrite the application entirely, and the bulk copy operation is so much faster.
Does anyone know of a way to use SqlBulkCopy while preventing duplicate rows, without a primary key? Or any suggestion for a different way to do this?
I'd upload the data into a staging table then deal with duplicates afterwards on copy to the final table.
For example, you can create a (non-unique) index on the staging table to deal with the "key"
What is the data volume? You have 2 options that I can see:
1: filter it at source, by implementing your own
IDataReader and using some hash over the data, and simply skipping any duplicates so that they never get passed into the TDS.
2: filter it in the DB; at the simplest level, I guess you could have multiple stages of import - the raw, unsanitised data - and then copy the
DISTINCT data into your actual tables, perhaps using an intermediate table if you want to. You might want to use
CHECKSUM for some of this, but it depends.