How can I avoid inserting duplicate rows into SQL Server tables that don't have a primary key?

bulk c# sql sqlbulkcopy sql-server

Question

I'm using the SqlBulkCopy class in C# to copy data from one SQL Server database to another quickly. The databases are on different servers and their tables don't have any primary keys, which makes the process more complicated.

The problem is that the query I'm using to select data from the source database returns duplicate rows, and SqlBulkCopy cannot avoid inserting duplicate records into the destination database.

I cannot use SELECT * because it throws an OutOfMemoryException, so I do SELECT TOP X * and load that data into a DataTable. In each DataTable I can remove the duplicate records using C#, but when I select the next TOP X, the first rows selected may be equal to the last ones from the previous DataTable, which have already been inserted into the destination database. The DataTable variable is always the same; it is reloaded each time.

I want to avoid inserting duplicate records without creating a primary key, because that isn't applicable in my case. I really need to use SqlBulkCopy because a fast copy is a system requirement. Any suggestions? Thank you in advance!
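For reference, here is a stripped-down sketch of the batching loop I'm describing. The connection strings, table name, and the OFFSET/FETCH paging are placeholders for illustration; without a key column the ordering is not stable, which is exactly why consecutive batches can overlap.

    using System.Data;
    using System.Data.SqlClient;

    class BatchedCopy
    {
        static void Main()
        {
            // Placeholder connection strings and table name.
            const string sourceCs = "Server=SourceServer;Database=SourceDb;Integrated Security=true";
            const string destCs   = "Server=DestServer;Database=DestDb;Integrated Security=true";
            const int batchSize = 10000;
            int offset = 0;

            while (true)
            {
                var table = new DataTable();
                using (var src = new SqlConnection(sourceCs))
                using (var cmd = new SqlCommand(
                    // Without a key column there is no stable ORDER BY, so
                    // consecutive batches can repeat rows already copied.
                    $"SELECT * FROM dbo.MyTable ORDER BY (SELECT NULL) " +
                    $"OFFSET {offset} ROWS FETCH NEXT {batchSize} ROWS ONLY", src))
                {
                    src.Open();
                    new SqlDataAdapter(cmd).Fill(table);
                }

                if (table.Rows.Count == 0) break;

                // Duplicates inside 'table' can be removed in C#, but rows that
                // overlap with the previous batch have already been written.
                using (var bulk = new SqlBulkCopy(destCs))
                {
                    bulk.DestinationTableName = "dbo.MyTable";
                    bulk.WriteToServer(table);
                }

                offset += batchSize;
            }
        }
    }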

Accepted Answer

Don't use C#.

You can right-click your source database in SSMS, choose "Tasks" and then "Generate Scripts". Choose the table you want and use the wizard to generate your INSERT scripts, then run them on your second database.

If this action needs to be repeated, you could set up a Linked Server between your two SQL Server instances and then write an INSERT statement from one to the other in a stored procedure. You can then run the stored procedure whenever you need to, or call it from C#.
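As a rough sketch only: the linked server, database, table, and procedure names below are hypothetical, and the EXCEPT-based insert shown in the comment is just one way such a procedure could skip rows that already exist. Calling it from C# might look like this:

    using System.Data;
    using System.Data.SqlClient;

    class RunCopyProc
    {
        // Assumed: a stored procedure dbo.CopyMissingRows exists on the destination
        // instance and contains something along the lines of
        //   INSERT INTO dbo.MyTable
        //   SELECT * FROM [SourceServer].SourceDb.dbo.MyTable
        //   EXCEPT
        //   SELECT * FROM dbo.MyTable;
        // which pulls rows across the linked server and skips ones already present.
        static void Main()
        {
            const string destCs = "Server=DestServer;Database=DestDb;Integrated Security=true";

            using (var conn = new SqlConnection(destCs))
            using (var cmd = new SqlCommand("dbo.CopyMissingRows", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.CommandTimeout = 0;   // cross-server copies can take a while
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }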

If you want it to run regularly, you could set up a Job on the database.


Popular Answer

Have you considered copying the rows out of the first database into a file on disk rather than into memory? Then you will be able to get all of them in one go rather than needing to make batches with SELECT TOP X *. Once the data is on disk it can be sorted -- perhaps even with an implementation of Unix sort that handles large files -- and duplicate records removed.
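A minimal sketch of that spooling step, assuming a hypothetical connection string, table, and output path, and leaving the sort/dedup pass to an external tool as suggested:

    using System.Data.SqlClient;
    using System.IO;

    class SpoolToDisk
    {
        // Stream every row to a delimited text file without holding the whole
        // table in memory. Connection string, table, and path are placeholders.
        static void Main()
        {
            const string sourceCs = "Server=SourceServer;Database=SourceDb;Integrated Security=true";

            using (var conn = new SqlConnection(sourceCs))
            using (var cmd = new SqlCommand("SELECT * FROM dbo.MyTable", conn))
            using (var writer = new StreamWriter(@"C:\temp\mytable.txt"))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    var values = new object[reader.FieldCount];
                    while (reader.Read())
                    {
                        reader.GetValues(values);
                        // One tab-delimited line per row; an external sort that
                        // handles large files can then drop the duplicates.
                        writer.WriteLine(string.Join("\t", values));
                    }
                }
            }
        }
    }

Because SqlDataReader streams rows one at a time, the full SELECT * does not need to fit in memory the way a DataTable does.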

If you want to remove duplicates then at some point you are going to need all the data in one place, and either sort it or build an index on it. That can be in the first database, in memory, on disk, or in the second database. There are reasons why you don't want to build an index in either of the databases, and there isn't room for all the data in memory, so that seems to leave spooling it to disk as the only option.

Personally, though, I would think very hard about adding a primary key. Although you say it's not applicable, it may be worth having one just to help with data loading.
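If adding a key ever becomes acceptable, a surrogate identity column would be enough; the table, column, and constraint names below are hypothetical:

    using System.Data.SqlClient;

    class AddSurrogateKey
    {
        // Hypothetical one-off step: add an identity column with a primary key
        // constraint so that batches have a stable column to page on.
        static void Main()
        {
            const string sourceCs = "Server=SourceServer;Database=SourceDb;Integrated Security=true";
            const string ddl =
                "ALTER TABLE dbo.MyTable ADD RowId INT IDENTITY(1,1) NOT NULL " +
                "CONSTRAINT PK_MyTable_RowId PRIMARY KEY;";

            using (var conn = new SqlConnection(sourceCs))
            using (var cmd = new SqlCommand(ddl, conn))
            {
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }
    }

With such a column in place, each batch could be selected with WHERE RowId > @lastRowId ORDER BY RowId, which removes the overlap between batches.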



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow