I'm using the SqlBulkCopy class in C# to copy data from one SQL Server database to another quickly. The databases are on different servers and the tables don't have any PK, which makes the process more complicated.
The problem is that the query I'm using to select data from the source database returns duplicate rows, and SqlBulkCopy cannot avoid inserting those duplicate records into the destination database.
I cannot use SELECT * because it throws an OutOfMemoryException, so I do SELECT TOP X * and load that data into a DataTable. In each DataTable I can remove the duplicate records in C#, but when I select the next TOP X, the first row selected may be equal to the last one in the previous DataTable, which has already been inserted into the destination database. The DataTable variable is always the same; it is reloaded on every batch!
I want to prevent duplicate records from being inserted without creating a PK, because that is not applicable in my case. I really need to use SqlBulkCopy because fast copying is a system requirement. Any suggestions? Thank you in advance!
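To make the situation concrete, here is roughly what one batch looks like; the table name, batch size, and connection strings are placeholders. Duplicates within a batch are removed, but a duplicate spanning two batches is not caught:

```csharp
using System.Data;
using System.Data.SqlClient;

// Placeholder connection strings and batch size.
var sourceConnectionString = "Server=SourceServer;Database=SourceDb;Integrated Security=true";
var destinationConnectionString = "Server=DestServer;Database=DestDb;Integrated Security=true";
const int batchSize = 10000;

using var source = new SqlConnection(sourceConnectionString);
using var destination = new SqlConnection(destinationConnectionString);
source.Open();
destination.Open();

// Load one batch. Without a PK there is no stable ORDER BY,
// so consecutive batches can overlap.
var batch = new DataTable();
using (var command = new SqlCommand($"SELECT TOP {batchSize} * FROM MyTable", source))
using (var reader = command.ExecuteReader())
{
    batch.Load(reader);
}

// Remove duplicates *within* this batch (distinct over all columns)...
DataTable distinct = new DataView(batch).ToTable(true);

// ...but a row equal to one from the *previous* batch has already been
// inserted, and this batch has no way to see that.
using var bulk = new SqlBulkCopy(destination) { DestinationTableName = "MyTable" };
bulk.WriteToServer(distinct);
```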
Don't use C#.
You can right-click your origin database in SSMS and choose "Tasks" and then "Generate Scripts". Pick the table you want and use the wizard to generate your INSERT scripts, then run those against your second database.
If this action needs to be repeated, you could set up a Linked Server between your two SQL Server instances and write an INSERT statement from one to the other inside a stored procedure. You can then run that stored procedure whenever you need, or call it from C#, as in the sketch below.
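For instance (the linked server, database, table, and procedure names below are assumptions for illustration), the stored procedure could do the whole copy in one set-based statement, and SELECT DISTINCT would even take care of the duplicates; calling it from C# is then a single command:

```csharp
using System.Data;
using System.Data.SqlClient;

// Hypothetical stored procedure, created once on the destination server:
//
//   CREATE PROCEDURE dbo.CopyMyTable AS
//   BEGIN
//       -- [SourceServer] is a linked server pointing at the origin instance;
//       -- DISTINCT drops duplicate rows in a single set-based pass.
//       INSERT INTO dbo.MyTable
//       SELECT DISTINCT * FROM [SourceServer].SourceDb.dbo.MyTable;
//   END

var destinationConnectionString = "Server=DestServer;Database=DestDb;Integrated Security=true";

using var connection = new SqlConnection(destinationConnectionString);
connection.Open();

using var command = new SqlCommand("dbo.CopyMyTable", connection)
{
    CommandType = CommandType.StoredProcedure,
    CommandTimeout = 0 // large copies can take longer than the 30-second default
};
command.ExecuteNonQuery();
```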
If you want it to run regularly you could set up a SQL Server Agent job on the database.
Have you considered copying the rows out of the first database into a file on disk rather than into memory? Then you will be able to get all of them in one go rather than needing to make batches with SELECT TOP X *. Once the data is on disk it can be sorted -- perhaps even with an implementation of Unix sort that handles large files -- and duplicate records removed.
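Here is a rough sketch of that pipeline, assuming a Unix-style sort is on the PATH and the rows contain no embedded tabs or newlines; the table name, file paths, and connection string are placeholders:

```csharp
using System;
using System.Data.SqlClient;
using System.Diagnostics;
using System.IO;

var sourceConnectionString = "Server=SourceServer;Database=SourceDb;Integrated Security=true";
var rawPath = "rows.tsv";
var dedupedPath = "rows.sorted.tsv";

// 1. Stream every row to disk. SqlDataReader holds only one row at a
//    time, so this avoids the OutOfMemoryException that SELECT * caused.
using (var source = new SqlConnection(sourceConnectionString))
using (var command = new SqlCommand("SELECT * FROM MyTable", source))
{
    source.Open();
    using var reader = command.ExecuteReader();
    using var writer = new StreamWriter(rawPath);
    var values = new object[reader.FieldCount];
    while (reader.Read())
    {
        reader.GetValues(values);
        writer.WriteLine(string.Join("\t", values));
    }
}

// 2. Sort the file and drop duplicate lines in one pass. "sort -u"
//    spills to temporary files, so the data never has to fit in memory.
using (var sort = Process.Start(new ProcessStartInfo
{
    FileName = "sort",
    Arguments = $"-u -o {dedupedPath} {rawPath}",
    UseShellExecute = false
}))
{
    sort.WaitForExit();
    if (sort.ExitCode != 0)
        throw new InvalidOperationException("sort failed");
}

// 3. The sorted file now contains no duplicates at all, so it can be
//    loaded back in TOP-X-sized chunks and handed to SqlBulkCopy safely.
```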
If you want to remove duplicates then at some point you are going to need all the data in one place and either sort it or build an index over it. That can be in the first database, in memory, on disk, or in the second database. There are reasons why you don't want to create an index in either of the databases, and there isn't room for all the data in memory, so that seems to leave spooling it to disk as the only option.
Personally, though, I would think very hard about making a primary key. Although you say it's not applicable, it may be worth having it just to help with data loading.