BulkCopy with relational data; fast inserts

insert sql sqlbulkcopy sql-server

Question

I have a large amount of constantly incoming data (roughly 10,000 items a minute, and growing) that I want to insert into a database as efficiently as possible. At the moment I'm using prepared insert statements, but I am thinking of using the SqlBulkCopy class to import the data in larger chunks.

The problem is that I'm not inserting into a single table - elements of the data item are inserted into numerous tables, and their identity columns are used as foreign keys in other rows that are inserted at the same time. I understand that bulk copies aren't meant to allow for more complex inserts like this, but I wonder if it is worth exchanging my identity columns (bigints in this case) for uniqueidentifier columns. This will allow me to do a couple of bulk copies for each table, and since I can determine the IDs before the insert, I don't need to check for anything like SCOPE_IDENTITY which is preventing me from using bulk copy.

Does this sound like a viable solution, or are there other potential issues I might face? Or, is there another way I can insert data quickly, but retain my use of bigint identity columns?

Thanks.

2/17/2011 2:37:02 PM

Accepted Answer

It sounds like you are planning on exchanging a "SQL assigns a [bigint identity() column] surrogate key" methodology for a "data prep routine assigns a GUID surrogate key" methodology. In other words, the key will not be assigned within SQL, but from outside SQL. Given your volumes, if the data-generating process can assign the surrogate key, I'd definitely go with that.
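
A minimal sketch of what "assigned from outside SQL" could look like, written in Python for brevity. The orders / order-lines shape and the build_batches name are hypothetical, not taken from the question; the point is only that once the client picks the parent key, the child rows can reference it before anything is inserted, so each table's rows can go in as one bulk load.

```python
import uuid

def build_batches(items):
    """Split nested items into one row list per table, with keys assigned up front.

    Hypothetical shape: each item has a "customer" and a list of (sku, qty) lines.
    """
    orders, lines = [], []
    for item in items:
        order_id = uuid.uuid4()  # key chosen by the client, not by the database
        orders.append((order_id, item["customer"]))
        for sku, qty in item["lines"]:
            # The child row carries the parent's key; no SCOPE_IDENTITY round trip.
            lines.append((uuid.uuid4(), order_id, sku, qty))
    return orders, lines

items = [
    {"customer": "acme", "lines": [("A1", 2), ("B7", 1)]},
    {"customer": "globex", "lines": [("C3", 5)]},
]
orders, lines = build_batches(items)
```

Each returned list would then be handed to a separate bulk copy, parents first so that foreign keys resolve.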

The question then becomes: must you use GUIDs, or can your data-generation process produce auto-incrementing integers? Creating such a process that works consistently and infallibly is hard (one reason why you pay $$$ for SQL Server), but the trade-off for smaller and more human-legible keys within the database might be worth it.
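
To make the integer option concrete, here is a rough sketch of a key allocator that reserves blocks of consecutive integers. KeyAllocator and its reserve method are hypothetical names, and this version assumes a single process: it does not persist its counter across restarts or coordinate between several writers, which is exactly the part that is hard to get right.

```python
import threading

class KeyAllocator:
    """Hands out increasing integer keys (hypothetical helper, single process only)."""

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def reserve(self, count):
        """Reserve `count` consecutive keys and return them as a range."""
        with self._lock:  # keep concurrent threads from receiving overlapping blocks
            first = self._next
            self._next += count
        return range(first, first + count)

alloc = KeyAllocator(start=1000)
parent_ids = alloc.reserve(3)  # keys for three parent rows
child_ids = alloc.reserve(2)   # next block, guaranteed not to overlap
```

Reserving a whole block per batch keeps contention on the lock low, at the cost of gaps in the sequence if a batch is abandoned.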

2/17/2011 2:56:48 PM

Popular Answer

Switching to uniqueidentifier will probably make things worse: you get more page splits, and the keys are wider (16 bytes versus 8 for a bigint).

If your load is, or can be, batched, one option is to:

  • load a staging table
  • load the real tables from it in one go, using a stored procedure
  • use a uniqueidentifier in the staging table to identify each batch

We deal with peaks of around 50k rows per second (and increasing) this way. We actually use a separate staging database to avoid double transaction log writes.
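
The staging steps above can be sketched as follows. This is an illustration only: it uses Python with an in-memory SQLite database as a stand-in for SQL Server, the table and column names are made up, and load_batch plays the role of the stored procedure. It also assumes each incoming parent carries a client-side reference (order_ref) that is unique across batches, so the generated integer keys can be joined back to the child rows.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")  # SQLite stand-in; not SQL Server
conn.executescript("""
    CREATE TABLE staging (batch_id TEXT, order_ref TEXT, customer TEXT,
                          sku TEXT, qty INTEGER);
    CREATE TABLE orders (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         order_ref TEXT, customer TEXT);
    CREATE TABLE order_lines (id INTEGER PRIMARY KEY AUTOINCREMENT,
                              order_id INTEGER REFERENCES orders(id),
                              sku TEXT, qty INTEGER);
""")

def load_batch(conn, rows):
    """Stage one batch, then move it into the real tables in a single transaction."""
    batch_id = str(uuid.uuid4())  # tags every staged row of this batch
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO staging VALUES (?, ?, ?, ?, ?)",
            [(batch_id, *row) for row in rows],
        )
        # Parents first: the database assigns the integer keys here.
        conn.execute(
            """INSERT INTO orders (order_ref, customer)
               SELECT DISTINCT order_ref, customer
               FROM staging WHERE batch_id = ?""",
            (batch_id,),
        )
        # Children next: join on order_ref to pick up the generated keys.
        conn.execute(
            """INSERT INTO order_lines (order_id, sku, qty)
               SELECT o.id, s.sku, s.qty
               FROM staging s JOIN orders o ON o.order_ref = s.order_ref
               WHERE s.batch_id = ?""",
            (batch_id,),
        )
        conn.execute("DELETE FROM staging WHERE batch_id = ?", (batch_id,))
    return batch_id

rows = [
    ("o-1", "acme", "A1", 2),
    ("o-1", "acme", "B7", 1),
    ("o-2", "globex", "C3", 5),
]
load_batch(conn, rows)
```

With this shape the real tables keep their integer identity keys; the uniqueidentifier only appears in staging, where it scopes the set-based inserts to one batch.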



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow