Bulk insert with data validation

bulkinsert sqlbulkcopy sql-server

Question

We have a requirement to insert a large number of records (about 2 to 3 million) into a table. However, we also need to validate the records and segregate invalid ones - primary key, foreign key, and NOT NULL violations - into a separate error table for later reference. From my research, BULK INSERT in SQL Server works well for the insert itself, but I can't figure out the best way to filter out the bad records. Would a staging table in between help? Although we could check for violations using queries against the staging table, we would then have to load the good records into the actual table with another insert - either through INSERT...SELECT or MERGE - but is that an efficient approach? I'm concerned because it amounts to doing the insert twice, roughly as sketched below.
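For illustration, this is the two-step pattern I have in mind - all table and column names here are made-up placeholders:

    -- Step 1: bulk load the raw file into a staging table
    BULK INSERT dbo.OrdersStaging
    FROM 'C:\data\orders.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);

    -- Step 2: a second insert that moves only the rows passing validation
    INSERT INTO dbo.Orders (OrderId, CustomerId, Amount)
    SELECT s.OrderId, s.CustomerId, s.Amount
    FROM dbo.OrdersStaging AS s
    WHERE s.OrderId IS NOT NULL                            -- NOT NULL check
      AND NOT EXISTS (SELECT 1 FROM dbo.Orders AS t
                      WHERE t.OrderId = s.OrderId)         -- PK check
      AND EXISTS (SELECT 1 FROM dbo.Customers AS c
                  WHERE c.CustomerId = s.CustomerId);      -- FK check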

I'm planning to use the .NET SqlBulkCopy class for the bulk insertions, and it doesn't offer clear error reporting either.

Can somebody point me to a more efficient solution?

EDIT: If this approach is the only solution, which method do you think is best for the second insert: INSERT...SELECT or MERGE? Would either match the efficiency and speed of BULK INSERT? Or is there a better alternative?

Thanks!

Popular Answer

Personally, I would not consider 2-3M records a large amount. Unless you need the data loaded in seconds, a single (non-bulk) insert will perform adequately.

If I'm nervous about the source data quality, I like to load into a staging table first and then do "soft RI" - check for PK, unique, and FK violations using plain SQL. If I'm worried about numeric/non-numeric or bad date values, I make every column in the staging table VARCHAR(8000) and use TRY_CONVERT when reading from it.
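A minimal sketch of that setup - the table and column names are just placeholders:

    -- Staging table: every column VARCHAR(8000), so the bulk load never
    -- fails on a bad value; type checks happen later via TRY_CONVERT
    CREATE TABLE dbo.OrdersStaging (
        OrderId    VARCHAR(8000),
        CustomerId VARCHAR(8000),
        Amount     VARCHAR(8000),
        OrderDate  VARCHAR(8000)
    );

    -- TRY_CONVERT returns NULL instead of raising an error on a bad
    -- value, so type problems show up as NULLs you can filter on
    SELECT
        TRY_CONVERT(INT, s.OrderId)           AS OrderId,
        TRY_CONVERT(INT, s.CustomerId)        AS CustomerId,
        TRY_CONVERT(DECIMAL(18,2), s.Amount)  AS Amount,
        TRY_CONVERT(DATE, s.OrderDate)        AS OrderDate
    FROM dbo.OrdersStaging AS s;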

Once the data is in staging, you can easily filter out only the good rows and report in detail on the bad rows.
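For example, bad rows can be routed into an error table with the reason attached (again, hypothetical names):

    -- Tag each failing row with the first violation found
    INSERT INTO dbo.OrdersErrors (RawOrderId, RawCustomerId, RawAmount, ErrorReason)
    SELECT s.OrderId, s.CustomerId, s.Amount,
           CASE
               WHEN TRY_CONVERT(INT, s.OrderId) IS NULL
                   THEN 'Missing or non-numeric OrderId'
               WHEN EXISTS (SELECT 1 FROM dbo.Orders AS t
                            WHERE t.OrderId = TRY_CONVERT(INT, s.OrderId))
                   THEN 'Duplicate primary key'
               ELSE 'CustomerId not found (FK violation)'
           END
    FROM dbo.OrdersStaging AS s
    WHERE TRY_CONVERT(INT, s.OrderId) IS NULL
       OR EXISTS (SELECT 1 FROM dbo.Orders AS t
                  WHERE t.OrderId = TRY_CONVERT(INT, s.OrderId))
       OR NOT EXISTS (SELECT 1 FROM dbo.Customers AS c
                      WHERE c.CustomerId = TRY_CONVERT(INT, s.CustomerId));

The good rows are just the complement: the same SELECT with the WHERE conditions negated, inserted into the real table.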



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow