Bulk insert strategy from c# to SQL Server

bulkinsert c# sqlbulkcopy sql-server

Question

As part of an ongoing project, customers send our system a set of complex/nested messages. These messages arrive at an average rate of 1000 to 2000 per second.

Each message contains both master data (which must be inserted if it is not found) and transaction data (which is always inserted). However, the customer passes the "name" of each master-data entity rather than its ID.

The system checks whether master data already exists for these names. If the IDs are present, they are used; otherwise, the master records are created first and their new IDs are used.

Once the master data IDs have been resolved, the system inserts the transactional data (referencing those IDs) into a SQL Server database. There are typically 15 to 20 master entities per message.

These are the strategies we are considering:

  1. Resolve the master IDs in our C# code first (inserting master data where necessary) and keep those IDs in an application-level cache. Once all IDs are resolved, bulk insert the transactional data using the SqlBulkCopy class. We would query the database 15 times or so to resolve the IDs for the various entities, then query it once more to insert the remaining data. All of this processing could reuse the same connection, which is then closed. A rough sketch of this option is shown after this list.

  2. Send all of these messages, comprising both master data and transactional data, to the database in a single shot (as multiple TVPs); then, within a stored procedure, create the missing master data first and insert the transactional data afterwards.
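
For illustration, option 1 would look roughly like the following sketch (the table and column names dbo.Category, CategoryID, and dbo.ProductPrice are hypothetical stand-ins for our real objects):

    // Sketch of option 1: resolve a master-data ID in C# (caching it in the app),
    // then bulk insert the transactional rows with SqlBulkCopy.
    // All table/column names here are hypothetical.
    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    public static class Option1Sketch
    {
        private static readonly Dictionary<string, int> CategoryCache =
            new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        public static int ResolveCategoryId(SqlConnection conn, string name)
        {
            int cachedId;
            if (CategoryCache.TryGetValue(name, out cachedId))
                return cachedId;

            // Insert the master record if it is missing, then return its ID.
            // (A real implementation would also guard against concurrent inserts.)
            using (SqlCommand cmd = new SqlCommand(@"
                IF NOT EXISTS (SELECT * FROM dbo.Category WHERE Name = @Name)
                    INSERT INTO dbo.Category (Name) VALUES (@Name);
                SELECT CategoryID FROM dbo.Category WHERE Name = @Name;", conn))
            {
                cmd.Parameters.Add("@Name", SqlDbType.NVarChar, 300).Value = name;
                int id = (int)cmd.ExecuteScalar();
                CategoryCache[name] = id;
                return id;
            }
        }

        public static void BulkInsertPrices(SqlConnection conn, DataTable prices)
        {
            // 'prices' already contains the resolved master-data IDs.
            using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
            {
                bulk.DestinationTableName = "dbo.ProductPrice";
                bulk.WriteToServer(prices);
            }
        }
    }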

Could someone advise on the best course of action in this use case?

I am unable to reveal the real object structure due to privacy concerns, but the hypothetical structure below is very close to our actual business object.

One such message will include data from several vendors for a single product (its master data) and its pricing information (transaction data):

Master data (which needs to be inserted if not found)

Product name: ABC, Category: XYZ, Manufacturer: XXX, and a few more properties (15-20 properties in total).

Transaction data (pricing, which will always be inserted)

ListPrice: XXX, Discount: XXX, Vendor Name: A

ListPrice: XXX, Discount: XXX, Vendor Name: B

ListPrice: XXX, Discount: XXX, Vendor Name: C

ListPrice: XXX, Discount: XXX, Vendor Name: D

For a message belonging to a single product, most of the master data will stay the same (and change less often), while the transaction data changes constantly. So the system first checks whether product "XXX" exists. If not, it checks whether the "Category" associated with this product exists; if it does not, a new record is inserted for the category and then one for the product. The same is done for Manufacturer and the other master entities.

At the same time, numerous vendors will be sending data about 2000-5000 products.

Say we have 1000 vendors, each supplying information on 10 to 15 distinct products. Every vendor sends us the latest prices for those 10-15 products every 2-3 seconds. A vendor may start sending information on new products, but that won't happen often.


Popular Answer

Your second option would probably be the best for you (i.e. sending all of the 15 - 20 entities to the DB in one shot using multiple TVPs and processing as a whole set of up to 2000 messages).

Caching master-data lookups at the application layer and translating names to IDs before sending the data to the database sounds good, but it overlooks a few things:

  1. You will have to hit the database to load the initial list anyway.
  2. You will have to hit the database to insert new entries anyway.
  3. Looking up values in a dictionary to replace them with IDs is exactly what the database already does (assume a Non-Clustered Index on each of these name-to-ID lookups).
  4. Data pages for frequently accessed values will be cached in the Buffer Pool (which is a memory cache).

Why duplicate at the application layer what is already provided, and is already happening, at the database layer? Especially given that:

  • The 15-20 entities may hold up to 20,000 records in total, which is a manageable amount given that the Non-Clustered Index only needs two fields: Name and ID (which, with a 100% Fill Factor, lets many rows fit on a single data page).
  • You don't need to worry about caching all 20k entries, because not all of them are "active" or "current". Whatever values are currently in use are the ones that keep being looked up, and the data pages holding them (which may contain some inactive entries, but that's fine) are the ones that stay cached in the Buffer Pool.

Therefore, you don't have to worry about aging out old entries or forcing key expirations or reloads due to possibly changed values (i.e. an updated Name for a given ID), because that is handled naturally.

In-memory caching is a great technique that speeds up websites significantly, but those scenarios / use cases are for when non-database processes repeatedly request the same data for read-only purposes. In this case, however, data is being merged, and the list of lookup values can change frequently (more so due to new entries than updated ones).


Taking all of that into consideration, the second option is the better one. Though not with 15 TVPs, I have used this approach successfully several times. It may need some optimizations or modifications to be tuned for this particular situation, but what I have found to work well is:

  • Accept the data via TVPs. I prefer them over SqlBulkCopy because:
    • The resulting Stored Procedure is easily self-contained.
    • They fit in very well with the application code: you can stream the collection(s) to the database without first copying them into a DataTable, which duplicates the collection and wastes CPU and memory. You do this by creating one method per collection that takes the collection as input, returns IEnumerable<SqlDataRecord>, and uses yield return; to send a record for each entry in the for or foreach loop. A sketch of such a method is shown after this list.
  • TVPs are not great for statistics and hence not great for JOINing to (though this can be mitigated by using TOP (@RecordCount) in the queries). In any case, you don't need to worry about that here, since they are only used to fill in any values missing from the actual tables.
  • Step 1: INSERT the missing Names for each entity. Remember that there should be a Non-Clustered Index on the [Name] field of each entity; assuming the ID is the Clustered Index, that value is automatically part of the index, so indexing [Name] alone also provides a covering index in addition to helping the following operation. And remember that any prior executions for this client (i.e. roughly the same entity values) will keep the data pages for these indexes cached in the Buffer Pool (i.e. memory).

    ;WITH cte AS
    (
      SELECT DISTINCT tmp.[Name]
      FROM   @EntityNumeroUno tmp
    )
    INSERT INTO EntityNumeroUno ([Name])
      SELECT cte.[Name]
      FROM   cte
      WHERE  NOT EXISTS(
                     SELECT *
                     FROM   EntityNumeroUno tab
                     WHERE  tab.[Name] = cte.[Name]
                       )
    
  • Step 2: INSERT all of the "messages" in a simple INSERT...SELECT, where the data pages for the lookup tables (i.e. the "entities") are already cached in the Buffer Pool thanks to Step 1.
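
As referenced in the TVP bullet above, a minimal sketch of such a streaming method might look like this. The Entity class, the stored procedure name (dbo.ImportMessages), the TVP type name (dbo.EntityNumeroUnoType), and the single Name column are assumptions for illustration; the IEnumerable<SqlDataRecord> / yield return pattern is the point.

    // Sketch: stream a C# collection into a TVP without copying it to a DataTable.
    // Entity class, procedure name, TVP type name, and column layout are hypothetical.
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using Microsoft.SqlServer.Server;   // SqlDataRecord, SqlMetaData

    public class Entity
    {
        public string Name { get; set; }
    }

    public static class TvpStreaming
    {
        // One method like this per collection / per TVP type.
        public static IEnumerable<SqlDataRecord> GetEntityRecords(IEnumerable<Entity> entities)
        {
            SqlMetaData[] schema = { new SqlMetaData("Name", SqlDbType.NVarChar, 100) };

            foreach (Entity entity in entities)
            {
                SqlDataRecord record = new SqlDataRecord(schema);
                record.SetString(0, entity.Name);
                yield return record;   // each row is streamed to SQL Server as it is read
            }
        }

        public static void Send(SqlConnection conn, IEnumerable<Entity> entities)
        {
            using (SqlCommand cmd = new SqlCommand("dbo.ImportMessages", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;

                // Note: passing an empty enumeration throws; skip the parameter
                // entirely if the collection has no items.
                SqlParameter tvp = cmd.Parameters.AddWithValue("@EntityNumeroUno",
                                                               GetEntityRecords(entities));
                tvp.SqlDbType = SqlDbType.Structured;
                tvp.TypeName = "dbo.EntityNumeroUnoType";

                cmd.ExecuteNonQuery();
            }
        }
    }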


Remember that testing cannot be replaced by guesswork, assumptions, or informed guesses. Since there could be other factors that have not been discussed that might affect what is deemed "perfect" here, you should test a few approaches to discover which one works best for your unique circumstance.

I will say that if the Messages are insert-only, then Vlad's idea may be faster. The approach I am describing here I have used in more complex situations that required full syncing (updates and deletes), additional validations, and the creation of related operational data (not lookup values). Using SqlBulkCopy could work for a straight insert, but that presupposes you are loading directly into the destination tables (messages and lookups) and not into intermediate / staging tables (and I believe Vlad's intent is to SqlBulkCopy straight into the destination tables). However, as stated above, using an external cache (rather than the Buffer Pool) is more error prone because of the issue of updating lookup values. Accounting for invalidating an external cache could take more code than it is worth, especially if doing so only improves performance slightly. That additional risk and maintenance has to be weighed when deciding which solution is ultimately better for your needs.


UPDATE

According to information in comments, we now know:

  • There are several Vendors
  • Each Vendor offers a variety of Products.
  • Products are offered by one or more vendors; they are not exclusive to one vendor.
  • Product properties are singular (one set per Product).
  • Pricing information has properties that allow for multiple records per Product.
  • Pricing information is insert-only (i.e. point-in-time history).
  • A unique Product is determined by its SKU (or a similar field).
  • A Product that already exists but arrives with different properties (such as category, manufacturer, etc.) will be treated as the same Product, and the differences will be ignored.

In light of all of the above, I would still recommend TVPs, but rethink the approach and make it Vendor-centric rather than Product-centric. The assumption here is that Vendors send files whenever they like. When you get a file, import it. The only lookup you would do ahead of time is for the Vendor. The basic layout is:

  1. It is reasonable to assume that you already have a VendorID at this point; why would the system import a file from an unknown source?
  2. You can import in batches.
  3. Create a SendRows method (sketched after this list) that:
    • accepts a FileStream or something else that allows iterating over the file
    • accepts something like int BatchSize
    • returns IEnumerable<SqlDataRecord>
    • creates a SqlDataRecord matching the TVP structure
    • loops over the FileStream until either BatchSize has been met or there are no more records in the file
    • performs any necessary data validations
    • maps the data into the SqlDataRecord
    • calls yield return;
  4. Open the file.
  5. While there is data in the file:
    • call the stored procedure
    • pass in the VendorID
    • pass in SendRows(FileStream, BatchSize) for the TVP
  6. Close the file.
  7. Experiment with:
    • opening the SqlConnection before starting the loop over the FileStream and closing it after the loop ends
    • opening the SqlConnection, executing the stored procedure, and closing the SqlConnection inside each iteration of the FileStream loop
  8. Experiment with different BatchSize values. Start at 100 and work upward: 200, 500, etc.
  9. The stored procedure will handle inserting new Products.
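
Here is a sketch of what steps 3 through 5 might look like. The file format (pipe-delimited SKU and ListPrice), the TVP type dbo.VendorRows, and the procedure name dbo.ImportVendorBatch are all assumptions; the batching and yield return mechanics are the point.

    // Sketch of steps 3-5: stream up to BatchSize rows per stored-procedure call.
    // The file format, TVP layout, and procedure/type names are hypothetical.
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using Microsoft.SqlServer.Server;

    public static class VendorImport
    {
        public static IEnumerable<SqlDataRecord> SendRows(StreamReader reader, int batchSize)
        {
            SqlMetaData[] schema =
            {
                new SqlMetaData("SKU", SqlDbType.VarChar, 50),
                new SqlMetaData("ListPrice", SqlDbType.Decimal, 18, 2)
            };

            int sent = 0;
            string line;
            while (sent < batchSize && (line = reader.ReadLine()) != null)
            {
                string[] fields = line.Split('|');
                if (fields.Length < 2) continue;          // basic validation

                SqlDataRecord record = new SqlDataRecord(schema);
                record.SetString(0, fields[0]);
                record.SetDecimal(1, decimal.Parse(fields[1]));
                yield return record;                      // streamed as it is consumed
                sent++;
            }
        }

        public static void ImportFile(string connectionString, int vendorId,
                                      string path, int batchSize)
        {
            using (StreamReader reader = new StreamReader(path))
            using (SqlConnection conn = new SqlConnection(connectionString))
            {
                conn.Open();   // one option from step 7: open once, outside the loop

                while (!reader.EndOfStream)   // assumes each batch yields at least one row
                {
                    using (SqlCommand cmd = new SqlCommand("dbo.ImportVendorBatch", conn))
                    {
                        cmd.CommandType = CommandType.StoredProcedure;
                        cmd.Parameters.AddWithValue("@VendorID", vendorId);

                        SqlParameter tvp = cmd.Parameters.AddWithValue(
                            "@Rows", SendRows(reader, batchSize));
                        tvp.SqlDbType = SqlDbType.Structured;
                        tvp.TypeName = "dbo.VendorRows";

                        cmd.ExecuteNonQuery();   // SendRows is enumerated here
                    }
                }
            }
        }
    }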

With this type of structure, you will be sending in Product properties that are not used (i.e. only the SKU is used to look up existing Products). BUT, there is no upper limit on file size, so it scales very well. If a Vendor sends 50 Products, fine. If they send 50,000 Products, fine. If they send 4 million Products, that's fine too (the system I worked on handled that volume, and it also updated Product info that differed for any of the Product's properties). No additional memory is needed at the application or database layers to handle even 10 million Products. The time the import takes should grow in step with the volume of Products sent.


UPDATE 2

New information on the data's source:

  • The data originates from an Azure EventHub.
  • The data comes in as C# objects (no files).
  • Product info is received through the O.P.'s system's APIs.
  • The data is collected in a single queue (just pull the data out and insert it into the database).

If the data source is C# objects, then I would most definitely use TVPs, since you can pass the C# objects through as-is via the approach I described in my first update (i.e. a method that returns IEnumerable<SqlDataRecord>). Send one or more TVPs for the per-Vendor Price/Offer info, but regular input parameters for the singular Product attributes. For example:

CREATE PROCEDURE dbo.ImportProduct
(
  @SKU             VARCHAR(50),
  @ProductName     NVARCHAR(100),
  @Manufacturer    NVARCHAR(100),
  @Category        NVARCHAR(300),
  @VendorPrices    dbo.VendorPrices READONLY,
  @DiscountCoupons dbo.DiscountCoupons READONLY
)
AS
SET NOCOUNT ON;

-- Insert Product if it doesn't already exist
IF (NOT EXISTS(
         SELECT  *
         FROM    dbo.Products pr
         WHERE   pr.SKU = @SKU
              )
   )
BEGIN
  INSERT INTO dbo.Products (SKU, ProductName, Manufacturer, Category, ...)
  VALUES (@SKU, @ProductName, @Manufacturer, @Category, ...);
END;

-- ...INSERT data from TVPs
-- might need OPTION (RECOMPILE) per each TVP query to ensure proper estimated rows
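
And a sketch of calling that procedure from C# for one incoming message. The Product and VendorPrice classes are hypothetical stand-ins for the real objects; the parameter names and types mirror the procedure above, and the dbo.VendorPrices column layout is assumed.

    // Sketch of calling dbo.ImportProduct for one message. The Product and
    // VendorPrice classes are hypothetical; only the parameter names/types
    // mirror the procedure above, and the dbo.VendorPrices layout is assumed.
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using Microsoft.SqlServer.Server;

    public class VendorPrice
    {
        public string VendorName { get; set; }
        public decimal ListPrice { get; set; }
        public decimal Discount { get; set; }
    }

    public class Product
    {
        public string SKU { get; set; }
        public string Name { get; set; }
        public string Manufacturer { get; set; }
        public string Category { get; set; }
        public List<VendorPrice> VendorPrices { get; set; }
    }

    public static class ProductImporter
    {
        // Assumes dbo.VendorPrices is (VendorName NVARCHAR(100), ListPrice DECIMAL(18,2), Discount DECIMAL(18,2)).
        private static IEnumerable<SqlDataRecord> GetVendorPriceRecords(IEnumerable<VendorPrice> prices)
        {
            SqlMetaData[] schema =
            {
                new SqlMetaData("VendorName", SqlDbType.NVarChar, 100),
                new SqlMetaData("ListPrice", SqlDbType.Decimal, 18, 2),
                new SqlMetaData("Discount", SqlDbType.Decimal, 18, 2)
            };

            foreach (VendorPrice price in prices)
            {
                SqlDataRecord record = new SqlDataRecord(schema);
                record.SetString(0, price.VendorName);
                record.SetDecimal(1, price.ListPrice);
                record.SetDecimal(2, price.Discount);
                yield return record;
            }
        }

        public static void ImportProduct(SqlConnection conn, Product product)
        {
            using (SqlCommand cmd = new SqlCommand("dbo.ImportProduct", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;

                cmd.Parameters.Add("@SKU", SqlDbType.VarChar, 50).Value = product.SKU;
                cmd.Parameters.Add("@ProductName", SqlDbType.NVarChar, 100).Value = product.Name;
                cmd.Parameters.Add("@Manufacturer", SqlDbType.NVarChar, 100).Value = product.Manufacturer;
                cmd.Parameters.Add("@Category", SqlDbType.NVarChar, 300).Value = product.Category;

                SqlParameter prices = cmd.Parameters.AddWithValue(
                    "@VendorPrices", GetVendorPriceRecords(product.VendorPrices));
                prices.SqlDbType = SqlDbType.Structured;
                prices.TypeName = "dbo.VendorPrices";

                // @DiscountCoupons is omitted here; a TVP parameter that is not
                // supplied comes through as an empty table.
                cmd.ExecuteNonQuery();
            }
        }
    }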


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow