Processing 1.5 million rows against each other

c# preprocessor sqlbulkcopy sql-server-2008

Question

I need to go through 1.5 million rows; for each row I need to calculate a variance against all other rows.

At the moment I'm doing something like:

    // Requires: using System; using System.Collections; using System.Data;
    // Mocked Data
    var myTable = new DataTable("hello");
    myTable.Columns.Add("ID", typeof (int));
    myTable.Columns.Add("Value", typeof(byte[]));
    var random = new Random();
    for (int i = 0; i < 1000; i++)
    {
        var row = myTable.NewRow();
        row["ID"] = i;
        var bitArray = new BitArray(50 * 50);
        for (int j = 0; j < 50*50; j++)
        {
            bitArray.Set(j, random.NextDouble() >= random.NextDouble());
        }

        byte[] byteArray = new byte[(int)Math.Ceiling((double)bitArray.Length / 8)];

        bitArray.CopyTo(byteArray, 0);

        row["Value"] = byteArray;
        myTable.Rows.Add(row);
    }
    // Mocked data complete.


    var calculated = new DataTable("calculated");
    calculated.Columns.Add("ID", typeof (int));
    calculated.Columns.Add("AaginstID", typeof (int));
    calculated.Columns.Add("ComputedIntersect", typeof(byte[]));
    calculated.Columns.Add("ComputedUnion", typeof(byte[]));
    for (int i = 0; i < myTable.Rows.Count; i++)
    {
        for (int j = i + 1; j < myTable.Rows.Count; j++)
        {
            var row = calculated.NewRow();
            row["ID"] = myTable.Rows[i]["ID"];
            row["AaginstID"] = myTable.Rows[j]["ID"];

            var intersectArray = new BitArray((byte[]) myTable.Rows[i]["Value"]);
            var unionArray = new BitArray((byte[])myTable.Rows[i]["Value"]);
            var jArray = new BitArray((byte[])myTable.Rows[j]["Value"]);


            intersectArray.And(jArray);
            unionArray.Or(jArray);

            var intersectByteArray = new byte[(int)Math.Ceiling((double)intersectArray.Length / 8)];
            var unionByteArray = new byte[(int)Math.Ceiling((double)unionArray.Length / 8)];

            intersectArray.CopyTo(intersectByteArray, 0);
            unionArray.CopyTo(unionByteArray, 0);

            row["ComputedIntersect"] = intersectByteArray;
            row["ComputedUnion"] = unionByteArray;
            calculated.Rows.Add(row);
        }
        // Real data is 1.5m+ rows, so need to do this incrementally
        // STORE DATA TO DB HERE
    }

I store my data using SqlBulkCopy with TableLock on, and my BatchSize is the default (0, i.e. the whole batch at once). Saving 1.5 million records to the DB is a little slow (30-60 seconds), so I'm open to suggestions for changing my SQL storage mechanism too, but the main bottleneck is the C#. My BitArray is 2500 bits in size (I use 50 * 50 because it's a grid; the code allows a variable grid size, but for this test assume it's always 2500 bits).
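For reference, the write side with SqlBulkCopy looks roughly like the sketch below (the connection string and destination table name are placeholders, and the BatchSize value is just something to experiment with, not a recommendation):

    // Requires: using System.Data.SqlClient;
    using (var connection = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true"))
    {
        connection.Open();
        using (var bulk = new SqlBulkCopy(connection, SqlBulkCopyOptions.TableLock, null))
        {
            bulk.DestinationTableName = "dbo.Calculated"; // placeholder name
            bulk.BatchSize = 10000;         // default 0 sends everything in one batch
            bulk.BulkCopyTimeout = 0;       // disable the timeout for large loads
            bulk.WriteToServer(calculated); // the DataTable built in the loop above
            calculated.Clear();             // release rows before the next outer pass
        }
    }

Smaller batches commit in chunks and can ease transaction log pressure; whether that beats a single batch here is worth measuring rather than assuming.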

To process 1.5 million rows against a single row takes roughly 140 seconds, which is far too long to run for every row. This work is being done to pre-process the data for faster retrieval when it counts, so I COULD leave it going for a day, but by my calculations (1.5 million rows at ~140 seconds each, halved because each pair is computed only once) it would take nearly three years to process...

I store the data to the DB on each pass of the outer loop so that I don't hold too much in memory at once. The data above is set up in an unrealistic way; I do use BitArray.Set for the first round of processing (generating the 1.5 million rows), and that is a bottleneck, but it doesn't need revision. The main goal is to get the union/intersection of each row against all others, so that later on I can just pull out the related rows, ready to go. So if there is a better storage type (currently Binary(313) in the DB), or a better way to get the same result, I'm open to a rewrite.
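Most of the per-pair cost in the loop above is allocation and indexer overhead: three new BitArray objects, two new byte arrays, and several DataTable indexer hits for every pair. A hedged sketch of one alternative is below: decode each Value into a ulong[] once up front, then AND/OR the raw 64-bit words directly. WordCount = 40 is an assumption derived from the 2500-bit grid (ceil(2500 / 64)), and the sketch reuses the myTable built earlier:

    // Requires: using System; using System.Collections.Generic; using System.Data;
    const int WordCount = 40; // 2500 bits / 64 bits per word, rounded up

    // Decode every row's byte[] into 64-bit words once, up front.
    var values = new List<ulong[]>(myTable.Rows.Count);
    foreach (DataRow r in myTable.Rows)
    {
        var bytes = (byte[])r["Value"];   // 313 bytes for 2500 bits
        var words = new ulong[WordCount]; // 320 bytes of backing store
        Buffer.BlockCopy(bytes, 0, words, 0, bytes.Length);
        values.Add(words);
    }

    for (int i = 0; i < values.Count; i++)
    {
        ulong[] left = values[i];
        for (int j = i + 1; j < values.Count; j++)
        {
            ulong[] right = values[j];
            var intersect = new ulong[WordCount];
            var union = new ulong[WordCount];
            for (int w = 0; w < WordCount; w++)
            {
                intersect[w] = left[w] & right[w];
                union[w] = left[w] | right[w];
            }
            // Buffer.BlockCopy back to byte[] only when staging the
            // pair for SqlBulkCopy.
        }
    }

Since every pair is independent, Parallel.For over the outer index (each thread filling its own output batch) is a natural follow-on. The main win in the sketch is dropping the per-pair BitArray and DataTable work; whether it is enough on real data needs measuring.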

I have considered writing a SQL CLR function, but I'm not sure that is the right approach either. Pre-processing the data is required, so I'm looking for help on the best approach.
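On the SQL CLR idea: plain T-SQL is awkward here because its bitwise operators don't work when both operands are binary strings (one side must be an integer type), so ANDing two Binary(313) columns needs either a byte-by-byte SUBSTRING loop or a CLR function. A minimal sketch of the CLR route, assuming both inputs are the same fixed length:

    // Requires: using System.Data.SqlTypes; using Microsoft.SqlServer.Server;
    public static class BitwiseFunctions
    {
        [SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlBytes BinaryAnd(SqlBytes left, SqlBytes right)
        {
            if (left.IsNull || right.IsNull)
                return SqlBytes.Null;

            byte[] a = left.Value;
            byte[] b = right.Value; // assumed same length as a
            var result = new byte[a.Length];
            for (int i = 0; i < a.Length; i++)
                result[i] = (byte)(a[i] & b[i]);
            return new SqlBytes(result);
        }
        // A BinaryOr twin would be identical with | in place of &.
    }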

Popular Answer

I would suggest doing all your computation in the database. SQL Server is best at set-based operations, which makes it well suited to this type of problem. A brief outline of the steps you could take:

  1. bcp all your data into a temporary table.
  2. Update the temporary table with the computed values you need.
  3. Insert into your "real" table selecting the values you want from the temporary table.


Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow