Import a lot of xml files

c# sqlbulkcopy sql-server-2008 xml

Question

What is the best approach to import multiple xml files which are in very big number (ex: 30000) with different schema to Sql Server 2008?

I am currently looping through each file, loading data to datatable and using SqlBulkCopy to insert the data, but it is taking a lot of time (about 1.5 hours).

Accepted Answer

This shouldn't take so long. By my estimate, you've got around 600MB of data; you should be able to approach 10MB/s or at least 1MB/s without much difficulty - this means 1-10 minutes should be easily achievable.

What to do:

  • Identify what part is taking so long
  • Make a shortlist of possible bottlenecks
  • Test each theory until you found the culprit
  • Implement workaround.

Without more details, it's hard to be precise, but I can speculate:

  • SqlBulkCopy is usually fast, so your insert is likely not the bottleneck. You could do a little faster than a datatable, but it's probably not an issue.
  • DataTables can have "indexes"; i.e. primary keys and constraints. These are implemented very inefficiently - these could definitely cause problems.
  • Although SqlBulkCopy is fast, it's best with many rows. If you're copying just 1 file per SqlBulkCopy, that means 30000 calls, and probably at least 30000 fsyncs on the database side. You should be using only one SqlBulkCopy.
  • Your database may have indexes or constraints on the rows you've inserted; and these can severely limit performance - check database load.
  • most xml readers in .NET are reasonably fast, but 10MB/s on a single thread could be pushing it. If your access pattern is particularly bad (e.g. load a full XmlDocument and query it with lots of inefficient loops and/or XPath) you might be running into CPU load issues.
  • 20KB files are fairly small. If you're reading these off an HDD, to achieve just 1MB/s you'd need to read 50 a second, i.e. with 20ms latency. That might be a problem, particularly if the files are non-sequentially located on the disk. Parallelization might help a little, but an SSD will help more (and better would be both). However, you're seeing performance much lower than this, so this is probably not your primary bottleneck.

With that in mind, I'd look at the following in this order

  • Disk load on the client (if disk wait time is high, consider an SSD and concurrency)
  • CPU load - profile (inefficient parsing or misused DataTables)
  • SqlBulkCopy instances - use only one (per thread)
  • Database load (get rid of any indexes you can, including if possible clustered indexes)

The order is inspired by how hard it is to check for this problem. Disk load is unlikely to be problematic, but it's trivial to check, so you might as well start by eliminating that possibility. Database schema issues aren't that unlikely, but they're much more work to identify (which index is it, and am I impacting another workflow by removing it?) so I'd check those last.



Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow
Is this KB legal? Yes, learn why