We are running a database-backed web application for data analysis, currently based on C#/.NET with Entity Framework on the server and mainly HTML+JavaScript frameworks on the client side.
Our application regularly receives measurement X/Y datapoints batched in large amounts, i.e. 1e6 or more per upload, uploaded by users or received from other infrastructure.
Currently we have a table in MSSQL called `Values` with `id`, `series_id` as `int` and `x`, `y`, `z` as `float`. This table is filled via `BULK INSERT` when a client uploads data, and the associated metadata is saved in a `Series` table. The total DB size is currently approaching 1 TB, 99.99% of which is `Values` data.
This approach was simple to implement but it has several drawbacks which have made it complicated and slow over time:
- we have to insert in chunks to not overload the IIS process which preprocesses it (currently 200'000 datapoints per chunk)
- IIS process memory requirements during INSERT are huge (>1500MB for 200MB of data)
- inserting is far too slow (5 million records are ~100 MB and take >30 seconds to insert, even using BULK INSERT)
- during INSERT the whole table is locked, i.e. only one user can insert at a time
- retrieving the data is also quite slow, requesting 1e6 records sometimes takes >10 seconds
- deleting series with >1e6 records regularly causes timeouts on the web app side.
The data is never partially selected, so we don't really need to have it in a table. BUT it is 'thinned out' for display before being sent to clients: by default, i.e. in 99% of use cases, 1e6 records are reduced to 2,000 or 10,000 records before going to the client. The result is cached on the client, but if a new client requests the same set, it's processed again. The `Values` table also has an index on `series_id` which takes more disk space than the table itself.
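For context, the 'thinning' step itself doesn't have to be fancy. A minimal stride-decimation sketch in Python (the function name and `(x, y)` tuple representation are just for illustration, not our actual code):

```python
def thin(points, target=2000):
    """Reduce a list of (x, y) points to at most `target` points
    by keeping roughly every n-th point (stride decimation)."""
    if len(points) <= target:
        return points
    step = len(points) / target
    return [points[int(i * step)] for i in range(target)]

# 1e6 synthetic points reduced to the default 2000 for display
pts = [(i, i * 0.5) for i in range(1_000_000)]
reduced = thin(pts)
print(len(reduced))  # 2000
```

The point is that this reduction is cheap and deterministic, so it could just as well run once at upload time and be stored alongside the raw data.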
I am wondering whether it would make sense to change this storage format to BLOB storage in `Values` with its own data format (CSV, JSON, or binary), and maybe additional columns with preprocessed 'reduced' datasets for display which can be pushed to clients unchanged (e.g. as JSON). So the new `Values` table format would be something like `id, series_id, data(blob), reduced_data(blob)`, and there'd be just one `Values` row per `Series` entry, not 1e6 or more. The reduced dataset would be created once when uploaded data is received and then used for display whenever clients request it.
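To make the blob idea concrete, a sketch of what packing a series into a single binary blob could look like (Python, format entirely hypothetical: a little-endian uint32 count followed by float64 `(x, y, z)` triples):

```python
import struct

def pack_series(points):
    """Pack (x, y, z) float triples into one binary blob:
    uint32 count header, then 3 x float64 per point (24 bytes each)."""
    buf = bytearray(struct.pack("<I", len(points)))
    for x, y, z in points:
        buf += struct.pack("<ddd", x, y, z)
    return bytes(buf)

def unpack_series(blob):
    """Inverse of pack_series: read the count, then the triples."""
    (n,) = struct.unpack_from("<I", blob, 0)
    return [struct.unpack_from("<ddd", blob, 4 + i * 24) for i in range(n)]
```

This kind of blob would be written and deleted as one value per series, which is the whole point of the proposed schema change.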
I will lose partial selects of values by ID or X/Y value, but `Values` are never SELECTed based on anything other than `id` or `series_id`, so this is currently not a limitation. So here are my questions:
- Does this make sense at all? I expect creation and deletion of a large BLOB dataset to always be significantly faster than creating and deleting 1,000,000 individual records. True?
- Binary BLOB or CSV/JSON BLOB? The simplest approach to BLOB storage is of course to build one huge CSV or JSON chunk and save it (possibly gzipped) in the database. A custom binary format would be even smaller, but it would have to be converted to JSON before being sent to clients. I have a feeling the additional hassle that comes with binary formats may not be worth it, and it's better to gzip the CSV/JSON blob than to invent a binary format. True?
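To get a rough feel for the size trade-off, here is a small experiment one could run (synthetic random X/Y data; real measurement data typically has more structure and compresses better, so treat the numbers as a sketch, not a benchmark):

```python
import gzip
import random
import struct

random.seed(0)
points = [(random.random(), random.random()) for _ in range(10_000)]

# Option A: CSV text, then gzipped
csv_blob = "\n".join(f"{x},{y}" for x, y in points).encode()
csv_gz = gzip.compress(csv_blob)

# Option B: raw binary, two float64 per point (16 bytes), then gzipped
bin_blob = b"".join(struct.pack("<dd", x, y) for x, y in points)
bin_gz = gzip.compress(bin_blob)

print("CSV:", len(csv_blob), "gzipped:", len(csv_gz))
print("binary:", len(bin_blob), "gzipped:", len(bin_gz))
```

Raw binary is a fixed 16 bytes per point here, while gzip's win on the CSV depends heavily on how the values are formatted (decimal precision, repeated digits), so measuring on real series data would settle the question better than intuition.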
- How about other drawbacks of BLOBs that I may not even be aware of? Size limitations don't seem to be an issue; `varbinary(MAX)` is sufficient. I don't need an index on the values inside the blob, just on the metadata (which is in the `Series` table).
Thoughts?