We are running a database-backed web application for data analysis, currently based on C#/.NET with Entity Framework on the server and mainly HTML+JavaScript frameworks on the client side.
Our application regularly receives measurement X/Y datapoints batched in large amounts, i.e. 1e6 or more per upload, uploaded by users or received from other infrastructure.
Currently we have a table in MSSQL called `Values` with `id`, `series_id` as `int` and `x`, `y`, `z` as `float`. This table is filled via `BULK INSERT` when a client uploads data, and the associated metadata is saved in a `Series` table. The total DB size is currently approaching 1 TB, 99.99% of which is `Values` data.
This approach was simple to implement but it has several drawbacks which have made it complicated and slow over time:
- we have to insert in chunks to not overload the IIS process which preprocesses it (currently 200'000 datapoints per chunk)
- IIS process memory requirements during INSERT are huge (>1500MB for 200MB of data)
- inserting is far too slow (5 million records are ~100 MB and take >30 seconds to insert, even using BULK INSERT)
- during INSERT the whole table is locked, i.e. only one user can insert at a time
- retrieving the data is also quite slow, requesting 1e6 records sometimes takes >10 seconds
- deleting series with >1e6 records regularly causes timeouts on the web app side.
The data is never partially selected, so we don't really need to have it in a table. BUT it is 'thinned out' for display before being sent to clients: by default, i.e. in 99% of use cases, 1e6 records are reduced to 2,000 or 10,000 records before going to the client. The result is cached on the client, but if a new client requests the same set, it's processed again. The `Values` table also has an index on `series_id` which takes more disk space than the table itself.
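For context, the 'thinning' step itself doesn't have to be fancy. A minimal stride-decimation sketch in Python (the function name and `(x, y)` tuple representation are just for illustration, not our actual code):

```python
def thin(points, target=2000):
    """Reduce a list of (x, y) points to at most `target` points
    by keeping roughly every n-th point (stride decimation)."""
    if len(points) <= target:
        return points
    step = len(points) / target
    return [points[int(i * step)] for i in range(target)]

# 1e6 synthetic points reduced to the default 2000 for display
pts = [(i, i * 0.5) for i in range(1_000_000)]
reduced = thin(pts)
print(len(reduced))  # 2000
```

The point is that this reduction is cheap and deterministic, so it could just as well run once at upload time and be stored alongside the raw data.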
I am wondering whether it would make sense to change this storage format to BLOB storage in `Values` with its own data format (CSV, JSON, or binary), and maybe additional columns with preprocessed 'reduced' datasets for display which can be pushed to clients unchanged (e.g. as JSON). So the new `Values` table format would be something like `id, series_id, data(blob), reduced_data(blob)`, and there'd be just one `Values` row per `Series` entry, not 1e6 or more. The reduced dataset would be created once when uploaded data is received and then used for display whenever clients request it.
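To make the blob idea concrete, a sketch of what packing a series into a single binary blob could look like (Python, format entirely hypothetical: a little-endian uint32 count followed by float64 `(x, y, z)` triples):

```python
import struct

def pack_series(points):
    """Pack (x, y, z) float triples into one binary blob:
    uint32 count header, then 3 x float64 per point (24 bytes each)."""
    buf = bytearray(struct.pack("<I", len(points)))
    for x, y, z in points:
        buf += struct.pack("<ddd", x, y, z)
    return bytes(buf)

def unpack_series(blob):
    """Inverse of pack_series: read the count, then the triples."""
    (n,) = struct.unpack_from("<I", blob, 0)
    return [struct.unpack_from("<ddd", blob, 4 + i * 24) for i in range(n)]
```

This kind of blob would be written and deleted as one value per series, which is the whole point of the proposed schema change.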
I will lose partial selects of values by ID or X/Y value, but `Values` are never SELECTed based on anything other than `id` or `series_id`, so this is currently not a limitation. So here are my questions:
- Does this make sense at all? I expect creation and deletion of a large BLOB dataset to always be significantly faster than creating and deleting 1,000,000 individual records. True?
- Binary BLOB or CSV/JSON BLOB? The simplest approach to BLOB storage is of course to build one huge CSV or JSON chunk and save it (possibly gzipped) in the database. A custom binary format would be even smaller, but it would have to be converted to JSON before being sent to clients. I have a feeling the additional hassle that comes with binary formats may not be worth it, and it's better to gzip the CSV/JSON blob than to invent a binary format. True?
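To get a rough feel for the size trade-off, here is a small experiment one could run (synthetic random X/Y data; real measurement data typically has more structure and compresses better, so treat the numbers as a sketch, not a benchmark):

```python
import gzip
import random
import struct

random.seed(0)
points = [(random.random(), random.random()) for _ in range(10_000)]

# Option A: CSV text, then gzipped
csv_blob = "\n".join(f"{x},{y}" for x, y in points).encode()
csv_gz = gzip.compress(csv_blob)

# Option B: raw binary, two float64 per point (16 bytes), then gzipped
bin_blob = b"".join(struct.pack("<dd", x, y) for x, y in points)
bin_gz = gzip.compress(bin_blob)

print("CSV:", len(csv_blob), "gzipped:", len(csv_gz))
print("binary:", len(bin_blob), "gzipped:", len(bin_gz))
```

Raw binary is a fixed 16 bytes per point here, while gzip's win on the CSV depends heavily on how the values are formatted (decimal precision, repeated digits), so measuring on real series data would settle the question better than intuition.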
- How about other drawbacks of BLOBs that I may not even be aware of? Size limitations don't seem to be an issue; `varbinary(MAX)` is sufficient. I don't need an index on the values inside the blob, just on the metadata (which is in the `Series` table).
Thoughts?