Sensor Data and Distributed Logging with Timeseries

Data Series

There are a lot of data series products available today. Most of them are good enough and integrate with many external products.

Some time ago, I was looking for something to integrate with an observability frontend such as Grafana, and I noticed that Redis has very good support for timeseries.

Build or Buy

Just as this was happening, the company behind Redis, which at the time had nothing to do with its original creator, decided to switch its licensing model.

Redis would still have been viable for a single product, but the ongoing fragmentation of the ecosystem, together with the uncertainty around the licensing change, made it less appealing to sprinkle some custom code on top of an existing product.

Plus, if I need to put code in a project, it'd better be code that I fully understand and have a decent degree of control over.

Also, as a long-time Apache Cassandra user, I like the idea that data should not have a single point of failure, and geographically distributed data persistence adds value in itself.

For some time I had been thinking that maybe Cassandra wasn't the best storage for data series because of the way it indexes data, but in the end I realized it was just me not fully grasping the power and inner workings of partitioning and clustering keys, and how to use them properly to achieve my goals.
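As a rough illustration of how partitioning and clustering keys fit data series, here is a minimal sketch in Python. The table layout, the column names (series_id, bucket, ts, value) and the choice of a monthly bucket are assumptions made for this example; they are not the schema Yats actually uses. The idea is that the partition key groups one series for one time bucket, so partitions stay bounded in size, while the clustering key keeps the rows inside a partition sorted by time.

```python
from datetime import datetime, timezone

# Hypothetical CQL schema, for illustration only (not Yats's actual schema).
# Partition key:  (series_id, bucket) -> one partition per series per month.
# Clustering key: ts                  -> rows stored sorted by time.
SCHEMA = """
CREATE TABLE IF NOT EXISTS metrics (
    series_id text,
    bucket    text,
    ts        timestamp,
    value     double,
    PRIMARY KEY ((series_id, bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
"""


def month_bucket(ts: datetime) -> str:
    """Return the 'YYYY-MM' bucket (in UTC) for a timezone-aware timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m")


def partition_key(series_id: str, ts: datetime) -> tuple[str, str]:
    """Return the (series_id, bucket) pair that a data point would be stored under."""
    return (series_id, month_bucket(ts))
```

With this layout, reading one month of a series touches a single partition, and a time range inside that month is a contiguous slice of it.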

It wasn't long before I stumbled onto a very short whitepaper titled "Getting Started with Time Series Data Modeling", which has a paragraph simply titled "Cassandra is awesome at time series".

That was enough for me to take a stab at the problem.

This is how Yats was born. The first version exposes a very simple REST endpoint. I expect to iterate on the transport layer by adding a gRPC/Protobuf interface for lower data usage, but that is not a priority right now.
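To give an idea of what talking to a simple REST endpoint could look like from a sensor, here is a minimal client-side sketch in Python. Everything specific in it is hypothetical: the /metrics path, the JSON field names and the example host are invented for illustration and are not taken from the Yats API. The request is only built, not sent.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request


def build_metric_request(base_url: str, series: str, value: float, ts: datetime) -> Request:
    """Build (but do not send) a POST request carrying one data point as JSON.

    The '/metrics' path and the field names are assumptions for this sketch.
    """
    body = json.dumps({
        "series": series,
        "value": value,
        # Timestamps are normalized to UTC; ts is expected to be timezone-aware.
        "timestamp": ts.astimezone(timezone.utc).isoformat(),
    }).encode("utf-8")
    return Request(
        url=f"{base_url}/metrics",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would be a matter of passing the returned object to urllib.request.urlopen, with an SSL context configured for whatever authentication the server requires.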

Operational Costs

Operational cost is another consideration: while Cassandra is simply amazing at data persistence and availability, storing the full history of any project in it can get expensive quickly.

Therefore Yats has a concept of 'hot data', which can be queried online directly in Cassandra, and 'cold data', which is swapped out of the database into Parquet files, potentially stored remotely over S3 or HTTP.

Yats does this for you while running, according to the desired archival policy. For now, only a monthly policy is implemented: metrics and logs for the last closed month are swapped out to Parquet files in a local archive.

Other policies will be added in the future, pluggable according to usage patterns.
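The core of a monthly policy is working out the boundaries of the last closed month. Here is a minimal sketch in Python, assuming month boundaries are taken in UTC and using a made-up archive file naming scheme; neither detail is confirmed by Yats.

```python
from datetime import date, datetime, timedelta, timezone


def last_closed_month(today: date) -> tuple[datetime, datetime]:
    """Return the half-open UTC range [start, end) of the month before today's month."""
    first_of_current = today.replace(day=1)
    # Stepping back one day from the 1st always lands in the previous month,
    # including across a year boundary.
    last_of_previous = first_of_current - timedelta(days=1)
    start = datetime(last_of_previous.year, last_of_previous.month, 1, tzinfo=timezone.utc)
    end = datetime(first_of_current.year, first_of_current.month, 1, tzinfo=timezone.utc)
    return start, end


def archive_name(kind: str, start: datetime) -> str:
    """Hypothetical naming scheme for the Parquet file holding one archived month."""
    return f"{kind}-{start:%Y-%m}.parquet"
```

Rows with a timestamp in [start, end) are the candidates for export to Parquet; anything at or after end belongs to the still-open month and stays in Cassandra as hot data.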

Yats Features

  • REST interface for logging, events and metrics
  • Distributed Timeseries Server
  • Datalake Syncing
  • Strong authentication with mTLS

Use cases

  • Sensor logging
  • Distributed logging

Source Code

Source code is available from this URL

Licensing

The code is freely available under the Affero GPL license; see COPYING.

Additional commercial support, custom builds and licensing are available on request. Just issue a support request and mention that you are interested in Yats.


