Cassandra as distributed NoSQL cloud database across multiple data centers
Demand for cloud hosting capacity and availability climbs every day. Throughout this growth, numerous, geographically-separate hosting locations are critical, and multiple data centers supporting a single cloud hosting platform are commonplace.
Distributed hosting services present a new challenge: the need to have a single, easily-accessible system for storage, replication, and aggregation of platform usage, performance, and metric data.
To address this need, we implemented Apache Cassandra, a highly-scalable distributed database, configured with nodes located in several data centers across the world.
The Use Case and Reasoning
We chose to implement a distributed data storage system with Cassandra because it could act as a performant, scalable, redundant, highly fault-tolerant and decentralized system spanning multiple data centers with no single point of failure.
We needed the system to facilitate large-volume storage of infrastructure based usage data, captured in real-time and used by an aggregation system to generate billable usage on a per-customer basis.
This data is critical to our daily operation, so it was a primary concern to minimize the risk that any system malfunction, including a situation like catastrophic data center loss, could result in the loss of this data which must be retained for legal reasons.
Due to the extremely large volume of data and the rate of data accrual, the system needed to be responsive enough to allow for the constant collection of data, as well as facilitate efficient data access needed by daily processes that aggregate usage, in a timely manner, so it is accessible by customers.
Some Downsides
The disadvantages of using Cassandra include a lack of flexibility in accessing data, as compared to a relational database, and also the lack of a reliable method for removing data after it’s added to the system.
For our use case, the downsides were less concerning as we addressed the data access limitations programmatically using Python for calculation and record sorting. We were not concerned with the deletion of data since data would only be inserted, updated, and accessed.
Furthermore, we needed the ability to archive older data, which Cassandra does have easily-implemented mechanisms for doing, as long as the system is configured and used in a specific manner (including column families and keyspaces).
Cassandra Nodes:
We opted to create three Cassandra nodes per data center.
- A Cassandra Cluster consists of multiple nodes, spanning multiple data centers. Each node runs its own instance of the Cassandra server and uses its local disk for data storage.
- A node can also be used for data access so having more nodes allows for more efficient data access.
- Nodes can be added at any time if more capacity is needed.
- If any single node, or even multiple nodes, become inaccessible the cluster is still able to operate with almost zero impact. This is a huge benefit when having a cluster that spans multiple data centers, as communication between data centers can be impacted for a number of reasons, and the system needs to expect and handle this situation.
- If a node goes down due to network or system issues, as long as it comes back within a reasonable amount of time, it will rejoin the cluster and even recover any data that was added to the cluster while it was inaccessible. This minimizes the impact from periodic network blips that could break replication between less reliable distributed systems.
Data center awareness:
Since data centers can become isolated, our use case required that each data center could operate independently if necessary.
The single most important concept when setting up a Cassandra cluster spanning multiple data centers is using the correct Cassandra load balancing policy. This policy changes the method used when accessing data from the cluster. The default policy is called ‘RoundRobinPolicy’ which tells the system to equally distribute data access calls among all available nodes in the cluster. In a cluster spanning multiple data centers, this can be very bad.
As an example: You may be accessing data from the western US but data may get pulled from Europe, which will result in the correct data since it is replicated to all nodes, but can run into slower data access and possible network timeouts.
Instead, use ‘DCAwareRoundRobinPolicy.’ This policy ensures when data is accessed from a data center, it is retrieved from the nodes local to that data center, providing the fastest possible data access path. If the local nodes don’t have the data being accessed, the policy will go to nodes in other data centers, but only as a last resort.
Here’s an example of how we configured ‘DCAwareRoundRobinPolicy’ using Python:
from cassandra.cluster import Cluster from cassandra.policies import DCAwareRoundRobinPolicy cluster = Cluster(cass_db_hosts) lb_policy = DCAwareRoundRobinPolicy(datacenter_name) cluster.load_balancing_policy = lb_policy session = cluster.connect()
Data center awareness can also be used to configure special-purpose nodes. For example, we have a single node configured as its own data center so it is used for storage, but not data access. The server running this node is configured with regular scheduled backups, both local to the data center and remote. This serves as one more safe guard against any catastrophic issue causing the loss or corruption of our valuable data within the cluster. A backup of this single node could be used to rebuild the entire cluster. (A situation we hope never arises.)
Keyspace configuration:
A Cassandra keyspace is a namespace used for defining data replication on nodes and used to logically segregate a set of column families. (Generally, column families are considered synonymous to tables in traditional database terminology.)
- Cassandra offers a lot of configuration and tuning at a keyspace level, including defining the data centers used by the keyspace and number of data replicas to be created in each data center.
- Having multiple keyspaces is a useful way to segregate data that does not need to be stored or accessed in tandem. We used a separate keyspace for each month worth of usage data, because all usage data is aggregated on a monthly basis and there is little need to access data spanning these boundaries at the database level.
- This approach also reduces the overall number of records each column family must store, making periodic maintenance or data recovery (due to a system issue) easier and less time-consuming.
- There are multiple tools for manipulating Cassandra data at a per keyspace level. This makes certain optimization or data archival processes much more manageable for a very large data set.
Here’s a brief example script demonstrating how we create keyspaces:
--Data centers include: US1, US2, EU1, EU2 CREATE KEYSPACE produsage_201701 WITH replication = { 'class': 'NetworkTopologyStrategy', 'US1: '3', 'US2': '3', 'EU1': '3', 'EU2': '3' };
Outcomes
Since implementation of our multiple data center cluster, we have seen many of the expected benefits, including reliable, fast data storage and access, and the ability to recover from both minor and major incidents causing isolation of data centers with minimal manual intervention. This has addressed a number of issues with our previous system and resulted in a much more reliable system for our customers to access their usage and metric data.
There are a vast number of configuration and tuning options with Cassandra, making it an ideal solution for many diverse use cases. Please share your suggestions and thoughts on other implementations in the comments.