
Cassandra as distributed NoSQL cloud database across multiple data centers

Megan Ferringer

Demand for cloud hosting capacity and availability climbs every day. To keep up with this growth, multiple geographically separate hosting locations are critical, and it is now common for several data centers to support a single cloud hosting platform.
Distributed hosting services present a new challenge: the need for a single, easily accessible system for storage, replication, and aggregation of platform usage, performance, and metric data.
To address this need, we implemented Apache Cassandra, a highly scalable distributed database, configured with nodes located in several data centers across the world.

The Use Case and Reasoning

We chose to implement a distributed data storage system with Cassandra because it could act as a performant, scalable, redundant, highly fault-tolerant, and decentralized system spanning multiple data centers with no single point of failure.
We needed the system to support large-volume storage of infrastructure-based usage data, captured in real time and used by an aggregation system to generate billable usage on a per-customer basis.
This data is critical to our daily operation and must be retained for legal reasons, so a primary concern was minimizing the risk that any system malfunction, up to and including the catastrophic loss of a data center, could cause it to be lost.
Because of the extremely large volume of data and the rate at which it accrues, the system needed to be responsive enough to collect data constantly while still giving the daily aggregation processes efficient access, so that usage figures reach customers in a timely manner.

Some Downsides

The disadvantages of using Cassandra include less flexibility in accessing data than a relational database offers, and the lack of a reliable method for removing data after it’s added to the system.
For our use case, these downsides were less concerning. We addressed the data access limitations programmatically, using Python for calculation and record sorting, and we were not concerned with deletion because data would only be inserted, updated, and accessed.
We did need the ability to archive older data, and Cassandra has mechanisms for this that are easy to implement, as long as the system is configured and used in a specific manner (including how column families and keyspaces are laid out).
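
As a rough illustration of handling sorting and calculation in Python rather than in the database, here is a minimal sketch. The row layout and the field names (customer_id, captured_at, cpu_hours) are hypothetical and are not taken from our schema; the rows stand in for whatever a driver query returns.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical rows, shaped like the results of a usage query.
rows = [
    {"customer_id": "c2", "captured_at": "2017-01-03T10:00", "cpu_hours": 4.0},
    {"customer_id": "c1", "captured_at": "2017-01-02T09:00", "cpu_hours": 1.5},
    {"customer_id": "c1", "captured_at": "2017-01-01T08:00", "cpu_hours": 2.5},
]


def sort_rows(rows):
    """Order rows by customer, then by capture time, on the client side."""
    return sorted(rows, key=itemgetter("customer_id", "captured_at"))


def total_usage(rows):
    """Sum the usage metric per customer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += row["cpu_hours"]
    return dict(totals)


ordered = sort_rows(rows)
totals = total_usage(rows)
```

Doing this work on the client keeps the queries sent to Cassandra simple, at the cost of pulling more rows over the network than a relational GROUP BY or ORDER BY would.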

Cassandra Nodes:

We opted to create three Cassandra nodes per data center.

  • A Cassandra Cluster consists of multiple nodes, spanning multiple data centers. Each node runs its own instance of the Cassandra server and uses its local disk for data storage.
  • A node can also be used for data access, so having more nodes allows for more efficient data access.
  • Nodes can be added at any time if more capacity is needed.
  • If any single node, or even multiple nodes, becomes inaccessible, the cluster is still able to operate with almost zero impact. This is a huge benefit for a cluster that spans multiple data centers, as communication between data centers can be impacted for a number of reasons, and the system needs to expect and handle this situation.
  • If a node goes down due to network or system issues, as long as it comes back within a reasonable amount of time, it will rejoin the cluster and even recover any data that was added to the cluster while it was inaccessible. This minimizes the impact from periodic network blips that could break replication between less reliable distributed systems.
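
How many unreachable nodes a request can tolerate depends on the replication factor and on how many replicas must respond. This article does not state which consistency level we used, so the following is only a sketch of the standard quorum arithmetic, using a replication factor of 3 per data center as an assumed input. The function names are illustrative.

```python
def quorum(replication_factor):
    """Replicas that must respond for a quorum: a strict majority."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor, required_replicas):
    """Replicas that can be unreachable while a request still succeeds."""
    return replication_factor - required_replicas


rf = 3  # assumed replicas per data center
quorum_needed = quorum(rf)
down_ok_at_quorum = tolerated_failures(rf, quorum_needed)
down_ok_at_one = tolerated_failures(rf, 1)
```

With three replicas in a data center, a quorum request in that data center needs two of them, so one node can be down; a request that needs only a single replica can succeed with two nodes down.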

Data center awareness:

Since data centers can become isolated, our use case required that each data center could operate independently if necessary.
The single most important concept when setting up a Cassandra cluster spanning multiple data centers is using the correct Cassandra load balancing policy. This policy changes the method used when accessing data from the cluster. The default policy is called ‘RoundRobinPolicy’, which tells the system to distribute data access calls equally among all available nodes in the cluster. In a cluster spanning multiple data centers, this can hurt performance badly.
As an example: you may be accessing data from the western US, but the data may get pulled from Europe. The result will be correct, since the data is replicated to all nodes, but access will be slower and network timeouts become possible.
Instead, use ‘DCAwareRoundRobinPolicy.’ This policy ensures that when data is accessed from a data center, it is retrieved from the nodes local to that data center, providing the fastest possible data access path. If the local nodes are unavailable, the policy can go to nodes in other data centers, but only as a last resort. (In the Python driver, this fallback only applies when used_hosts_per_remote_dc is set above its default of 0.)
Here’s an example of how we configured ‘DCAwareRoundRobinPolicy’ using Python:

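from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy

# Placeholder values: substitute your own contact points and local data center name.
cass_db_hosts = ['10.0.0.1', '10.0.0.2']
datacenter_name = 'US1'

# Prefer nodes in the local data center for every query.
lb_policy = DCAwareRoundRobinPolicy(local_dc=datacenter_name)
cluster = Cluster(cass_db_hosts, load_balancing_policy=lb_policy)
session = cluster.connect()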
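
To make the difference between the two policies concrete, here is a toy model in plain Python. It is not the driver's implementation: the host names, the function names, and the remote_hosts_per_dc parameter are all made up for illustration, and the model only shows the order in which hosts would be tried.

```python
# Hypothetical (host, data center) pairs.
hosts = [
    ("us-a", "US1"),
    ("eu-a", "EU1"),
    ("us-b", "US1"),
    ("eu-b", "EU1"),
]


def round_robin_plan(hosts, start=0):
    """Try every host in rotation, ignoring which data center it is in."""
    n = len(hosts)
    return [hosts[(start + i) % n] for i in range(n)]


def dc_aware_plan(hosts, local_dc, start=0, remote_hosts_per_dc=0):
    """Try local hosts in rotation first, then a limited number of remote ones."""
    local = [h for h in hosts if h[1] == local_dc]
    plan = [local[(start + i) % len(local)] for i in range(len(local))]
    remote_by_dc = {}
    for host in hosts:
        if host[1] != local_dc:
            remote_by_dc.setdefault(host[1], []).append(host)
    for dc_hosts in remote_by_dc.values():
        # Remote hosts are appended last, so they are used only as a fallback.
        plan.extend(dc_hosts[:remote_hosts_per_dc])
    return plan


plain = round_robin_plan(hosts)
local_only = dc_aware_plan(hosts, "US1")
with_fallback = dc_aware_plan(hosts, "US1", remote_hosts_per_dc=1)
```

In the plain rotation, the second host tried from a US client is already a European node. In the data-center-aware plan, both US nodes come first, and a European node appears only when a remote fallback is explicitly allowed.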

Data center awareness can also be used to configure special-purpose nodes. For example, we have a single node configured as its own data center so it is used for storage, but not data access. The server running this node is configured with regularly scheduled backups, both local to the data center and remote. This serves as one more safeguard against any catastrophic issue causing the loss or corruption of our valuable data within the cluster. A backup of this single node could be used to rebuild the entire cluster. (A situation we hope never arises.)

Keyspace configuration:

A Cassandra keyspace is a namespace that defines how data is replicated across nodes and logically segregates a set of column families. (Generally, column families are considered synonymous with tables in traditional database terminology.)

  • Cassandra offers a lot of configuration and tuning at a keyspace level, including defining the data centers used by the keyspace and number of data replicas to be created in each data center.
  • Having multiple keyspaces is a useful way to segregate data that does not need to be stored or accessed in tandem. We used a separate keyspace for each month’s worth of usage data, because all usage data is aggregated on a monthly basis and there is little need to access data spanning these boundaries at the database level.
  • This approach also reduces the overall number of records each column family must store, making periodic maintenance or data recovery (due to a system issue) easier and less time-consuming.
  • There are multiple tools for manipulating Cassandra data at a per-keyspace level. This makes certain optimization or data archival processes much more manageable for a very large data set.

Here’s a brief example script demonstrating how we create keyspaces:

-- Data centers include: US1, US2, EU1, EU2
CREATE KEYSPACE produsage_201701 WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'US1': '3',
  'US2': '3',
  'EU1': '3',
  'EU2': '3'
};
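
Because a new keyspace is needed every month, the statement above lends itself to being generated. The following is a minimal sketch of one way to do that in Python. The helper names are hypothetical, the produsage prefix and data center names are copied from the example above, and the IF NOT EXISTS clause is an addition that makes the statement safe to run more than once.

```python
from datetime import date

# Data center names as used in the example keyspace above.
DATA_CENTERS = ("US1", "US2", "EU1", "EU2")


def keyspace_name(day, prefix="produsage"):
    """Name of the monthly keyspace that holds usage data for the given day."""
    return f"{prefix}_{day:%Y%m}"


def create_keyspace_cql(day, replicas_per_dc=3, data_centers=DATA_CENTERS):
    """Build the CREATE KEYSPACE statement for the month containing `day`."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': '{replicas_per_dc}'" for dc in data_centers]
    body = ",\n  ".join(options)
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace_name(day)} "
        f"WITH replication = {{\n  {body}\n}};"
    )


statement = create_keyspace_cql(date(2017, 1, 15))
```

The same keyspace_name helper can be used when reading or writing, so that a record captured on a given day is always routed to the keyspace for its month.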

Outcomes

Since implementation of our multiple data center cluster, we have seen many of the expected benefits, including reliable, fast data storage and access, and the ability to recover from both minor and major incidents causing isolation of data centers with minimal manual intervention. This has addressed a number of issues with our previous system and resulted in a much more reliable system for our customers to access their usage and metric data.
There are a vast number of configuration and tuning options with Cassandra, making it an ideal solution for many diverse use cases. Please share your suggestions and thoughts on other implementations in the comments.

About the Author
Megan Ferringer

Megan is the Content Marketing Manager at Navisite with more than 10 years of experience helping brands discover and tell their stories. From working at a global non-profit organization to boutique marketing agencies in Chicago, she champions the power of storytelling across all industries.
