Apache Cassandra

Hi all, after a while I’m back with a new topic. The next couple of blog posts will be about Apache Cassandra, a NoSQL database. Cassandra is a distributed database from Apache that is highly scalable and designed to manage very large amounts of structured data. It provides high availability with no single point of failure.

Cassandra has also become very popular as a big data solution. It was developed at Facebook to power their inbox search. Facebook then released it as open source, and it later became an Apache project.

The main features and advantages of Cassandra are:

  • Open source
  • No single point of failure, thanks to its peer-to-peer architecture
  • Elastic scalability
  • High availability
  • Fault tolerance
  • High performance
  • Column oriented
  • Tunable consistency
  • Schema free
  • Easy data distribution
  • Transaction support
  • Fast data writes
  • Cloud enabled
  • Data compression
  • No caching layer required
  • No special hardware needed
  • Simple install and setup

The design goal of Cassandra's data architecture is to handle big data workloads across multiple nodes without any single point of failure. Cassandra is a peer-to-peer distributed system, and data is distributed among all the nodes in a cluster. All the nodes in a cluster play the same role. Each node is independent and at the same time interconnected with the other nodes. Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster. When a node goes down, read/write requests can be served from other nodes in the network. For these reasons Cassandra offers high availability, and because data is kept on more than one node, losing a single node does not by itself mean losing data.
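To make the "any node can serve any request" idea concrete, here is a minimal toy sketch in Python. It is not how Cassandra is implemented: the class names, the replication factor of 2, and the hash-based placement are all assumptions made for illustration only.

```python
import hashlib


class Node:
    """One toy node: it stores data locally and can be up or down."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class ToyCluster:
    """All nodes are peers; any live node can act as coordinator."""

    def __init__(self, names, replication_factor=2):
        self.nodes = [Node(n) for n in names]
        self.rf = replication_factor

    def replicas_for(self, key):
        # Assumed placement rule: hash the key to pick a starting node,
        # then use the next nodes in order as additional replicas.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.rf)]

    def write(self, coordinator, key, value):
        if not coordinator.up:
            raise RuntimeError(coordinator.name + " is down")
        written = 0
        for replica in self.replicas_for(key):
            if replica.up:
                replica.data[key] = value
                written += 1
        return written  # number of replicas that stored the value

    def read(self, coordinator, key):
        if not coordinator.up:
            raise RuntimeError(coordinator.name + " is down")
        for replica in self.replicas_for(key):
            if replica.up and key in replica.data:
                return replica.data[key]
        return None


cluster = ToyCluster(["node1", "node2", "node3"])
cluster.write(cluster.nodes[0], "user:42", "Alice")

# Take one replica of the key down, then read through any surviving node.
cluster.replicas_for("user:42")[0].up = False
survivor = next(n for n in cluster.nodes if n.up)
print(cluster.read(survivor, "user:42"))  # prints "Alice"
```

The point of the sketch is only that the client does not need to know which node holds the data, and that a second copy keeps the value readable when one node is down.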

Now let’s have a look at the key components in Cassandra.

  • Node − The place where the data is stored.
  • Data center − A collection of related nodes.
  • Cluster − A component that contains one or more data centers.
  • Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
  • Mem-table − An in-memory structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
  • SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.

As we can see, a write first goes to the commit log. It is then written to the mem-table and later flushed to an SSTable. This makes crash recovery straightforward: the mem-table lives in memory and is lost in a crash, but it can be rebuilt from the commit log.
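The write path above can be sketched as a small Python toy. This is a simplified model, not Cassandra's real storage engine: the `ToyNode` class, the flush threshold of three entries, and the use of plain lists and dicts are assumptions chosen for illustration.

```python
class ToyNode:
    """Toy model of the write path: commit log -> mem-table -> SSTable."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # stands in for the durable on-disk log
        self.memtable = {}     # in memory, lost on a crash
        self.sstables = []     # flushed, read-only snapshots "on disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log the write first
        self.memtable[key] = value            # 2. then update the mem-table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush at the threshold

    def flush(self):
        # Store the mem-table contents sorted by key, then start fresh.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []  # flushed entries no longer need replaying

    def crash(self):
        self.memtable = {}    # memory is gone; log and SSTables survive

    def recover(self):
        for key, value in self.commit_log:    # replay unflushed writes
            self.memtable[key] = value

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest SSTable first
            if key in sstable:
                return sstable[key]
        return None


node = ToyNode(memtable_limit=3)
for k, v in [("a", 1), ("b", 2), ("c", 3), ("d", 4)]:
    node.write(k, v)

node.crash()
print(node.read("d"))  # None: "d" was only in the mem-table
node.recover()
print(node.read("d"))  # 4: rebuilt from the commit log
print(node.read("a"))  # 1: served from the flushed SSTable
```

Writing to the log before touching the mem-table is what makes the replay in `recover` sufficient: anything acknowledged but not yet flushed is still in the log.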

Now that we have a basic understanding of what Cassandra is, let’s compare it with MySQL.

[Image: comparison of Cassandra and MySQL]

Now let’s look at Cassandra’s data types compared with MySQL’s.

[Image: Cassandra data types compared with MySQL data types]

The next question is how to query Cassandra. MySQL has SQL for querying; likewise, Cassandra comes with CQL (Cassandra Query Language), a SQL-like language. Queries use the standard SELECT command, while DML operations use the familiar INSERT, UPDATE, DELETE, and TRUNCATE commands. DDL commands such as CREATE are used to create new keyspaces and column families. Although CQL has many similarities to SQL, it does not change the underlying Cassandra data model. There is no support for JOIN commands.
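As a rough illustration of the commands just mentioned, the Python sketch below holds a handful of CQL statements and passes them one by one to whatever function executes them. The keyspace `demo`, the table `users`, and the sample row are hypothetical names made up for this example, and `run_all` is just a helper defined here, not part of any library.

```python
# Hypothetical keyspace, table, and data, used for illustration only.
STATEMENTS = (
    # DDL: create a keyspace and a table (column family)
    "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1}",
    "CREATE TABLE IF NOT EXISTS demo.users "
    "(user_id int PRIMARY KEY, name text, email text)",
    # DML: insert, update
    "INSERT INTO demo.users (user_id, name, email) "
    "VALUES (1, 'Alice', 'alice@example.com')",
    "UPDATE demo.users SET email = 'alice@example.org' WHERE user_id = 1",
    # Query
    "SELECT name, email FROM demo.users WHERE user_id = 1",
    # DML: delete one row, then empty the table
    "DELETE FROM demo.users WHERE user_id = 1",
    "TRUNCATE demo.users",
)


def run_all(execute, statements=STATEMENTS):
    """Pass each CQL statement to `execute` in order; return the results."""
    return [execute(stmt) for stmt in statements]


if __name__ == "__main__":
    # Without a cluster at hand, simply print the statements.
    run_all(print)
```

Assuming the DataStax Python driver (`cassandra-driver`) is installed and a node is listening on 127.0.0.1, something like `session = Cluster(["127.0.0.1"]).connect()` followed by `run_all(session.execute)` should run the same statements against a real cluster. Note that every statement touches a single table, since CQL has no JOIN.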

I hope you now have a clear idea of what Cassandra is. In the next post I’ll show how to install Cassandra and how to run queries. See you soon with the second post on Cassandra. Thank you!

 
