Cassandra Term Soup


Node

Where you store your data. It is the basic infrastructure component of Cassandra.

Data center

A collection of related nodes. A data center can be a physical data center or virtual data center. Different workloads should use separate data centers, either physical or virtual. Replication is set by data center. Using separate data centers prevents Cassandra transactions from being impacted by other workloads and keeps requests close to each other for lower latency. Depending on the replication factor, data can be written to multiple data centers. However, data centers should never span physical locations.

Cluster

A cluster contains one or more data centers. Clusters can span physical locations.

Commit log

All data is written first to the commit log for durability. After all of its data has been flushed to SSTables, the commit log can be archived, deleted, or recycled.

Keyspace (database)

The Cassandra keyspace is a namespace that defines how data is replicated on nodes. Typically, a cluster has one keyspace per application. Replication is controlled on a per-keyspace basis, so data that has different replication requirements typically resides in different keyspaces.
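
For illustration, here is how a keyspace with per-data-center replication might be created through the DataStax Python driver. The contact point, keyspace name, and data center names (dc1, dc2) are made up for the example.

from cassandra.cluster import Cluster

# Connect to any node in the cluster (address is illustrative).
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# Replication is defined at the keyspace level, per data center:
# three copies of every row in dc1, two in dc2 (names are hypothetical).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_app
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'dc1': 3,
        'dc2': 2
    }
""")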

Column family (table)

A collection of ordered columns fetched by row. A row consists of columns and has a primary key; the first part of the primary key is the partition key, which determines where the row is stored.
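
A hypothetical table definition makes the primary key structure concrete: user_id is the partition key that determines which node the row lives on, and created_at is a clustering column that orders the data within the partition. The keyspace and column names continue the example above.

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_app')   # keyspace from the previous example

# user_id is the partition key; created_at is a clustering column
# that keeps the rows of a partition sorted on disk.
session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        user_id    uuid,
        created_at timestamp,
        payload    text,
        PRIMARY KEY (user_id, created_at)
    )
""")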

memtable

A Cassandra table-specific, in-memory data structure that resembles a write-back cache. When a write occurs, Cassandra stores the data in a structure in memory, the memtable, and also appends writes to the commit log on disk, providing configurable durability.

SSTable

A sorted string table (SSTable) is an immutable data file to which Cassandra periodically writes memtables. SSTables are append-only, stored sequentially on disk, and maintained for each Cassandra table.
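
The interplay between commit log, memtable, and SSTable can be sketched with a toy key-value store (pure Python, nothing like the real implementation): every write is appended to a log for durability and kept in memory, and the in-memory structure is periodically flushed to an immutable, sorted file.

import json, os

class ToyStore:
    def __init__(self, data_dir='toy_data', flush_threshold=1000):
        os.makedirs(data_dir, exist_ok=True)
        self.data_dir = data_dir
        self.flush_threshold = flush_threshold
        self.memtable = {}    # in-memory write-back structure
        self.commit_log = open(os.path.join(data_dir, 'commit.log'), 'a')
        self.sstable_count = 0

    def write(self, key, value):
        # 1. Append to the commit log first, for durability.
        self.commit_log.write(json.dumps([key, value]) + '\n')
        self.commit_log.flush()
        # 2. Update the memtable.
        self.memtable[key] = value
        # 3. When the memtable is big enough, flush it to an "SSTable".
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # The flushed file is immutable and sorted by key; once written,
        # the corresponding portion of the commit log could be recycled.
        path = os.path.join(self.data_dir, f'sstable-{self.sstable_count}.json')
        with open(path, 'w') as f:
            for key in sorted(self.memtable):
                f.write(json.dumps([key, self.memtable[key]]) + '\n')
        self.sstable_count += 1
        self.memtable = {}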

Bloom filter

An off-heap structure associated with each SSTable that checks if any data for the requested row exists in the SSTable before doing any disk I/O.
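
A minimal Bloom filter sketch shows the idea, with md5 standing in for the hash functions a real implementation would use: a negative answer means the key is definitely not in the SSTable and the disk read can be skipped, while a positive answer only means it might be there.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive num_hashes bit positions from one md5 digest (illustrative only).
        digest = hashlib.md5(key.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, 'big') % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False => definitely not present, skip the disk read.
        # True  => maybe present (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add('user:42')
print(bf.might_contain('user:42'))   # True
print(bf.might_contain('user:999'))  # almost certainly False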

Gossip (internode communication)

Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster. A gossip message has a version associated with it, so that during a gossip exchange, older information is overwritten with the most current state for a particular node.
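
The version-based merging can be sketched as follows; node names, versions, and states are invented for illustration.

import random

# Each node's view of the cluster: node name -> (version, state).
cluster_view = {
    'node_a': {'node_a': (12, 'UP'), 'node_b': (7, 'UP')},
    'node_b': {'node_a': (9, 'UP'), 'node_b': (8, 'UP'), 'node_c': (3, 'DOWN')},
    'node_c': {'node_c': (4, 'UP')},
}

def gossip_round(a, b):
    # Both sides exchange their views; the higher version for a node wins.
    for view_from, view_to in ((cluster_view[a], cluster_view[b]),
                               (cluster_view[b], cluster_view[a])):
        for node, (version, state) in list(view_from.items()):
            if node not in view_to or view_to[node][0] < version:
                view_to[node] = (version, state)

# Each node periodically gossips with a couple of random peers
# (Cassandra gossips every second with up to three other nodes).
for node in list(cluster_view):
    for peer in random.sample([n for n in cluster_view if n != node], k=2):
        gossip_round(node, peer)

print(cluster_view['node_a'])   # node_a has now learned about node_c as well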

Seed Node

The seed node designation has no purpose other than bootstrapping the gossip process for new nodes joining the cluster. Seed nodes are not a single point of failure, nor do they have any other special purpose in cluster operations beyond the bootstrapping of nodes.

Consistent Hashing/Partitioner/Murmur3

Cassandra distributes data across nodes using consistent hashing. A partitioner hashes the partition key of each row to determine which node that row will go to. The default partitioner in Cassandra is Murmur3Partitioner.
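
A rough sketch of the idea, with md5 standing in for Murmur3 and invented token ranges: each node owns a slice of the token space, and the partitioner hashes the partition key to find the owning node.

import bisect
import hashlib

# Each node owns the range of tokens up to (and including) its own token.
# Tokens here are made up; md5 stands in for Murmur3 in this sketch.
ring = sorted([
    (2**125, 'node_a'),
    (2**126, 'node_b'),
    (2**127, 'node_c'),
])
tokens = [t for t, _ in ring]

def token_for(partition_key):
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest(), 'big')

def owner_of(partition_key):
    token = token_for(partition_key)
    # Walk "clockwise" to the first node token >= our token,
    # wrapping around to the first node if we fall past the last one.
    idx = bisect.bisect_left(tokens, token) % len(ring)
    return ring[idx][1]

print(owner_of('user:42'))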

Replication

A replication factor must be specified whenever a database (keyspace) is defined. The replication factor specifies how many copies of the data there will be within that keyspace.
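
As a simplified illustration of what a replication factor of 2 means on the ring from the previous sketch (re-stated here so the snippet runs on its own), a SimpleStrategy-style placement puts the extra copies on the next distinct nodes clockwise, ignoring racks and data centers.

import bisect
import hashlib

ring = sorted([(2**125, 'node_a'), (2**126, 'node_b'), (2**127, 'node_c')])
tokens = [t for t, _ in ring]

def replicas_for(partition_key, replication_factor):
    # Pick replication_factor distinct nodes, walking clockwise from the owner.
    token = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), 'big')
    start = bisect.bisect_left(tokens, token) % len(ring)
    replicas = []
    i = start
    while len(replicas) < min(replication_factor, len(ring)):
        node = ring[i % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        i += 1
    return replicas

print(replicas_for('user:42', replication_factor=2))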

Snitches

A snitch determines which data centers and racks are written to and read from. Snitches inform Cassandra about the network topology so that requests are routed efficiently, and they allow Cassandra to distribute replicas by grouping machines into data centers and racks. All nodes must have exactly the same snitch configuration. Cassandra does its best not to have more than one replica on the same rack (which is not necessarily a physical location).

CAP Theorem

The CAP theorem states that you have to pick two of Consistency, Availability, and Partition tolerance: you can't have all three at the same time with acceptable latency. Cassandra favors Availability and Partition tolerance (AP). Tradeoffs between consistency and latency are tunable in Cassandra; you can get strong consistency with Cassandra, at the cost of increased latency.
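
For example, with a replication factor of 3, writing and reading at QUORUM (2 of 3 replicas each) guarantees that a read sees the latest write, because any two quorums overlap; the price is waiting for more replicas. The cluster address, keyspace, and table below continue the earlier hypothetical examples.

import datetime
import uuid

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_app')

# With RF = 3, QUORUM means 2 replicas must acknowledge the write and
# 2 must answer the read, so reads always overlap with the latest write.
write = SimpleStatement(
    "INSERT INTO events (user_id, created_at, payload) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
read = SimpleStatement(
    "SELECT payload FROM events WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)

user_id = uuid.uuid4()
session.execute(write, (user_id, datetime.datetime.utcnow(), 'hello'))
print(session.execute(read, (user_id,)).one())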

MongoDB: Find and remove duplicates without MapReduce

The following command will find all emails that exist more than once in a collection called users:

db.users.aggregate([
 {$group: {_id: "$email", count: {$sum: 1}}},
 {$match: {count: {$gt: 1} }} 
])

The following command will create a unique index on email and delete the duplicates (note that the dropDups option was removed in MongoDB 3.0):

db.users.createIndex( {email: 1}, {unique: true, dropDups: true} )