PacktLib: Cassandra High Performance Cookbook


About the Author

About the Reviewers


Getting Started


A simple single node Cassandra installation

Reading and writing test data using the command-line interface

Running multiple instances on a single machine

Scripting a multiple instance installation

Setting up a build and test environment for tasks in this book

Running in the foreground with full debugging

Calculating ideal Initial Tokens for use with Random Partitioner

Choosing Initial Tokens for use with Partitioners that preserve ordering

Insight into Cassandra with JConsole

Connecting with JConsole over a SOCKS proxy

Connecting to Cassandra with Java and Thrift

The Command-line Interface

Connecting to Cassandra with the CLI

Creating a keyspace from the CLI

Creating a column family with the CLI

Describing a keyspace

Writing data with the CLI

Reading data with the CLI

Deleting rows and columns from the CLI

Listing and paginating all rows in a column family

Dropping a keyspace or a column family

CLI operations with super columns

Using the assume keyword to decode column names or column values

Supplying time to live information when inserting columns

Using built-in CLI functions

Using column metadata and comparators for type enforcement

Changing the consistency level of the CLI

Getting help from the CLI

Loading CLI statements from a file

Application Programmer Interface


Connecting to a Cassandra server

Creating a keyspace and column family from the client

Using MultiGet to limit round trips and overhead

Writing unit tests with an embedded Cassandra server

Cleaning up data directories before unit tests

Generating Thrift bindings for other languages (C++, PHP, and others)

Using the Cassandra Storage Proxy "Fat Client"

Using range scans to find and remove old data

Iterating all the columns of a large key

Slicing columns in reverse

Batch mutations to improve insert performance and code robustness

Using TTL to create columns with self-deletion times

Working with secondary indexes

Performance Tuning


Choosing an operating system and distribution

Choosing a Java Virtual Machine

Using a dedicated Commit Log disk

Choosing a high performing RAID level

File system optimization for hard disk performance

Boosting read performance with the Key Cache

Boosting read performance with the Row Cache

Disabling Swap Memory for predictable performance

Stopping Cassandra from using swap without disabling it system-wide

Enabling Memory Mapped Disk modes

Tuning Memtables for write-heavy workloads

Saving memory on 64-bit architectures with compressed pointers

Tuning concurrent readers and writers for throughput

Setting compaction thresholds

Garbage collection tuning to avoid JVM pauses

Raising the open file limit to deal with many clients

Increasing performance by scaling up

Consistency, Availability, and Partition Tolerance with Cassandra


Working with the formula for strong consistency

Supplying the timestamp value with write requests

Disabling the hinted handoff mechanism

Adjusting read repair chance for less intensive data reads

Confirming schema agreement across the cluster

Adjusting replication factor to work with quorum

Using write consistency ONE, read consistency ONE for low latency operations

Using write consistency QUORUM, read consistency QUORUM for strong consistency

Mixing levels: write consistency QUORUM, read consistency ONE

Choosing consistency over availability with consistency ALL

Choosing availability over consistency with write consistency ANY

Demonstrating how consistency is not a lock or a transaction

Schema Design


Saving disk space by using small column names

Serializing data into large columns for smaller index sizes

Storing time series data effectively

Using Super Columns for nested maps

Using a lower Replication Factor for disk space saving and performance enhancements

Hybrid Random Partitioner using Order Preserving Partitioner

Storing large objects

Using Cassandra for distributed caching

Storing large or infrequently accessed data in a separate column family

Storing and searching edge graph data in Cassandra

Developing secondary data orderings or indexes

Administration


Defining seed nodes for Gossip Communication

Nodetool Move: Moving a node to a specific ring location

Nodetool Remove: Removing a downed node

Nodetool Decommission: Removing a live node

Joining nodes quickly with auto_bootstrap set to false

Generating SSH keys for password-less interaction

Copying the data directory to new hardware

A node join using external data copy methods

Nodetool Repair: When to use anti-entropy repair

Nodetool Drain: Stable files on upgrade

Lowering gc_grace for faster tombstone cleanup

Scheduling Major Compaction

Using nodetool snapshot for backups

Clearing snapshots with nodetool clearsnapshot

Restoring from a snapshot

Exporting data to JSON with sstable2json

Nodetool cleanup: Removing excess data

Nodetool Compact: Defragment data and remove deleted data from disk

Multiple Datacenter Deployments

Changing debugging to determine where read operations are being routed

Using IPTables to simulate complex network scenarios in a local environment

Choosing IP addresses to work with RackInferringSnitch

Scripting a multiple datacenter installation

Determining natural endpoints, datacenter, and rack for a given key

Manually specifying Rack and Datacenter configuration with a property file snitch

Troubleshooting dynamic snitch using JConsole

Quorum operations in multi-datacenter environments

Using traceroute to troubleshoot latency between network devices

Ensuring bandwidth between switches in multiple rack environments

Increasing rpc_timeout for dealing with latency across datacenters

Changing consistency level from the CLI to test various consistency levels with multiple datacenter deployments

Using the consistency levels TWO and THREE

Calculating Ideal Initial Tokens for use with Network Topology Strategy and Random Partitioner

Coding and Internals


Installing common development tools

Building Cassandra from source

Creating your own type by subclassing AbstractType

Using the validation to check data on insertion

Communicating with the Cassandra developers and users through IRC and e-mail

Generating a diff using Subversion's diff feature

Applying a diff using the patch command

Using strings and od to quickly search through data files

Customizing the sstable2json export utility

Configuring index interval ratio for lower memory usage

Increasing phi_convict_threshold for less reliable networks

Using the Cassandra maven plugin

Libraries and Applications


Building the contrib stress tool for benchmarking

Inserting and reading data with the stress tool

Running the Yahoo! Cloud Serving Benchmark

Hector, a high-level client for Cassandra

Doing batch mutations with Hector

Cassandra with Java Persistence API (JPA)

Setting up Solandra for full text indexing with a Cassandra backend

Setting up ZooKeeper to support Cages for transactional locking

Using Cages to implement an atomic read and set

Using Groovandra as a CLI alternative

Searchable log storage with Logsandra

Hadoop and Cassandra


A pseudo-distributed Hadoop setup

A Map-only program that reads from Cassandra using the ColumnFamilyInputFormat

A Map-only program that writes to Cassandra using the CassandraOutputFormat

Using MapReduce to do grouping and counting with Cassandra input and output

Setting up Hive with Cassandra Storage Handler support

Defining a Hive table over a Cassandra Column Family

Joining two Column Families with Hive

Grouping and counting column values with Hive

Co-locating Hadoop Task Trackers on Cassandra nodes

Setting up a "Shadow" data center for running only MapReduce jobs

Setting up DataStax Brisk, the combined stack of Cassandra, Hadoop, and Hive

Collecting and Analyzing Performance Statistics

Finding bottlenecks with nodetool tpstats

Using nodetool cfstats to retrieve column family statistics

Monitoring CPU utilization

Adding read/write graphs to find active column families

Using Memtable graphs to profile when and why they flush

Graphing SSTable count

Monitoring disk utilization and having a performance baseline

Monitoring compaction by graphing its activity

Using nodetool compactionstats to check the progress of compaction

Graphing column family statistics to track average/max row sizes

Using latency graphs to profile time to seek keys

Tracking the physical disk size of each column family over time

Using nodetool cfhistograms to see the distribution of query latencies

Tracking open networking connections

Monitoring Cassandra Servers


Forwarding Log4j logs to a central server

Using top to understand overall performance

Using iostat to monitor current disk performance

Using sar to review performance over time

Using JMXTerm to access Cassandra JMX

Monitoring the garbage collection events

Using tpstats to find bottlenecks

Creating a Nagios Check Script for Cassandra

Keep an eye out for large rows with compaction limits

Reviewing network traffic with IPTraf

Keep on the lookout for dropped messages

Inspecting column families for dangerous conditions