DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Related

  • Manual Sharding in PostgreSQL: A Step-by-Step Implementation Guide
  • Why Database Migrations Take Months and How to Speed Them Up
  • Unmasking Entity-Based Data Masking: Best Practices 2025
  • How Trustworthy Is Big Data?

Trending

  • Secure by Design: Modernizing Authentication With Centralized Access and Adaptive Signals
  • Kullback–Leibler Divergence: Theory, Applications, and Implications
  • Event-Driven Architectures: Designing Scalable and Resilient Cloud Solutions
  • Chat With Your Knowledge Base: A Hands-On Java and LangChain4j Guide
  1. DZone
  2. Data Engineering
  3. Databases
  4. Cassandra Indexing: The Good, the Bad and the Ugly

Cassandra Indexing: The Good, the Bad and the Ugly

By 
Brian O' Neill user avatar
Brian O' Neill
·
Mar. 23, 12 · Interview
Likes (0)
Comment
Save
Tweet
Share
35.1K Views

Join the DZone community and get the full member experience.

Join For Free
Within NoSQL, the operations of indexing, fetching and searching for information are intimately tied to the physical storage mechanisms.  It is important to remember that rows are stored across hosts, but a single row is stored on a single host. (with replicas)  Columns families are stored in sorted order, which makes querying a set of columns efficient (provided you are spanning rows).

The Bad : Partitioning



One of the tough things to get used to at first is that without any indexes queries that span rows can (very) be bad.  Thinking back to our storage model however, that isn't surprising.  The strategy that Cassandra uses to distribute the rows across hosts is called Partitioning. 


Partitioning is the act of carving up the range of rowkeys assigning them into the "token ring", which also assigns responsibility for a segment (i.e. partition) of the rowkey range to each host.  You've probably seen this when you initialized your cluster with a "token".  The token gives the host a location along the token ring, which assigns responsibility for a section of the token range. Partitioning is the act of mapping the rowkey into the token range.  


There are two primary partitioners: Random and Order Preserving.  They are appropriately named.  The RandomPartitioner hashes the rowkeys into tokens.  With the RandomPartitioner, the token is a hash of the rowkey.  This does a good job of evenly distributing your data across a set of nodes, but makes querying a range of the rowkey space incredibly difficult. From only a "start rowkey" value and an "end rowkey" value, Cassandra can't determine what range of the token space you need.  It essentially needs to perform a "table scan" to answer the query, and a "table scan" in Cassandra is bad because it needs to go to each machine (most likely ALL machines if you have a good hash function) to answer the query.


Now, at the great cost of even data distribution, you can employ the OrderPreservingPartitioner (OPP).  I am *not* down with OPP.  The OPP preserves order as it translates rowkeys into tokens.  Now, given a start rowkey value and a end rowkey value, Cassandra *can* determine exactly which hosts have the data you are looking for.  It computes the start value to a token the end value to a token, and simply selects and returns everything in between. BUT, by preserving order, unless your rowkeys are evenly distributed across the space, your tokens won't be either and you'll get a lopsided cluster, which greatly increases the cost of configuration and administration of the cluster. (not worth it)

The Good : Secondary Indexes



Cassandra does provide a native indexing mechanism in Secondary Indexes.  Secondary Indexes work off of the columns values.   You declare a secondary index on a Column Family.  Datastax has good documentation on the usage.  Under the hood,  Cassandra maintains a "hidden column family" as the index. (See Ed Anuff's presentation for specifics)  Since Cassandra doesn't maintain column value information in any one node, and secondary indexes are on columns value (rather than rowkeys), a query still needs to be sent to all nodes.  Additionally, secondary indexes are not recommended for high-cardinality sets.  I haven't looked yet, but I'm assuming this is because of the data model used within the "hidden column family".  If the hidden column family stores a row per unique value (with rowkeys as columns), then it would mean scanning the rows to determine if they are within the range in the query.


From Ed's presentation:
  • Not recommended for high cardinality values(i.e.timestamps,birthdates,keywords,etc.)
  • Requires at least one equality comparison in a query--not great for less-than/greater-than/range queries
  • Unsorted - results are in token order, not query value order
  • Limited to search on datatypes, Cassandra natively understands

With all that said, secondary indexes work out of the box and we've had good success using them on simple values.


The Ugly : Do-It-Yourself (DIY) / Wide-Rows



Now, beauty is in the eye of the beholder.  One of the beautiful things about NoSQL is the simplicity.  The constructs are simple: Keyspaces, Column Families, Rows and Columns.  Keeping it simple however means sometimes you need to take things into your own hands.


This is the case with wide-row indexes.  Utilizing Cassandra's storage model, its easy to build your own indexes where each row-key becomes a column in the index.  This is sometimes hard to get your head around, but lets imagine we have a case whereby we want to select all users in a zip code.  The main users column family is keyed on userid, zip code is a column on each user row.  We could use secondary indexes, but there are quite a few zip codes.  Instead we could maintain a column family with a single row called "idx_zipcode".  We could then write columns into this row of the form "zipcode_userid".  Since the columns are stored in sorted order, it is fast to query for all columns that start with "18964" (e.g. we could use 18964_ and 18964_ZZZZZZ as start and end values).


One obvious downside of this approach is that rows are self-contained on a host. (again except for replicas)  This means that all queries are going to hit a single node.  I haven't yet found a good answer for this.


Additionally, and IMHO, the ugliest part of DIY wide-row indexing is from a client perspective.  In our implementation, we've done our best to be language agnostic on the client-side, allowing people to pick the best tool for the job to interact with the data in Cassandra.  With that mentality, the DIY indexes present some trouble.  Wide-rows often use composite keys (imagine if you had an idx_state_zip, which would allow you to query by state then zip).  Although there is "native" support for composite keys, all of the client libraries implement their own version of them (Hector, Astyanax, and Thrift).   This means that client needing to query data needs to have the added logic to first query the index, and additionally all clients need to construct the composite key in the same manner.

Making It Better...

For this very reason, we've decided to release two open source projects that help push this logic to the server-side.  The first project is Cassandra-Triggers.  This allows you to attached asynchronous activities to writes in Cassandra.  (one such activity could be indexing)  We've also released Cassandra-Indexing.  This is hot off the presses and is still in its infancy (e.g. it only supports UT8Types in the index), but the intent is to provide a generic server-side mechanism that indexes data as its written to Cassandra.  Employing the same server-side technique we used in Cassandra-Indexing, you simply configure the columns you want indexed, and the AOP code does the rest as you write to the target CF.  As always, questions, comments and thoughts are welcome. (especially if I'm off-base somewhere)
Database

Published at DZone with permission of Brian O' Neill, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Manual Sharding in PostgreSQL: A Step-by-Step Implementation Guide
  • Why Database Migrations Take Months and How to Speed Them Up
  • Unmasking Entity-Based Data Masking: Best Practices 2025
  • How Trustworthy Is Big Data?

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!