NoSQL Zone is brought to you in partnership with:

If I had all the money I'd ever need, I would still write software - every day. I love putting together the pieces of intricate software puzzles and creating novel and useful solutions for myself and those around me. When I find a solution to a problem which hasn't been well documented, I try to write about it to save the next poor soul from trudging through the same waters. We're all in this together, after all. Jim is a DZone MVB and is not an employee of DZone and has posted 1 posts at DZone. You can read more from them at their website. View Full User Profile

Understanding HBase and BigTable

05.22.2008
| 38414 views |
  • submit to reddit

multidimensional

Up to this point, we haven't mentioned any concept of "columns", treating the "table" instead as a regular-old hash/map in concept. This is entirely intentional. The word "column" is another loaded word like "table" and "base" which carries the emotional baggage of years of RDBMS experience.

Instead, I find it easier to think about this like a multidimensional map - a map of maps if you will. Adding one dimension to our running JSON example gives us this:

{
"1" : {
"A" : "x",
"B" : "z"
},
"aaaaa" : {
"A" : "y",
"B" : "w"
},
"aaaab" : {
"A" : "world",
"B" : "ocean"
},
"xyz" : {
"A" : "hello",
"B" : "there"
},
"zzzzz" : {
"A" : "woot",
"B" : "1337"
}
}

In the above example, you'll notice now that each key points to a map with exactly two keys: "A" and "B". From here forward, we'll refer to the top-level key/map pair as a "row". Also, in BigTable/Hbase nomenclature, the "A" and "B" mappings would be called "Column Families".

A table's column families are specified when the table is created, and are difficult or impossible to modify later. It can also be expensive to add new column families, so it's a good idea to specify all the ones you'll need up front.

Fortunately, a column family may have any number of columns, denoted by a column "qualifier" or "label". Here's a subset of our JSON example again, this time with the column qualifier dimension built in:

{
// ...
"aaaaa" : {
"A" : {
"foo" : "y",
"bar" : "d"
},
"B" : {
"" : "w"
}
},
"aaaab" : {
"A" : {
"foo" : "world",
"bar" : "domination"
},
"B" : {
"" : "ocean"
}
},
// ...
}

Notice that in the two rows shown, the "A" column family has two columns: "foo" and "bar", and the "B" column family has just one column whose qualifier is the empty string ("").

When asking Hbase/BigTable for data, you must provide the full column name in the form "<family>:<qualifier>". So for example, both rows in the above example have three columns: "A:foo", "A:bar" and "B:".

Note that although the column families are static, the columns themselves are not. Consider this expanded row:


{
// ...
"zzzzz" : {
"A" : {
"catch_phrase" : "woot",
}
}
}

In this case, the "zzzzz" row has exactly one column, "A:catch_phrase". Because each row may have any number of different columns, there's no built-in way to query for a list of all columns in all rows. To get that information, you'd have to do a full table scan. You can however query for a list of all column families since these are immutable (more-or-less).

The final dimension represented in Hbase/BigTable is time. All data is versioned either using an integer timestamp (seconds since the epoch), or another integer of your choice. The client may specify the timestamp when inserting data.

Consider this updated example utilizing arbitrary integral timestamps:

{
// ...
"aaaaa" : {
"A" : {
"foo" : {
15 : "y",
4 : "m"
},
"bar" : {
15 : "d",
}
},
"B" : {
"" : {
6 : "w"
3 : "o"
1 : "w"
}
}
},
// ...
}

Each column family may have its own rules regarding how many versions of a given cell to keep (a cell is identified by its rowkey/column pair) In most cases, applications will simply ask for a given cell's data, without specifying a timestamp. In that common case, Hbase/BigTable will return the most recent version (the one with the highest timestamp) since it stores these in reverse chronological order.

If an application asks for a given row at a given timestamp, Hbase will return cell data where the timestamp is less than or equal to the one provided.

Using our imaginary Hbase table, querying for the row/column of "aaaaa"/"A:foo" will return "y" while querying for the row/column/timestamp of "aaaaa"/"A:foo"/10 will return "m". Querying for a row/column/timestamp of "aaaaa"/"A:foo"/2 will return a null result.

sparse

The last keyword is sparse. As already mentioned, a given row can have any number of columns in each column family, or none at all. The other type of sparseness is row-based gaps, which merely means that there may be gaps between keys.

This, of course, makes perfect sense if you've been thinking about Hbase/BigTable in the map-based terms of this article rather than perceived similar concepts in RDBMS's.

And that's about it

Well, I hope that helps you understand conceptually what the Hbase data model feels like.

As always, I look forward to your thoughts, comments and suggestions.

References
Published at DZone with permission of Jim Wilson, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Rameshkumar Ramasamy replied on Thu, 2009/07/09 - 7:47am

Thanks a lot Jim!

 This article gives me better understating about HBase.

 

Keith Thomas replied on Tue, 2009/08/04 - 10:19am

Thanks for the article, just what I was looking for.

Vijay Bhas replied on Fri, 2012/02/24 - 1:45am

This article helps a lot for a beginner like me.

Thanks,

Vijay

Find command in linux

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.