DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Related

  • Effective Java Collection Framework: Best Practices and Tips
  • JQueue: A Library to Implement the Outbox Pattern
  • Troubleshooting Memory Leaks With Heap Profilers
  • Python Memo 2: Dictionary vs. Set

Trending

  • Introduction to Retrieval Augmented Generation (RAG)
  • Useful System Table Queries in Relational Databases
  • Memory Leak Due to Time-Taking finalize() Method
  • Building a Real-Time Audio Transcription System With OpenAI’s Realtime API
  1. DZone
  2. Data Engineering
  3. Data
  4. BigList: a Scalable High-Performance List for Java

BigList: a Scalable High-Performance List for Java

By 
Thomas Mauch user avatar
Thomas Mauch
·
Nov. 03, 14 · Interview
Likes (8)
Comment
Save
Tweet
Share
32.3K Views

Join the DZone community and get the full member experience.

Join For Free

As memory gets cheaper and cheaper, our applications can keep more data readily available in main memory, or even all as in case of in-memory databases. To make real use of the growing heap memory, appropriate data structures must be used. Interesting enough, there seem to be no specialized implementations for lists - by far the most used collection.

This article introduces BigList, a list designed for handling large collections where large means that all data still fit completely in the heap memory. The article will show the special requirements for handling large collections, how BigList is implemented and how it compares to other list implementations.

1. Requirements

What are the special requirements we need to handle large collections efficiently?

  • Memory:
  • Sparing use of memory: The list should need little memory for its own implementation so memory can be used for storing application data.
  • Specialized versions for primitives: It must be possible to store common primitives like ints in a memory saving way.
  • Avoid copying large data blocks: If the list grows or shrinks, only a small part of the data must be copied around, as this operation becomes expensive and needs the same amount of memory again.
  • Data sharing: copying collections is a frequent operation which should be efficiently possible even if the collection is large. An efficient implementation requires some sort of data sharing as copying all elements is per se a costly operation.
  • Performance:
  • Good performance for normal operations like reading, storing, adding or removing single elements.
  • Great performance for bulk operations like adding or removing multiple elements.
  • Predictable overhead of operations, so similar operations should need a similar amount of time without excessive worst case scenarios.

If an implementation does not offer these features, some operations will not only be slow for really large collections, but will becomse just not feasible because memory or CPU usage will be too exhaustive.

Introduction to BigList

BigList is a member of the Brownies Collections library which also includes GapList, the fastest list implementation known. GapList is a drop-in replacement for ArrayList, LinkedList, or ArrayDequeue and offers fast access by index and fast insertion/removal at the beginning and at the end at the same time.

GapList however has not been designed to cope with large collections, so adding or removing elements can make it necessary to copy a lot of elements around which will lead to performance problems.

Also copying a large collection becomes an expensive operation, both in term of time and memory consumption. It will simply not be possible to make a copy of a large collections if not the same amount of memory is available a second time. And this is a common operation as you often want to return a copy of an internal list through your API which has no reference the original list.

BigList addresses both problems. The first problem is solved by storing the collection elements in fixed size blocks. Add or remove operations are then implemented to use only data from one block. The copying problem is solved by maintaining a reference count on the fixed size blocks which allows to implement a copy-on-write approach. For efficient access to the fixed size blocks, they are maintained in a specialized tree structure.

2. BigList Details

Each BigList instance stores the following information:

  • Elements are stored in in fixed-size blocks
  • A single block is implemented as GapList with a reference count for sharing
  • All blocks are maintained in a tree for fast access
  • Access information for the current block is cached for better performance

The following illustration shows these details for two instances of BigList which share one block.

2.1 Use of Blocks

Elements are stored in in fixed-size blocks with a default block size of 1000. Where this default may look pretty small, it is most of the time a good choice because it guarantees that write operation only need to move few elements. Read operations will profit from locality of reference by using the currently cached block to be fast.

It is however possible to specify the block size for each created BigList instance. All blocks except the first are allocated with this fixed size and will not grow or shrink. The first block will grow to the specified block size to save memory for small lists.

If a block has reached its maximum size and more data must be stored within, the block needs to be split up in two blocks before more elements can be stored. If elements are added to the head or tail of the list, the block will only be filled up to a threshold of 95%. This allows inserts into the block without the immediate need for split operations.

To save memory, blocks are also merged. This happens automatically if two adjacent blocks are both filled less than 35% after a remove operation.

2.2 Locality of Reference

For each operation on BigList, the affected block must be determined first. As predicted by locality of reference, most of the time the affected block will be the same as for the last operation.

The implementation of BigList has therefore been designed to profit from locality of reference which makes common operations like iterating over a list very efficient.

Instead of always traversing the block tree to determine the block needed for an operation, lower and upper index of the last used block are cached. So if the next operation happens near to the previous one, the same block can be used again without need to traverse the tree.

2.3 Reference Counting

To support a copy-on-write approach, BigList stores a reference count for each fixed size blocks indicating whether this block is private or shared.

Initially all lists are private having a reference count of 0, so that modification are allowed. If a list is copied, the reference count is incremented which prohibits further modifications. Before a modification then can be made, the block must be copied decrementing the block's reference count and setting the reference count of the copy to 0. The reference count of a block is then decremented by the finalizer of BigList.

3. Benchmarks

To prove the excellence of BigList in time and memory consumption, we compare it with some other List implementations. And here are the nominees:

Type

Library

Description

BigList

brownie-collections

List optimized for storing large number of elements. Elements stored in fixed size blocks which are maintained in a tree.

GapList

brownie-collections

Fastest list implementation known. Fast access by index and fast insertion/removal at end and at beginning.

ArrayList

JDK

Maintains elements in a single array. Fast access by index, fast insertion/removal at end, but slow at beginning.

LinkedList

JDK

Elements stored using a linked list. Slow access by index. Memory overhead for each element stored.

TreeList

commons-collections

Elements stored in a tree. All operations are not really fast, but there are no very slow operations. Memory overhead for each element stored.

FastTable

javolution

Elements stored in a "fractal"-like data structure. Good performance and use of memory. However no bulk operations and collection does not shrink.


3.1 Handling Objects

In the first part of the benchmark, we compare memory consumption and performance of the different list implementations.

Let's first have a look at the memory consumption. The following table shows the bytes used to hold a list with 1'000'000 null elements:

BigList

GapList

ArrayList

LinkedList

TreeList

FastTable

32 bit

4'298'466

4'021'296

4'861'992

8'000'028

18'000'028

4'142'892

64 bit

8'544'254

8'042'552

9'723'964

16'000'044

26'000'044

8'222'988

We can see that BigList, GapList, ArrayList, and FastTable only add small overhead to the stored elements, where as Linkedlist needs twice the memory and TreeList even more.

Now to the performance. Here are the results of 9 benchmarks which have been run for each of the 6 candidates with JDK 8 in a 32 bit Windows environment and a list of 1'000'000 elements:

The result table can be read as follows:

  • the fastest candidate for each test has a relative performance indicator of 1
  • the value for the other candidates indicate how many times they have been slower, so a factor of 3 means that this implementation was 3 times slower than the best one
  • The different factor are colored like this:
    • 
factor 1: green (best)

    • factor <5: blue (good)
    • 
factor <25: yellow (moderate)
    • 
factor >25: red (poor)

If we look the benchmark result, we can see that the performance of BigList is best for all expect two benchmarks. The only moderate result is produces in getting elements in a totally random order. This could be expected as there is no locality of reference which can be exploited, so for each access, the block tree must be traversed to find the correct block.

Luckily this is a rare use case in real applications. And the benchmark "Get local" shows that performance is back to good as soon as elements next to each other must be retrieved - as it is the case if we iterate over a range.

3.2 Handling Primitives

In the second part of the benchmark, we want see how big the savings are if we use a data structure specialized for storing primitives compared to strong wrapped objects. For this reason, we compare IntBigList and BigList<Integer>.

The following table shows memory needed to store 1'000'000 integer values:

BigList<Integer>

IntBigList

32 bit

16'298'454

4'534'840

64 bit

28'544'234

4'570'432

Obviously it is easy to save a lot of memory. In a 32 bit environment, IntBigList just needs 25% percent of memory, in a 64 bit environment only 14%! These figures become plausible if you recall that a simple object needs 8 bytes in a 32 bit, but already 16 bytes in a 64 bit environment, where as a primitive integer value always only needs 4 bytes.

The measurable performance gain is not so impressive, it is something below 10% for simple get operations and something above 10% for add and remove operations. These numbers show that the JVM is impressively fast in creating wrapper objects and boxing and unboxing primitive values. We must however also consider that each created object will need to be garbage collected once and therefore adds to the total load of the JVM.

4. Summary

BigList is a scalable high-performance list for storing large collections. Its design guarantees that all operations will be predictable and efficient both in term of performance and memory consumption, even copying large collections is tremendous fast.

Benchmarks haven proven this and shown that BigList outperform other known list implementations.

The library also offers specialized implementations for primitive types like IntBigList which save much memory and provide superior performance.

BigList for handling objects and the specializations for handling primitives are part of the Brownies Collections library and can be downloaded from http://www.magicwerk.org/collections.

Blocks Memory (storage engine) Data structure Element Java (programming language) Locality of reference Database Implementation Data (computing)

Opinions expressed by DZone contributors are their own.

Related

  • Effective Java Collection Framework: Best Practices and Tips
  • JQueue: A Library to Implement the Outbox Pattern
  • Troubleshooting Memory Leaks With Heap Profilers
  • Python Memo 2: Dictionary vs. Set

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!