Alexey Ragozin is a passionate professional in the areas of Java performance, distributed systems, and in-memory data grids. For the last five years he has worked on various performance-critical Java systems (usually involving data grids) in finance, telecom, and e-commerce. See http://blog.ragozin.info for a list of his articles.

How to Tame Java GC Pauses: Surviving 16 GiB Heaps and Greater

06.28.2011

Memory is cheap and abundant on modern servers. Unfortunately, there is a serious obstacle to using these memory resources to their full potential in Java programs: garbage collector pauses are a serious threat for a JVM with a large heap. There are very few good sources of practical information about Java GC tuning, and they unfortunately seem to be relevant only for 512 MiB - 2 GiB heap sizes. Recently I have spent a good amount of time investigating the performance of various JVMs with a 32 GiB heap. In this article I would like to provide practical guidelines for tuning the HotSpot JVM for large heap sizes.

You may also want to look at two articles explaining particular aspects of HotSpot collectors in more detail: “Understanding GC pauses in JVM, HotSpot's minor GC” and “Understanding GC pauses in JVM, HotSpot's CMS collector”.

Target application domain

GC tuning is very application specific. This article is about tuning GC for Java applications that use large heaps (10 GiB and greater) and have a strict response-time SLA (a few milliseconds). My practical experience is mostly with data-grid software, but these tips should also apply to other applications with the following characteristics:

  • Heap is used to store data structures in memory.
  • Heap size 10GiB and more.
  • Request execution time is small (up to dozens of milliseconds).
  • Transactions are short (up to hundreds of milliseconds). Transaction may include several requests.
  • Data in memory is modified slowly (i.e. we do not modify the whole 10 GiB heap within one second, though updating 10 MiB of heap data per second is fine).
I have been working with Java GC implementations for a long time. The advice in this article comes from my practical experience. GC economics for a 2 GiB heap and a 10 GiB heap are totally different; keep that in mind while reading.

Economy of garbage collection

Garbage collection algorithms can be either compacting or non-compacting. Compacting algorithms relocate objects in the heap to reclaim unused memory, while non-compacting algorithms manage fragmented heap space. For both kinds of algorithms, the effort required to reclaim a unit of free space is proportional to the ratio of live objects to garbage. In other words, the more memory we can spare for garbage (waste), the more efficient the collector algorithm will be (in terms of work effort, not necessarily GC pauses).

Normally we cannot afford to have more than half of memory wasted as garbage, nor do we want more than half of our CPU horsepower busy with memory cleaning. In general, the CPU efficiency of a garbage collector is inversely proportional to its memory efficiency.

The solution to this Gordian knot lies in the “weak generational hypothesis”, which postulates that:

  • Most objects become garbage shortly after creation (they die young).
  • The number of references from “old” objects to “young” objects is small.
That means that if we use different collection algorithms for young and old objects, we can achieve better efficiency than with a single-algorithm approach. Using different algorithms requires splitting the heap into two spaces. The space for young objects will have a lower density of live objects, so garbage will be collected efficiently (to compensate for the high death rate). The space for old objects will have a higher density and, in most cases, a larger size; this way we can waste less memory on garbage. The cost of memory reclamation in the old space will be higher, but it is compensated by the lower death rate. In a generationally organized heap, new objects are always allocated in the young space, then promoted to an older space if they survive (one or several young-space collections). The most popular approach is to use three generational spaces:
  • allocation space (eden) - the youngest objects,
  • survivor space - objects that have survived one collection (the survivor space is collected together with the allocation space),
  • tenured (old) space - objects that have survived several young collections.

Object demography

Below is a chart showing an example of the demography we could expect from our class of application.

This chart has several critical points. The most important is the period of young-space collection. The object-lifetime distribution has a peak, and it is important to ensure that all these short-lived objects are collected in the young space. We can control the period of young GC, so it is an important aspect of tuning. Another key point is the period of old GC, which depends on how much memory we can waste on garbage in the old space (i.e. we are not going to tune the old GC period to improve the demography chart). Objects with lifetimes between these two points are mid-aged garbage. The criterion of good demography is to keep Ryoung >> Rold >> Rmid_aged (where R is the death rate in the corresponding lifetime range).

The shape of the demography can be improved by tuning young collections (size of the young space, size of the survivor spaces, tenuring threshold). The period of old-space collection is dictated by total heap size and the death rate of long-lived objects, so it can be considered a constant as far as the demography shape is concerned.

Fortunately, this is quite a natural distribution for a server-type application that does not execute long transactions. But if the length of transactions exceeds the period of young collection, they will start contributing to the mid-aged range, which is bad for GC performance. Another threat to GC efficiency is a bad caching strategy, which produces a large amount of mid-aged garbage.

Confirming demography of application

Before starting to tune GC, we must ensure that our application's demography is in good shape. Before doing any GC measurement, you should come up with a test scenario that puts a constant load on the application (performance tests are usually a good choice). The instructions below are for the HotSpot JVM. If you use another JVM, you can still do these measurements using HotSpot (HotSpot is the only JVM that displays the exact demography of objects in the young space).

You should configure GC for “demographic research”. For this, make the survivor spaces the same size as Eden (-XX:SurvivorRatio=1) and increase the new space to account for the growth of the survivor spaces (-XX:MaxNewSize=<n> -XX:NewSize=<n>); tripling your young space size is a good rule of thumb. Finally, enable the diagnostic options (-XX:+PrintGCDetails -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps).
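Putting those flags together, a research-mode launch might look like this (the heap sizes and application jar are placeholders for illustration, not values from the article):

```shell
java -Xmx12g -Xms12g \
     -XX:SurvivorRatio=1 \
     -XX:NewSize=1536m -XX:MaxNewSize=1536m \
     -XX:+PrintGCDetails \
     -XX:+PrintTenuringDistribution \
     -XX:+PrintGCTimeStamps \
     -jar my-app.jar
```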

In this research configuration, your application's logs will be filled with JVM GC diagnostic frames like this:
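A typical frame produced by -XX:+PrintTenuringDistribution looks roughly like this (the numbers are illustrative and the exact layout varies between JVM versions):

```text
2521.104: [GC 2521.104: [ParNew
Desired survivor size 805306368 bytes, new threshold 15 (max 15)
- age   1:  268901576 bytes,  268901576 total
- age   2:   21764416 bytes,  290665992 total
- age   3:   19204096 bytes,  309870088 total
- age   4:   18998960 bytes,  328869048 total
...
```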

Skip some initial logging (while your application is loading data and stabilizing). From the logs you can calculate several important GC metrics. The second column shows the cumulative size of each generation of objects (generations are divided by young-collection events). You should see fairly similar sizes across generations. If the first one or two generations are significantly larger than older generations, your young collections are too frequent: you have to increase the Eden size to increase the period between collections. If generation sizes do not stabilize, this indicates a problem with your application's demography.

Young collection period. Simply calculate the time between timestamps.

Total allocation rate. Size of Eden + size of survivor space – size of all ages (the bottom value in the rightmost column), divided by the young collection period.

Short-lived object allocation rate. Eden size (young space size – 2 × survivor space size) – size of age 1 reported by the collector, divided by the collection period.

Long-lived object allocation rate. Size of the oldest age, divided by the collection period.

Mid-aged object allocation rate. Total allocation rate – short-lived object allocation rate – long-lived object allocation rate.

Having these numbers, you can verify the health of your demography.
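The arithmetic of the four rates above can be sketched in a few lines of Java (the class name and the sample numbers in main are my own, for illustration only; substitute values parsed from your own GC log):

```java
// Demography metrics from one PrintTenuringDistribution frame.
// Sizes are in bytes, the period in seconds; formulas follow the article.
public class DemographyMetrics {

    /** Total allocation rate: (eden + survivor - all ages total) / period. */
    static double totalRate(double eden, double survivor, double allAges, double period) {
        return (eden + survivor - allAges) / period;
    }

    /** Short-lived rate: (eden - size of age 1) / period. */
    static double shortLivedRate(double eden, double age1, double period) {
        return (eden - age1) / period;
    }

    /** Long-lived rate: size of the oldest age / period. */
    static double longLivedRate(double oldestAge, double period) {
        return oldestAge / period;
    }

    public static void main(String[] args) {
        double period = 10.0; // seconds between young collections (illustrative)
        double total = totalRate(512e6, 256e6, 96e6, period);
        double shortLived = shortLivedRate(512e6, 64e6, period);
        double longLived = longLivedRate(8e6, period);
        double midAged = total - shortLived - longLived; // the remainder
        System.out.printf("total=%.1fMB/s short=%.1fMB/s long=%.1fMB/s mid=%.1fMB/s%n",
                total / 1e6, shortLived / 1e6, longLived / 1e6, midAged / 1e6);
        // prints: total=67.2MB/s short=44.8MB/s long=0.8MB/s mid=21.6MB/s
    }
}
```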

Scalability of different GC algorithms

My experience tells me that HotSpot's CMS is the most robust GC for 10-30 GiB heaps (30 GiB is my practical limit for a single JVM so far). Unlike its main competitors, such as HotSpot's G1 and JRockit, HotSpot's CMS is not compacting: CMS does not relocate objects in memory. Theoretically this makes it prone to fragmentation, but in my experience compacting algorithms have even more problems with fragmentation in practice, while CMS is smart enough to manage fragmented memory efficiently (e.g. it uses different free lists for different object sizes). Once again, we are speaking about >10 GiB heap sizes; while I would agree that for small heaps fragmentation can be an issue, for big heaps it has never been a problem for me.

Another argument against compacting collectors is that copying an object is not just a waste of CPU cycles to move data and update reference sites; copying can only be done during a stop-the-world pause. For a compacting collector, the sum of pauses is proportional to the amount of reclaimed memory (which is bad math). In practice, we have to waste more memory on garbage in the old space to keep the efficiency of a compacting collector reasonable.

Azul's Zing JVM can do copying without stopping application threads, but it uses a hardware read barrier and cannot run on a regular OS (though it can run as a virtual appliance on top of commodity hardware).

Unlike compacting collectors, CMS can work effectively even if heap density is very high (~80%), with pauses defined mostly by the geometry of the heap (at 10 GiB and more, young and remark pause times are dominated by the time spent scanning the card table).

CMS collector tuning for large heap

Pauses in CMS

CMS does most of its work in parallel with the application, though a few types of pauses cannot be avoided. If CMS is enabled, the JVM will experience the following types of pauses:

  • Young space collection – for large heaps this pause is dominated by the time to scan the card table, which is proportional to the size of the old space.
  • Initial mark pause – if the initial mark is done right after a young collection, its time doesn't depend on heap size at all.
  • Remark – this pause is also dominated by the time to scan the card table.

As you can see, there is not much we can do: card table scan time will grow as we increase the heap size. Below is a checklist for CMS collector tuning:

  • Check that the demography of your application is a good fit for generational GC.
  • Choose young to old object promotion strategy.
  • Choose size of young space.
  • Configure CMS for short pauses.
  • Choose old space size.
  • Configure parallel GC options.

Check demography

Check your demography; if it doesn't fit the “generational hypothesis”, you may consider fixing your application. A typical problem is mixing, in a single JVM, the processing of online transactions (requiring low response time) with batch processing that produces mid-aged garbage (and for batch operations GC pauses are not that critical). If this is your case, you should split your application into separate JVMs for online transactions and batches.

Caching of object and garbage collection

One potential threat to GC-friendly demography is a bad caching strategy. Application-level caching can be an unanticipated source of middle-aged garbage. The following changes in caching strategy can solve GC problems caused by application caching:

  • add expiry to the cache, and keep young GC periods a few times longer than the cache expiry (or delay tenuring of objects),
  • increase young GC periods,
  • use weak references for the cache (though using weak references may cause other problems; you should test such a solution carefully),
  • reduce the cache size (it may be a lesser evil than GC problems),
  • increase the cache size (this will reduce object eviction from the cache, and thus the death rate and GC problems; of course, you will have to pay for this with additional memory).
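As one illustration of the weak/soft-reference option, here is a minimal cache sketch (the class name and structure are my own, not from the article; soft references make entries collectible under memory pressure instead of letting them age into mid-life garbage):

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal soft-reference cache sketch. Not a complete cache (no stats,
// no explicit expiry); as the article warns, test such a change carefully.
public class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();

    public void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) return null;
        V value = ref.get();
        if (value == null) map.remove(key); // referent was collected, drop entry
        return value;
    }
}
```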

Object promotion strategy

The JVM allocates all objects in the young space; later, surviving objects are either relocated to the old space or moved within the young space at each young collection. Copying objects inside the young space increases the length of the young collection pause. For our type of application, two practical choices are:

  • copy objects to the old space during the first collection,
  • copy objects to the old space during the second collection.

The first option will leak a few short-lived objects into the old space, but if the period of young collection is much longer than the average short-lived object's lifetime, the number of such objects will be very small. Using the second option (waiting until the second collection) may make sense if young collections are very frequent or the number of long-lived objects is very small (so we do not spend much time copying them inside the young space).
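In HotSpot, these two strategies map onto the tenuring threshold (a sketch; validate the chosen value against your own tenuring distribution logs):

```shell
# promote objects at the first young collection
-XX:MaxTenuringThreshold=0

# let objects survive one copy; promote at the second collection
-XX:MaxTenuringThreshold=1
```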

Balance size of young space

For large heaps, the main contributors to young collection pauses are the time to scan the card table (Tcard_scan) and the time to copy live objects (Tcopy). Tcard_scan is fixed for your heap size, but Tcopy is proportional to the size of the young space. Making the young space large will make pauses less frequent (which is good), but may increase pause length by increasing Tcopy. You should find the balance for your application. Unfortunately Tcard_scan cannot be reduced; it is your lower bound for pause time. If you increase the young space, you should also increase the total heap size of the JVM; otherwise the size of your old space will be reduced. Remember, you are not using the young space for your data, but for garbage only. You should always size the old space to be large enough to hold all your application data plus some amount of floating garbage.

Configure CMS for short pauses

Use the advice from the articles mentioned above to configure CMS for low pauses.

  • Set –XX:CMSWaitDuration=<t> to at least twice the maximum interval between young collection pauses (usually this interval increases when the application is not loaded; keep that in mind).
  • Set –XX:+CMSScavengeBeforeRemark to avoid Eden scanning during the remark pause (on large heaps this flag makes pauses more stable at the price of a small increase in pause time, though for small heaps it may not be the best option).
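Combined, a low-pause CMS configuration along these lines might look as follows (the 60-second wait duration is a placeholder; derive it from your observed young-collection interval):

```shell
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSWaitDuration=60000
-XX:+CMSScavengeBeforeRemark
```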

Providing head room for CMS to work

You should also reserve more memory for the JVM than the size of the live objects expected in the heap (jmap is an excellent tool for measuring the space required by your application's objects). As I said, CMS can handle dense heaps pretty well; in my experience, 20% headroom is enough for CMS. But remember that you should calculate that 20% from the old space size (Xmx – XX:MaxNewSize). The actual required headroom depends on the death rate of old objects in your application; if it is high, you may need more headroom for CMS to work.
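For example, you can measure live data size with a live histogram from jmap (note this triggers a full GC, so do it on a test node; <pid> is your JVM's process id):

```shell
jmap -histo:live <pid> | tail -1   # the final line totals instances and bytes
```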

Utilize all your cores

HotSpot has an effective parallel implementation of most GC phases. Usually, specifying the –server flag is enough for the JVM to choose good defaults. In certain cases you may want to reduce the number of parallel GC threads (by default it equals the number of cores), e.g. if GC is affecting other applications on the box. But usually –server is good enough.
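If you do need to cap GC parallelism, the thread count can be set explicitly (the value 4 here is just an example for a shared 8-core box):

```shell
-server
-XX:ParallelGCThreads=4
```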

Conclusion

Using the advice above and HotSpot's CMS, I was able to keep GC pauses on a 32 GiB Oracle Coherence storage node below 150 ms on an 8-core server. Though this may be good enough, modern garbage collectors have a lot of room for improvement. My experiments with JRockit have shown that it can achieve better young collection pauses than HotSpot, but its compacting algorithm for the old space makes it unstable on a 32 GiB JVM (pauses during old collections are just unreasonable). Unfortunately, JRockit has no option to use a non-compacting collector. HotSpot's G1 also has potential, but it is prone to the same problem as JRockit: sporadically, pause times become unreasonably long (a few seconds). Also, both JRockit (gencon) and HotSpot's G1 require much more headroom in the old space (read: wasted memory). So at the moment, CMS is the only algorithm providing stable performance on a 32 GiB heap.


Published at DZone with permission of Alexey Ragozin, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Varunkumar Mall... replied on Tue, 2011/06/28 - 3:47am

It is very rare to find articles in niche topics like GC pauses. Thanks a lot for posting these.

Manuel Jordan replied on Tue, 2011/06/28 - 9:22am

Really valuable, thank you!

Jacek Furmankiewicz replied on Tue, 2011/06/28 - 9:40am

Could you do a similar evaluation for JRockit, which is now free and supposed to handle large heaps easier?

Alexey Ragozin replied on Tue, 2011/06/28 - 1:26pm in response to: Jacek Furmankiewicz

Hi,

I have done a lot of testing with JRockit and HotSpot G1. I will probably share my findings eventually but so far I have yet to learn how to cook them right.

Speaking of JRockit, it has much better pause times for young collections compared to HotSpot, but object relocation pauses are just too long (CMS does not relocate objects at all, so it has the upper hand here).

Jonathan Fisher replied on Tue, 2011/06/28 - 3:01pm

Anyone tried this in practice? http://www.azulsystems.com/products/zing/whatisit It's supposedly a pauseless-collection JVM.

Peter Levart replied on Tue, 2011/06/28 - 3:17pm

We have a mostly batch-processing app that uses 28GiB of heap (for Coherence cache) and are running on Niagara SPARC chips (8 cores / 8 threads per core). We are using parallel scavenging collector for young generation and parallel compacting collector for old generation employing 32 threads for GC:

-XX:+DisableExplicitGC   // important to prevent program induced collections from various ill-behaved libraries
-XX:+UseParallelGC
-XX:ParallelGCThreads=32
-XX:+UseParallelOldGC

We reduced heap usage quite a bit by employing 64bit -> 32bit object pointer compression:

-XX:+UseCompressedOops

Our experience shows that young generation pauses are short enough to be tolerable and old generation pauses, though quite long (30sec+) only happen once or twice a day. We manage to clean most of garbage with young gen. scavenging.

I never tried CMS because I had the impression that it is not suitable for such large heaps. Now that I see others using it successfully for 30GiB heaps, I'll try it too. I doubt that it can increase application throughput, but it might help eliminate those few long pauses we experience once or twice a day.

Question: Isn't CMS an old-generation collector only, which is used in combination with a parallel or single-threaded stop-the-world young-generation scavenging collector?

Thank you for a nice article.

Alexey Ragozin replied on Tue, 2011/06/28 - 4:43pm in response to: Peter Levart

Hi Peter,

CMS can use either the serial or the parallel young space collector:

-XX:+UseParNewGC - will force using parallel one,

You can also tweak thread usage with the following options (though -server has smart enough defaults, so I rarely touch these):

‑XX:+CMSConcurrentMTEnabled – allows CMS to use multiple cores for the concurrent phase.
‑XX:ConcGCThreads=<n> – specifies the number of threads for concurrent phases.
‑XX:ParallelGCThreads=<n> – specifies the number of threads for parallel work during stop-the-world pauses (by default it equals the number of physical cores).

The link mentioned above, http://blog.griddynamics.com/2011/06/understanding-gc-pauses-in-jvm-hotspots_02.html, is specifically about tuning CMS.

My research was focused on low-pause use cases. I have never compared CMS with ParallelOldGC on a batch type of workload. Theoretically, a compacting collector like ParallelOldGC puts much more effort into reclaiming free memory, so CMS may have a serious advantage here. I would say that if most of your objects survive several old collections, CMS should show better results, but if object lifetime in the old space is roughly the same as the old GC interval, ParallelOldGC may show a better cumulative result.

Anyway I believe it is worth trying.

Regards,

 

Keith Barret replied on Thu, 2011/06/30 - 5:05am

Thanks a lot for this very excellent article!

We recently merged multiple Tomcat installations (2GB each) into one large JVM (10GB) and soon experienced stability issues with the parallel GC, causing SIGSEGVs in oop_follow_contents and PSMarkSweepDecorator::precompact. We'll try to follow your recommendation and switch over to CMS, which seems to be the more stable and resource-efficient GC strategy.

 

Peter Levart replied on Thu, 2011/06/30 - 6:29am

Thanks, Alexey, for pointing me to the link explaining CMS in detail. I couldn't find such an in-depth explanation of the various CMS parameters in any of the Sun/Oracle official documents.

Parwinder Sekhon replied on Sat, 2011/07/02 - 3:23am

>>but Tcopy will be proportional to size of young space. Making young space large will make pause less frequent (which is good), but may increase pause length by increasing T

Strictly speaking, Tcopy is proportional to the number of live objects left in the young space at minor GC time. So with a young heap of 64M, if at every minor GC 100K of objects are still alive, the minor GC will be much, much faster than where, say, 30M of objects are still alive. I have found that increasing the newsize can help reduce the frequency of minor GCs without increasing the minor GC time. However, this is only the case in applications/systems where object lifetimes are extremely short; where the app has fairly variable object lifetimes, increasing the newsize can really increase the minor GC time.

Alexey Ragozin replied on Wed, 2011/07/06 - 6:21am in response to: Parwinder Sekhon

You are right, but I'm also right.

Asymptotically, if the young collection period is long enough, most objects to be copied would be long-lived objects, and their number would be proportional to the period between GCs. I didn't state this assumption clearly, though.

If the young space is smaller (or the lifetime of short-lived garbage is longer), short-lived objects will dominate copy time, and increasing the period between collections will actually reduce the number of copied objects.

 

It seems that I have to review the article and elaborate on that kind of assumption to avoid confusion.

 

Thank you, for your comment

Vlad Rodionov replied on Thu, 2011/07/14 - 3:57pm

Yep, a lot of voodoo magic. Just several questions: have you run load/performance testing on your Coherence 32G box? How many requests per second can it sustain during a several-hour run? Mixed requests (insert/update/delete/get)? What is the eviction algorithm in your setup? Do you do eviction at all? What is the cache size (in items? in bytes?)

You can do voodoo or you can go other way (w/o any voodoo):

Koda benchmark

 

-Vladimir Rodionov

 

 

Sirikant Noori replied on Sun, 2012/01/15 - 12:15pm

Very good list. The main problem with JVM options is that there are a lot of them and you can't remember them all, so having some of them written down is a good idea.

Jian Jin replied on Mon, 2012/04/09 - 8:54pm

Is the way to calculate demography correct?

I assume the total allocation rate should be Eden divided by the young collection period,

and the long-lived object allocation rate should be based on objects promoted to the old space that live long enough; I think we should use the major GC's stats for the long-lived and mid-aged calculations.

 

Ricky Clarkson replied on Sat, 2012/07/07 - 6:33pm

I really like the content, but I have to say that I find this difficult to read.  There are only two instances of the word 'the' in the entire article.

Varun Tyagi replied on Fri, 2014/01/10 - 3:40pm in response to: Alexey Ragozin

 Excellent Stuff!!

I need your help with one of our mobile/smartphone applications; it has the GC parameters stated below. My prod server OS is RHEL 6.2, RAM is 10GB, with a 2-core processor, and only one instance runs on this box.
The young GC collection times seem to be OK, but full GC takes 1.7 seconds on average with a 4 GB heap. I tried Parallel Old GC but it made things worse. I can't even go for the CMS GC, as the machine has only 2 CPU cores and CMS GC adds more CPU cycles.

I am really struggling with this; could you please suggest what options I should use?

-Xmx4096M -XX:PermSize=128M -XX:MaxPermSize=128M -XX:+HeapDumpOnOutOfMemoryError -verbose:gc -Xloggc:logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps
