
Pierre-Hugues Charbonneau (nickname P-H) has been working for CGI Inc. Canada for the last 10 years as a senior IT consultant and system architect. His primary areas of expertise are Java EE, middleware, and JVM technologies. He is a specialist in production system troubleshooting, middleware and JVM tuning, and scalability / capacity improvement, including process improvement for Java EE support teams. Pierre-Hugues is a DZone MVB and is not an employee of DZone.

OutOfMemoryError – Oracle Service Bus Alerts

12.28.2012

This case study describes the complete steps from root cause analysis to resolution of a sudden OutOfMemoryError problem experienced with Oracle Service Bus 2.6 running on Weblogic Server 9.2.

This case study will also demonstrate how to combine Thread Dump analysis and Solaris PRSTAT to pinpoint the culprit Thread(s) consuming a large amount of memory on your application Java Heap.

Environment specifications
-  Java EE server: Oracle Service Bus 2.6 & Weblogic Server 9.2
-  OS: Solaris 10
-  JDK: Sun HotSpot 1.5 SR7
-  RDBMS: Oracle 10g
-  Platform type: Enterprise Service Bus

Troubleshooting tools
-  Quest Foglight for Java EE
-  Java VM Thread Dump (for stuck Thread analysis)
-  Solaris PRSTAT (for Thread CPU contributors breakdown)

Problem overview
-  Problem type: OutOfMemoryError

An OutOfMemoryError problem was observed in our Oracle Service Bus Weblogic logs along with major performance degradation.

A preventive restart was then implemented, since a Java Heap memory leak was initially suspected, but this did not prove to be effective.

Gathering and validation of facts
As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:
·  What is the client impact? HIGH
·  Recent change of the affected platform? No
·  Any recent traffic increase to the affected platform? No
·  How long has this problem been observed? For several months, with no identified root cause
·  Is the OutOfMemoryError happening suddenly or over time? It was observed that the Java Heap increases suddenly and can then take several hours before failing with an OutOfMemoryError
·  Did a restart of the Weblogic server resolve the problem? No, a preventive restart was attempted but did not prove to be effective

-  Conclusion #1: The sudden increase of the Java Heap could indicate long running Thread(s) consuming a large amount of memory in a short amount of time
-  Conclusion #2: A Java Heap leak cannot be ruled out at this point

Java Heap monitoring using Foglight
The Quest Foglight monitoring tool was used to monitor the trend of the Oracle Service Bus Java Heap for each of the managed servers. A sudden increase of the Java Heap OldGen space was observed.

This type of memory problem normally does not indicate a linear or long-term memory leak but rather rogue Thread(s) consuming a large amount of memory in a short amount of time. In this scenario, my recommendation is to first perform Thread Dump analysis before jumping too quickly to JVM Heap Dump analysis.
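
If a commercial monitoring tool such as Foglight is not available, the same Java Heap trend can be tracked with the standard java.lang.management API (available since JDK 5). The listing below is only a minimal sketch of that idea, not part of the original investigation: it prints the usage of each heap memory pool at a fixed interval so a sudden OldGen jump can be spotted. It assumes the code runs inside the monitored JVM (or is exposed through remote JMX), and pool names such as "PS Old Gen" depend on the garbage collector in use.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Minimal heap pool monitoring sketch (assumption: JDK 5+ management API available).
public class HeapPoolMonitor {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                MemoryUsage usage = pool.getUsage();
                if (usage == null) {
                    continue; // pool no longer valid
                }
                // getMax() may return -1 when the maximum is undefined for a pool
                System.out.printf("%-20s used=%d MB committed=%d MB max=%d MB%n",
                        pool.getName(),
                        usage.getUsed() / (1024 * 1024),
                        usage.getCommitted() / (1024 * 1024),
                        usage.getMax() / (1024 * 1024));
            }
            Thread.sleep(30000L); // sample every 30 seconds
        }
    }
}
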
Thread Dump analysis
Thread Dump snapshots were generated during the sudden Java Heap increase, and analysis revealed a primary culprit, as you can see below.

Analysis of the different Thread Dump snapshots revealed a very long elapsed time for this Thread, along with a perfect correlation with our Java Heap increase. As you can see, this Thread is triggered by the ALSB alerting layer and is involved in read/write operations on the Weblogic internal diagnostic File Store.

 "pool-2-thread-1" prio=10 tid=0x01952650 nid=0x4c in Object.wait() [0x537fe000..0x537ff8f0]
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:474)
        at weblogic.common.CompletionRequest.getResult(CompletionRequest.java:109)
        - locked <0xf06ca710> (a weblogic.common.CompletionRequest)
        at weblogic.diagnostics.archive.wlstore.PersistentStoreDataArchive.readRecord(PersistentStoreDataArchive.java:657)
        at weblogic.diagnostics.archive.wlstore.PersistentStoreDataArchive.readRecord(PersistentStoreDataArchive.java:630)
        at weblogic.diagnostics.archive.wlstore.PersistentRecordIterator.fill(PersistentRecordIterator.java:87)
        at weblogic.diagnostics.archive.RecordIterator.fetchMore(RecordIterator.java:157)
        at weblogic.diagnostics.archive.RecordIterator.hasNext(RecordIterator.java:130)
        at com.bea.wli.monitoring.alert.AlertLoggerImpl.getAlerts(AlertLoggerImpl.java:157)
        at com.bea.wli.monitoring.alert.AlertLoggerImpl.updateAlertEvalTime(AlertLoggerImpl.java:140)
        at com.bea.wli.monitoring.alert.AlertLog.updateAlertEvalTime(AlertLog.java:248)
        at com.bea.wli.monitoring.alert.AlertManager._evaluateSingleRule(AlertManager.java:992)
        at com.bea.wli.monitoring.alert.AlertManager.intervalCompleted(AlertManager.java:809)
        at com.bea.wli.monitoring.alert.AlertEvalTask.execute(AlertEvalTask.java:65)
        at com.bea.wli.monitoring.alert.AlertEvalTask.run(AlertEvalTask.java:37)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)
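
Thread Dump snapshots like the one above are typically captured with kill -3 <pid> or the JDK's jstack utility. As an alternative (an assumption on my part, not the method used in this case study), the sketch below shows how similar snapshots can be captured programmatically through the JDK 5+ ThreadMXBean API. Note that this API exposes the JVM thread id, not the native id (nid) printed by HotSpot, so kill -3 / jstack output remains preferable when you need to correlate with PRSTAT.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Programmatic thread dump sketch (assumption: run inside the target JVM).
public class ProgrammaticThreadDump {
    public static void dump() {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        long[] ids = threadBean.getAllThreadIds();
        ThreadInfo[] infos = threadBean.getThreadInfo(ids, Integer.MAX_VALUE);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // thread terminated between the two calls
            }
            System.out.println("\"" + info.getThreadName()
                    + "\" id=" + info.getThreadId()
                    + " state=" + info.getThreadState());
            for (StackTraceElement element : info.getStackTrace()) {
                System.out.println("        at " + element);
            }
            System.out.println();
        }
    }

    public static void main(String[] args) {
        dump();
    }
}
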

Solaris PRSTAT
Solaris PRSTAT was also captured in order to understand the amount of CPU consumed and determine if this Thread was stuck or performing a lot of processing.

As you can see below, Thread id #76 was identified as the primary CPU contributor with 8.6 %.

prstat -L -p <Java Pid>

  PID USERNAME  SIZE  RSS STATE  PRI NICE  TIME  CPU PROCESS/LWPID
  9116 bea  3109M 2592M sleep  59  0  21:52:59 8.6% java/76
  9116 bea  3109M 2592M sleep  57  0  4:28:05 0.3% java/40
  9116 bea  3109M 2592M sleep  59  0  6:52:02 0.3% java/10774
  9116 bea  3109M 2592M sleep  59  0  6:50:00 0.3% java/84
  9116 bea  3109M 2592M sleep  58  0  4:27:20 0.2% java/43
  9116 bea  3109M 2592M sleep  59  0  7:39:42 0.2% java/41287
  9116 bea  3109M 2592M sleep  59  0  3:41:44 0.2% java/30503
  9116 bea  3109M 2592M sleep  59  0  5:48:32 0.2% java/36116
  9116 bea  3109M 2592M sleep  59  0  6:15:52 0.2% java/36118
  9116 bea  3109M 2592M sleep  59  0  2:44:02 0.2% java/36128
  9116 bea  3109M 2592M sleep  59  0  5:53:50 0.1% java/36111
  9116 bea  3109M 2592M sleep  59  0  4:27:55 0.1% java/55
  9116 bea  3109M 2592M sleep  59  0  9:51:19 0.1% java/23479
  9116 bea  3109M 2592M sleep  59  0  4:57:33 0.1% java/36569
  9116 bea  3109M 2592M sleep  59  0  9:51:08 0.1% java/23477
  9116 bea  3109M 2592M sleep  59  0  10:15:13 0.1% java/4339
  9116 bea  3109M 2592M sleep  59  0  10:13:37 0.1% java/4331
  9116 bea  3109M 2592M sleep  59  0  4:58:37 0.1% java/36571
  9116 bea  3109M 2592M sleep  59  0  3:13:46 0.1% java/41285
  9116 bea  3109M 2592M sleep  59  0  4:27:32 0.1% java/48
  9116 bea  3109M 2592M sleep  59  0  5:25:28 0.1% java/30733

Thread Dump and PRSTAT correlation approach
In order to correlate the PRSTAT data, please follow the steps below:

-  Identify the primary culprit / contributor Thread(s) and locate the Thread Id (decimal format) from the last column
-  Convert the decimal value to hexadecimal, since the Thread Dump native Thread id (nid) is in hexadecimal format
-  Search your captured Thread Dump data and locate the Thread native Id >> nid=<Thread Id in hex>

For our case study:
-  Thread Id #76 was identified as the primary culprit
-  Thread Id #76 was converted to hexadecimal format >> 0x4c (see the conversion sketch below)
-  Thread Dump data Thread native Id >> nid=0x4c
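
The decimal-to-hexadecimal conversion can be done with any calculator; as a tiny illustrative sketch (a hypothetical helper, not part of the case study), the same step expressed in Java looks like this:

// Converts the decimal LWP id reported by prstat (last column, e.g. java/76)
// into the hexadecimal form used by the HotSpot thread dump "nid" field.
public class LwpIdToNid {
    public static String toNid(int lwpId) {
        return "0x" + Integer.toHexString(lwpId);
    }

    public static void main(String[] args) {
        // 76 is the culprit LWP id from this case study; 76 decimal == 0x4c hex
        System.out.println(toNid(76)); // prints 0x4c
    }
}
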

This correlation confirmed that the rogue Thread was actively processing (reading from and writing to the File Store) and allocating a large amount of memory on the Java Heap in a short amount of time.

Root cause
The above analysis, along with the findings from Mark Smith's Oracle Blog article, confirmed the root cause.

Oracle Service Bus allows Alert actions to be configured within the message flow (pipeline alerts). These pipeline alert actions generate alerts, based on the message context in a pipeline, to send to an alert destination. Such an alert destination is the actual Weblogic diagnostic File Store, which means this structure will grow over time depending on the volume of Alerts that your OSB application is generating.

It is located under >> <domain_name>/servers/<server_name>/data/store/diagnostics/

In our case, the File Store size was around 800 MB.

Such growth of the diagnostic File Store over time leads to an increase in the elapsed time of the Thread involved in the read/write operations, which allocates a large amount of memory on the Java Heap. That memory cannot be garbage collected until Thread completion, which leads to OutOfMemoryError and performance degradation.
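
To illustrate why such memory cannot be reclaimed before Thread completion, here is a simplified, hypothetical sketch (not the actual OSB code) of the pattern described above: a single long-running task accumulates records from a large store in a local list, keeping everything strongly reachable from its own stack until it finishes.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a long-running reader whose local references
// prevent garbage collection until the task completes.
public class LongRunningReaderSketch implements Runnable {
    private static final int RECORDS_IN_STORE = 10000000; // hypothetical store size

    public void run() {
        List<byte[]> recordsReadSoFar = new ArrayList<byte[]>();
        for (int i = 0; i < RECORDS_IN_STORE; i++) {
            // each "record" read from the store stays referenced by this thread's stack
            recordsReadSoFar.add(new byte[1024]);
        }
        // only here, once the work completes, does this memory become collectable;
        // if the store is large enough, the Java Heap fills up first and an
        // OutOfMemoryError is thrown
        System.out.println("Read " + recordsReadSoFar.size() + " records");
    }

    public static void main(String[] args) {
        new Thread(new LongRunningReaderSketch(), "pool-2-thread-1-sketch").start();
    }
}
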

Solution
Two actions were required to resolve this problem:

-  Reset the diagnostic File Store by renaming the existing data file and forcing Weblogic to create a fresh one
-  Review and reduce the Oracle Service Bus alerting level to the minimum acceptable level

The reset of the diagnostic File Store brought some immediate relief by ensuring short and optimal diagnostic File Store operations with minimal impact on the Java Heap.

The level of OSB alerting is still under review and will be reduced shortly in order to prevent this problem at the source.

Regular monitoring and purging of the diagnostic File Store will also be implemented in order to prevent such performance degradation going forward, as per Oracle recommendations.
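
As a starting point for such monitoring, the sketch below (an assumption, not an Oracle-provided tool) sums the size of the files under the diagnostic store directory and warns when it crosses an arbitrary threshold; the path and threshold are examples that must be adapted to your own domain.

import java.io.File;

// Diagnostic store size check sketch; path and threshold are hypothetical examples.
public class DiagnosticStoreSizeCheck {
    private static final long WARN_THRESHOLD_BYTES = 200L * 1024 * 1024; // example: 200 MB

    public static long directorySize(File dir) {
        long total = 0;
        File[] children = dir.listFiles();
        if (children == null) {
            return 0; // not a directory or not readable
        }
        for (File child : children) {
            total += child.isDirectory() ? directorySize(child) : child.length();
        }
        return total;
    }

    public static void main(String[] args) {
        // hypothetical default path; replace with your own
        // <domain_name>/servers/<server_name>/data/store/diagnostics/
        File storeDir = new File(args.length > 0 ? args[0]
                : "/opt/bea/domains/osb_domain/servers/osb_server1/data/store/diagnostics");
        long size = directorySize(storeDir);
        System.out.println("Diagnostic store size: " + (size / (1024 * 1024)) + " MB");
        if (size > WARN_THRESHOLD_BYTES) {
            System.out.println("WARNING: diagnostic store exceeds threshold, consider purging/reset");
        }
    }
}
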

Conclusion and recommendations

I hope this article has helped you understand how to identify and resolve this Oracle Service Bus problem and appreciate how powerful Thread Dump analysis and the Solaris PRSTAT tool can be in pinpointing this type of OutOfMemoryError problem.

Please do not hesitate to post any comment or question.



Published at DZone with permission of Pierre-Hugues Charbonneau, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)