Thursday, December 6, 2012

Java Garbage Collector magic in action (or how to improve your java code using jconsole and jmap)

In my experience with Java, it has not been obvious how to get from monitoring GC and fancy memory graphs with tools like jconsole to actually improving your code.

The ingredient I had been missing was jmap, which is part of the JDK.

The tool lets you attach to a live java process by its process id (pid) and output a histogram of its objects. Here is how it works:


jmap -histo 22170 > histo_22170.log

In the example, 22170 is the java process pid and the command line option -histo makes jmap output a histogram of objects. One nice thing about jmap is that it allows you to build an object histogram on a JVM OutOfMemory crash (see some details here).

Before any bug fixing was done to the code (more on this in a moment), the first lines of histo_22170.log looked like this:

 num     #instances         #bytes  class name
----------------------------------------------
   1:      88210219     2943436216  [B
   2:       5198407      455421864  [[B
   3:       5162015      123888360  com.mysql.jdbc.ByteArrayRow
   4:       2005702       95264296  [C
   5:       2006883       64220256  java.lang.String
   6:         37819       53525280  [I
   7:        309332       44543808  com.mysql.jdbc.Field
   8:        923280       36931200  java.util.TreeMap$Entry
   9:         18744       25864528  [Ljava.lang.Object;
  10:        471843       15098976  java.util.HashMap$Entry
  11:         36577        5195248  [Ljava.util.HashMap$Entry;
  12:         18196        4512608  com.mysql.jdbc.JDBC4PreparedStatement
  13:         18196        3202496  com.mysql.jdbc.JDBC4ResultSet
  14:         54308        2606784  java.util.TreeMap
  15:         12809        1903312  
  16:         36571        1755408  java.util.HashMap
  17:         12809        1750136  
  19:         18196        1164544  com.mysql.jdbc.PreparedStatement$ParseInfo

The relevant parts of the histogram are the JDBC-related rows (the com.mysql.jdbc classes). The java process was doing a heavy-duty task over thousands of files and talking to the MySQL DB in a loop to load some meta-information for each file. The process was given a 4GB max heap size and was not finishing properly: it produced an OutOfMemory error that in turn crashed the JVM.

The code that produced this looked like this:

PreparedStatement sqlStatement = sqlConnection.prepareStatement(
                          "SELECT * FROM SOME_TBL WHERE SOME_ID=?");
for (int i = 0; i < some_number_less_than_100; i++) {
    sqlStatement.setString(1, companyIds.get(i));
    ResultSet sqlResult = sqlStatement.executeQuery();
    if (sqlResult != null) {
        while (sqlResult.next()) {
            // do some processing of the query results here
        }
    }
}

By now you probably sense that something is wrong with how the code manages its JDBC objects.

Let's look at the JDBC-related entries near the top of the object histogram. The classes that stand out are com.mysql.jdbc.Field with 309332 instances, com.mysql.jdbc.JDBC4PreparedStatement with 18196 instances and com.mysql.jdbc.JDBC4ResultSet with 18196 instances. The latter two have exactly the same number of instances, which is reflected in our code, where both objects are re-created in a loop. The visual monitoring tool jconsole showed constantly growing RAM usage and the Eden Heap Space saturated with lots of young objects, while the Survivor Heap Space was not trending at all.

What's missing is releasing the JDBC resources by calling the close() method on both the PreparedStatement and the ResultSet.
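Picking the interesting rows out of a long histogram by eye gets tedious, so here is a minimal sketch of filtering jmap -histo output by class-name prefix. The class HistoFilter and its methods are made up for illustration; it assumes rows shaped like the ones above (rank, instance count, byte count, class name) and skips anything else, including rows whose class name is empty.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class HistoFilter {

    /** One parsed row of the histogram. */
    static final class Row {
        final long instances;
        final long bytes;
        final String className;

        Row(long instances, long bytes, String className) {
            this.instances = instances;
            this.bytes = bytes;
            this.className = className;
        }
    }

    /** Parses rows like "  12:   18196   4512608  com.mysql.jdbc.JDBC4PreparedStatement". */
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            // Skip the header, the separator and rows without a class name.
            if (parts.length < 4 || !parts[0].endsWith(":")) {
                continue;
            }
            try {
                rows.add(new Row(Long.parseLong(parts[1]), Long.parseLong(parts[2]), parts[3]));
            } catch (NumberFormatException e) {
                // Not a data row after all; ignore it.
            }
        }
        return rows;
    }

    /** Keeps only the rows whose class name starts with the given prefix. */
    static List<Row> withPrefix(List<Row> rows, String prefix) {
        List<Row> matching = new ArrayList<>();
        for (Row row : rows) {
            if (row.className.startsWith(prefix)) {
                matching.add(row);
            }
        }
        return matching;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java HistoFilter <histo-file> <class-prefix>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (Row row : withPrefix(parse(lines), args[1])) {
            System.out.printf("%12d %14d  %s%n", row.instances, row.bytes, row.className);
        }
    }
}
```

Running it as java HistoFilter histo_22170.log com.mysql.jdbc. would list only the MySQL driver classes, which makes it easy to compare their instance counts between two dumps.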

So let's correct the code:

PreparedStatement sqlStatement = sqlConnection.prepareStatement(
                           "SELECT * FROM SOME_TBL WHERE SOME_ID=?");
for (int i = 0; i < some_number_less_than_100; i++) {
    sqlStatement.setString(1, companyIds.get(i));
    ResultSet sqlResult = sqlStatement.executeQuery();
    if (sqlResult != null) {
        while (sqlResult.next()) {
            // do some processing of query results here
        }
        // missing line added: release the result set after each query
        sqlResult.close();
    }
}
// missing line added: release the statement once the loop is done
sqlStatement.close();
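
One caveat about explicit close() calls: if the processing inside the loop throws an exception, they are skipped and the resources stay open. On Java 7 and later, try-with-resources closes them automatically, in reverse order of creation, whether or not an exception is thrown. Here is a minimal sketch of the same loop in that style. The class JdbcCleanup and the method countRows are made up for illustration, and counting rows merely stands in for the real processing; the query is the placeholder from the snippet above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

public class JdbcCleanup {

    /**
     * Runs the query once per company id and returns the total number of rows seen.
     * The statement and every result set are closed even if an exception is thrown.
     */
    static int countRows(Connection sqlConnection, List<String> companyIds) throws SQLException {
        int rows = 0;
        try (PreparedStatement sqlStatement = sqlConnection.prepareStatement(
                "SELECT * FROM SOME_TBL WHERE SOME_ID=?")) {
            for (String companyId : companyIds) {
                sqlStatement.setString(1, companyId);
                // A fresh result set per query, closed at the end of each iteration.
                try (ResultSet sqlResult = sqlStatement.executeQuery()) {
                    while (sqlResult.next()) {
                        rows++; // stand-in for the real processing of the query results
                    }
                }
            }
        }
        return rows;
    }
}
```

The connection itself is left open on purpose, since in the original code it is created and owned elsewhere.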

After the two missing statements were added (sqlResult.close() and sqlStatement.close()), the DB resources started to be released properly and the original process began to work as expected, without big spikes in RAM usage. The JDBC-related objects have also disappeared from the top of the histogram:

 num     #instances         #bytes  class name
----------------------------------------------
   1:       1301417       66169256  [C
   2:       1333656       42676992  java.lang.String
   3:        158410       29981072  [I
   4:        382702       12246464  java.util.HashMap$Entry
   5:        116080        5668216  [B
   6:        230584        5534016  java.lang.StringBuffer
   7:         56760        3632640  java.util.regex.Matcher
   8:         18320        2620552 
   9:         18320        2501360 
  10:           588        2187960  [Ljava.util.HashMap$Entry;
  11:          1460        1733744 
  12:         33570        1523672 
  13:         19495        1247680  java.util.regex.Pattern
  14:         19729        1201136  [Ljava.lang.Object;
  15:          1460        1130688 
  16:         19477        1090712  [Ljava.util.regex.Pattern$GroupHead;
  17:          1312        1080992 
  18:         19096         614976  [Ljava.lang.String;
  19:         18642         596544  java.util.RandomAccessSubList
  20:         18642         596544  java.util.AbstractList$ListItr

Now the process is happily completing with reasonable RAM usage:


The diagram shows that the Eden Heap Space now holds far fewer young objects and the Survivor Heap Space is utilized more. See here, if you want more details on the various pools of Heap and Non-Heap memory.

Interestingly enough, this bug had been hiding in the code base for months and only manifested itself once more data had to be processed. This made the process run longer and thus reach and overflow the allocated RAM bounds.

This trivial example shows the importance of monitoring your heavy (and not so heavy) java processes.
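
The pools that jconsole plots can also be read from inside the process through the standard java.lang.management API. Here is a minimal sketch; the class name MemoryPools is made up for illustration, and the pool names you get back (Eden, Survivor and so on) depend on the JVM and the garbage collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

public class MemoryPools {

    /** Returns one line per memory pool: name, type (heap or non-heap) and current usage. */
    static List<String> describePools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            lines.add(String.format("%-32s %-16s used=%d KB committed=%d KB",
                    pool.getName(), pool.getType(),
                    usage.getUsed() / 1024, usage.getCommitted() / 1024));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describePools()) {
            System.out.println(line);
        }
    }
}
```

Logging these lines periodically from a long-running job gives a rough text version of the jconsole graphs, which can help when no GUI is available on the machine.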

Happy monitoring!

Wednesday, June 6, 2012

Berlin Buzzwords 2012: impressions

This year I had a unique chance to participate in the Berlin Buzzwords conference for the first time. In brief, it is the event where search, store and scale people come together to exchange recent ideas and developments in the area. I must say that the level of the conference simply amazed me: the quality of the presentations and the maturity of the audience were clearly well matched.

Urania building, the venue


For me, as a Solr / Lucene user and developer, it was especially fun to meet in person people I had previously seen only on the mailing lists or in video talks on the Internet. These include, in my case, Otis Gospodnetić, Uwe Schindler, Simon Willnauer, Robert Muir, Grant Ingersoll, Ted Dunning and Rafał Kuć. There were also new folks I had not heard of before whose presentations inspired me, like Alex Lloyd from Google and Markus Weimer from Microsoft (oops, GOOG and MSFT in the same sentence). I also got to see the Sematext guys in action at their SPM booth.


Opening session kicks in


The wi-fi worked everywhere, which is unusual compared to other conferences. Still, I left my laptop at the hotel in order to force myself to do three things: 1) actually listen to the presenter and ask questions via the mic or in person; 2) occasionally take pictures; 3) network during the coffee breaks.

First day's keynote session by Leslie Hawthorn


As a result, I took a fair number of pictures; felt less distracted and tired at the end of each day; asked questions from the audience (and probably got recorded on video) and many more in person; and networked with leaders in their areas to get a real sense of how things are going in their communities. So this is to say that in the end, what mattered to me was the people and not only the technologies they talked about.

Eric Evans' presentation


Some observations (probably more interesting to the conference organizers), pros and cons mixed:
1) The personal badge could have the name on both sides, for two reasons: it tends to flip so that the name isn't visible, and the map on the other side was useless because it was easy to learn where each auditorium was.
2) The food was great, and the free beer / ice-cream / snacks from sponsors were an awesome addition.
3) The small auditoriums tended to be super-packed, while the only big one was sparsely filled (excluding the opening and closing sessions). Could this be addressed somehow next year?
4) The 20-minute talks were a surprise for presenters who expected to have 40 minutes. The usual result was presenters getting to the core of their presentation slowly and no time left for questions from the audience.
5) The party on Monday evening and the cute small surprises from wooga on the bus seats were cool!

There was also some time left to explore the beautiful city of Berlin and of course eat Schnitzel!







Thanks to the whole #bbuzz team for an excellent experience. I hope to come back next year!
yours truly,

Saturday, June 2, 2012

(first?) virtual presentation at the Dialogue conference

I just participated in Dialogue'12, one of the biggest Russian conferences, which brings together theoretical and applied linguists. This time I couldn't be there in person, so @vporoshin and I decided to try out some modern technology. The setup was pretty simple: Skype from Finland to Russia, played through speakers into a microphone connected to an amplifier. We also added a photo of myself for some "physical" presence. The conference organizers appreciated the use of new technology for presenting scientific papers. Here is the presentation (no author's photo there, you had to be present at the conference to see it):


Tuesday, May 8, 2012

Paper on rule-based sentiment accepted!

My paper on rule-based sentiment was accepted to Dialog'2012, in the special section on ROMIP'2011. Last year ROMIP had a track on 2-way and 3-way sentiment classification of texts in Russian. Our team, with @vporoshin, had three major systems:

1. A rule-based system, described in the paper.
2. A modified multinomial Naive Bayes (MNB) classifier trained on unigrams and bigrams.
3. A classifier ensemble of the two above.

The rule-based approach relies largely on a pre-crafted polarity dictionary. This means that it knows only those polarity word sequences that are in the dictionary. The MNB classifier, in contrast, learns such sequences from the training set. There are other differences as well. MNB is in a way a bag-of-words approach, but it may work surprisingly well: in 2-way classification it showed an accuracy of 90+% for one of the domains. The rule-based algorithm has interesting linguistic features, like object-oriented sentiment detection. Although ROMIP's sentiment tracks did not require object-oriented detection this first time, the test data had an object name (e.g. a movie title or a product name) attributed to each text to classify. Object-oriented and general sentiment detection performed equally well and above 50% (i.e. above the accuracy of a coin-tossing method). The overall accuracy of the general rule-based classification is 63%, with 92% precision for the positive class. This generally means that more polarity words should be mined for the negative class and the existing negative polarity dictionary should be revised (some words could be of positive or ambiguous polarity).
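
To make the dictionary idea concrete, here is a toy sketch of dictionary-based polarity scoring. It is not the system from the paper: the class PolaritySketch, the English dictionary entries and the scoring rule (sum of matched polarities, two-word sequences preferred over single words) are all assumptions made for illustration. It does show the limitation mentioned above, namely that text without any dictionary entries gets no polarity at all.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PolaritySketch {

    // Hypothetical polarity dictionary: word sequence -> +1 (positive) or -1 (negative).
    static final Map<String, Integer> DICTIONARY = new HashMap<>();
    static {
        DICTIONARY.put("good", 1);
        DICTIONARY.put("brilliant", 1);
        DICTIONARY.put("not good", -1);
        DICTIONARY.put("boring", -1);
    }

    /** Sums the polarities of all dictionary entries found in the text. */
    static int score(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        int score = 0;
        int i = 0;
        while (i < tokens.length) {
            // Prefer a two-word sequence so that "not good" is not read as "good".
            if (i + 1 < tokens.length) {
                Integer pair = DICTIONARY.get(tokens[i] + " " + tokens[i + 1]);
                if (pair != null) {
                    score += pair;
                    i += 2;
                    continue;
                }
            }
            Integer single = DICTIONARY.get(tokens[i]);
            if (single != null) {
                score += single;
            }
            i++;
        }
        return score;
    }

    /** 3-way decision: positive, negative, or neutral when nothing tips the balance. */
    static String classify(String text) {
        int s = score(text);
        if (s > 0) {
            return "positive";
        }
        return s < 0 ? "negative" : "neutral";
    }

    public static void main(String[] args) {
        System.out.println(classify("A good movie"));
        System.out.println(classify("The plot was not good"));
        System.out.println(classify("A film about trains"));
    }
}
```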

Some more numbers in the paper:

Sunday, March 18, 2012

Scientific agenda of this year

This year looks promising in terms of scientific happenings. First of all, I participated in the ROMIP contest on sentiment analysis. It was intense and interesting to dive into the annotated and test data. More on this later, once the information is ready.

On another note, this year's step up was being accepted to the committee list of the Second International Symposium on Business Modeling and Software Design (http://www.is-bmsd.org/). The research topics include, but are not limited to, the following:

BUSINESS MODELS AND REQUIREMENTS
- Business Analysis - Value Models and Process Models
- Essential Business Models
- Re-usable Business Models
- Relating Business Goals to Requirements
- Business Process Coordination
- Business Entities and Business Roles
- Business Data and Semantics
- Business Rules
- Behavior Modeling and Pragmatics
- Identification and Elicitation of Requirements
- Domain-imposed and User-defined Requirements
- Requirements Analysis

BUSINESS MODELS AND SERVICES
- Business Modeling and Service Science
- Relating Business Goals to the Identification of Services
- Service Modeling - Technology-independent and Platform-specific
- Business Rules and Service Composition
- Autonomic Service Behavior
- Context-aware Service Behavior
- Re-usable Service Models

BUSINESS MODELS AND SOFTWARE
- Business Modeling-driven Derivation of Software
- Business Innovation and Software Evolution
- Business-IT Alignment and Traceability
- Re-usable Business Models and Software Components
- Business Rules and Software Specification
- Business Goals and Software Integration
- Autonomic and Context-aware Business/Software Systems

INFORMATION SYSTEMS ARCHITECTURES
- Enterprise Architectures
- Service-Oriented Architectures
- Architectural Styles
- Architectural Viewpoints
- Crosscutting Concerns

Monday, January 16, 2012

My experience with airBaltic

UPD: Please read the entire post. I will not remove the original story line, because this is exactly what happened. However, airBaltic contacted me by phone themselves and told me about the positive resolution of the case. Please read on.

Original story:
-----
First of all, I would like to assure you that I'm not the best at blaming, meaning I simply don't like doing it publicly. It's probably unfair to only publicly blame an air operator and never praise them. But that's how it works. A happy customer doesn't compile an entire blog post about how cool it was to fly with a certain operator. It's the "if I'm happy, I stay silent" principle. But believe me, if the case I'm about to tell you about gets resolved positively, I won't hesitate to blog about it.

Here is the case. My wife and I planned a 3-day trip from Helsinki to Moscow and back. Using skyscanner we found the cheapest option: fly with airBaltic via Riga. A quick survey among friends, all good, settled. I went online and started my ticket search last Saturday. The cheapest option was to depart at 16:25. I chose that and prepared to pay. After double-checking the dates and times it struck me: nope, no good, the departure was set to 8:25. I cancelled the search and started all over. First question here: is it bad luck or a bad system? You choose.
I re-ran my search, all was good, and I paid 531 euros. But when I printed the travel receipt, this time it REALLY struck me: the return flight date was set to one month later! More bad luck or a system fault? This time I'm inclined to choose the second option. No problem, I called the Finnish office. "On the weekends the office is closed". I decided to call first thing next Monday morning. This was my first mistake and I admit it: I should have called Latvia and paid around a euro and a half per minute to change the date.

Calling first thing Monday morning: a young lady's voice, and I described the problem to her. She refused to change the date without an additional fee. I asked her to connect me to her manager. After a couple of minutes (yeah-yeah, the customer comes first), the manager's voice: teaching and preaching how I should have used their system. "On every page of the multi-page ticket booking process, you can look to your right and check the departure and return flight dates and times." All right, thanks! But look, I attempted to explain the system fault: "Unless you really travel back in one month, there is no way to choose that different month without extra steps. Nor does the system suggest the best return flights within a one-month period!" This was simply noise to her and she continued teaching me how to use the system. I asked her to put me through to her manager / director. Guess what the answer was: "This is not possible". "Why?". "I'm sorry, but this is not possible." As a customer, I'm pretty sure I can talk to almost any employee of the company who is in the customer relations line. This time it is your fault, airBaltic.

"I would like to change the return flight date back to what it should be". But the lady lectured me once more: "This is only possible if you pay 150 euros. You should have called us on Saturday and explained the problem." Which I did! And the Finnish office was closed. Is this really my problem now? I doubt it. Because it is YOUR REPRESENTATIVE, airBaltic! What if I didn't have an opportunity to call abroad (yes, Riga is abroad to me) and pay for an international call? And if that didn't work, make sure to take my call on Monday morning seriously, attend to it and make an exception or a goodwill gesture. What on Earth does this "if you called after two days, it cannot be changed without a fee" policy mean? Do you want to keep a customer or lose one? What do you lose by changing a date that is a month away? Afraid of not finding any customer during an entire month?

"And if I cancel the entire trip, what sum can be refunded?" "You get 76 euros back". Excellent.

Without further ramblings, I would like to publicly thank airBaltic for the 531 euros' worth of "stay at home and don't travel with us" service. It has really taught me not to use your services. Ever.
Everything seems to be mortal in this world, and airBaltic's service will die as well. But with this type of "friendly" customer service and policies you only bring the end closer.

Good luck and enjoy 531-76 euros for not taking us where we wanted.
-----

UPDATE to the story: airBaltic continues working on the case. Here is what they posted on twitter: @DmitryKan Dmitry, your case is not closed. Please give us a bit more time and colleagues will come back to you.

UPDATE 2: The case has been resolved. I received a call from airBaltic, in which they said that the return flight date had been changed without an extra fee. I don't know whether it was a result of my social media activity since yesterday evening, but airBaltic's service was extremely fast and accurate this time. Since all the posts I have made on the Internet about airBaltic link here, the people who land here will read these updates as well. Thank you, airBaltic.