Monday, September 23, 2013

Fixing X issues with jconsole and jvisualvm under Ubuntu

This is merely a technical post describing how to solve issues with running the aforementioned JDK tools on Ubuntu servers that have no X server installed.

It sometimes happens that you need to run X-based apps on Ubuntu servers that have no graphical libraries and no GUI installed.

An easy way to check whether your Ubuntu server is missing any libraries is to run the command recommended on stackoverflow.com:



jvisualvm -J-Dnetbeans.logger.console=true


This command will output the names of shared libraries that are required to run the command but are missing. For example, these libraries could be: libxrender1, libxtst6, libxi6. Their names may also be printed with .so suffixes, like so: libXrender.so.1.
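
If a library is reported only by its .so name, one way to map it back to the Ubuntu package that provides it (assuming the apt-file utility is installed and its index has been updated) is:

# find out which package ships the missing shared object
apt-file search libXrender.so.1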

In order to install them you can run:



sudo apt-get install libxrender1
sudo apt-get install libxtst6
sudo apt-get install libxi6


After installing a library, keep re-running the jvisualvm command above to see whether any libraries are still missing.

If all libraries are in place, jvisualvm should start, provided you have connected to your Ubuntu server with the ssh -X command-line parameter, which forwards the graphical application's GUI to your client machine.
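
For example (the host name below is just a placeholder), connect with X forwarding enabled and launch the tool from that shell:

ssh -X user@your_ubuntu_server
jvisualvm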

Have fun monitoring!


Saturday, September 7, 2013

Solr usability contest: make Apache Solr even cooler!

In August I took part in the Solr Usability Contest run by Alexandre Rafalovitch, author of the Apache Solr for Indexing Data How-to book from Packt.

As I have already told Alexandre (@arafalov), it was a great idea to launch a contest like this. While the Solr / Lucene mailing lists serve as a direct way of solving particular problems and the Apache JIRA is the place for feature requests and bug reports, it is great to sometimes take a step back and look at the larger perspective of features, limitations, possible improvements and so on.

The following three are the winning suggestions from yours truly:

 

On atomic updates

It so happened that we had been evaluating the new, sexy-sounding atomic updates feature and found out that it wasn't easy to enable. To actually make use of the feature, we essentially would have needed to make *all* fields stored. For the past several years I have been "fighting" against storing fields unless really necessary: one of the common performance suggestions is to avoid storing fields where possible, which in turn helps to avoid extra disk seeks, and not everyone has SSDs installed on their Solr servers.

Having written some Lucene-level code a while ago, I could make a wild guess as to why all fields need to be stored for atomic updates. Essentially, upon an atomic update Solr has to retrieve the existing document with all its source values, update a value (or values) and push the document back into persistent storage (the index). To my taste this describes a bulk update. There is one major advantage of atomic updates (given that all fields are stored): saving on network traffic. Indeed, instead of submitting an entire document with a couple of updated fields, you can send only the fields with new values and provide a document id. There are other cool features, like setting a previously non-existent field on a document or deleting an existing field. All of these will surely make the atomic updates feature appealing to some folks. You will find real examples of how to use the atomic updates feature in Alexandre's book, so go and get your copy now.
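
To illustrate the traffic savings, here is a rough sketch of what an atomic update can look like in Solr 4 (the core name collection1, the document id and the field names are made up for this example, and the update log has to be enabled in solrconfig.xml):

# only the id and the fields to change travel over the network;
# Solr reconstructs the rest of the document from its stored fields
curl 'http://localhost:8983/solr/collection1/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"doc-42", "popularity":{"inc":1}, "tags":{"add":"on-sale"}}]'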

In the course of reindexing our data in Solr 4 we have discovered a lot of improvements, one of them being index compression. The lossless compression algorithm used in Lucene / Solr 4 is LZ4, which has made the index super compact. In our use case the index shrank from 100G to 20G in the Solr 4 vs Solr 3 battle, which is simply amazing. By the way, the algorithm is built for fast decompression (a fast decoder), which makes it an ideal fit for an online algorithm.

In light of the compactness of the index we are still considering evaluating the atomic updates feature, mainly from three perspectives:
  • traffic savings
  • speed of processing
  • index size increase vs storing only necessary fields

On interactivity of Solr dashboard

As we started evaluating the goodies of Solr 4, I was positively surprised by how usable and eye-catching the Solr dashboard (admin UI) has become. Above all is usability (it is, after all, a usability contest) and, oh yes, it has become usable for the first time. Back in the Solr 1.4.1 and 3.4 days we merely consulted the cache statistics page and, occasionally, the analysis page. In Solr 4 one can now administer the cores directly from the dashboard, optimize indices, study the frequency characteristics of text data and so on. This is of course on top of the already mentioned features, like field analysis and monitoring the cache stats.

But.. something is still missing. We are running several shards with frontend Solrs, and for us it has always been a bit of a pain to monitor our cluster. We intentionally do not use SolrCloud because of our requirements for logical sharding. Some time ago I blogged about Solr on FitNesse, which helped to see the state of the cluster with just one click. We have also set up RAM monitoring with graphite, but wait, all of these are tools external to Solr. It would be really great to be able to integrate some of them directly into the Solr dashboard. So we hope this will change towards "plug-n-play" style interfaces that would allow implementing plugins for the Solr dashboard. In the meantime, good ol' jvisualvm is the tool that helps to monitor a heavy shard during soft-commit runs.



On scripting capability

I also dared to fantasize about what could open Solr up to a wider audience, especially people who do not dream of reading and changing the Solr source code. This could be enabled with a scripting capability. By this I mean a way of hacking into Solr via external interfaces in the language that fits your task and skill set best (Ruby or Scala or some other JVM-friendly language, or perhaps something outside the JVM family altogether). The best thing this would offer is an opportunity to experiment quickly with Solr search: changing the runtime order of analyzers or search components, affecting scoring, introducing advertisement entries, calculating some analytics, refining facets and so on. While some of these may sound far-fetched, the feature in general could open up a way of changing Solr core behaviour without heavy-duty source code changes and recompilation (although personally I would recommend diving into that anyway).

 

Concluding remarks

To conclude, Solr 4 has brought lots of compelling features and improvements (the extremely useful soft-commit feature, for example), and we are happy to see this blazingly fast search platform evolve so quickly. With these three usability suggestions I have tried to summarize what could be done to make the platform even more compelling and cool.

yours truly,


Friday, September 6, 2013

Monitoring Solr with graphite and carbon


This blog post requires graphite, carbon and Python to be installed on your *nix box. I'm running this on Ubuntu.

http://graphite.wikidot.com/
https://launchpad.net/graphite/+download


To set up monitoring of the RAM usage of Solr instances (shards) with graphite, you will need two things:

1. backend: carbon
2. frontend: graphite

The data can be pushed to carbon using the simple Python script shown below.

In my local cron I have:

1,6,11,16,21,26,31,36,41,46,51,56 * * * * /home/dmitry/Downloads/graphite-web-0.9.10/examples/update_ram_usage.sh

The shell script is a wrapper that fetches the data from the remote server and pushes it to carbon with the Python script:

scp -i /home/dmitry/keys/somekey.pem \
    user@remote_server:/path/memory.csv /home/dmitry/Downloads/MemoryStats.csv

python /home/dmitry/Downloads/graphite-web-0.9.10/examples/solr_ram_usage.py

An example entry in the MemoryStats.csv:

2013-09-06T07:56:02.000Z,SHARD_NAME,20756,33554432,10893512,32%,15.49%,SOLR/shard_name/tomcat

The command to produce the memory stats on Ubuntu:

COMMAND="ssh user@remote_server pidstat -r -l -C java" | grep /path/to/shard 


The Python script parses the CSV file (you may want to define your own input file format; I'm giving this as an example):

import sys
import time
import datetime
from socket import socket

CARBON_SERVER = '127.0.0.1'
CARBON_PORT = 2003

# how long to sleep after sending (seconds), can be overridden on the command line
delay = 60
if len(sys.argv) > 1:
  delay = int(sys.argv[1])

sock = socket()
try:
  sock.connect((CARBON_SERVER, CARBON_PORT))
except:
  print "Couldn't connect to %(server)s on port %(port)d, is carbon-agent.py running?" % { 'server': CARBON_SERVER, 'port': CARBON_PORT }
  sys.exit(1)

filename = '/home/dmitry/Downloads/MemoryStats.csv'

# read the csv file fetched from the remote server
lines = []
with open(filename, 'r') as f:
  for line in f:
    lines.append(line.strip())

print lines

lines_to_send = []

for line in lines:
  # skip the csv header line
  if line.startswith("Time stamp"):
    continue
  shard = line.split(',')
  # shard[0] is the timestamp, shard[1] is the shard name, shard[5] is RAM usage in percent
  timestamp = int(time.mktime(datetime.datetime.strptime(shard[0], "%Y-%m-%dT%H:%M:%S.%fZ").timetuple()))
  value = shard[5].replace("%", "")
  # carbon's plaintext protocol: "metric_path value timestamp"
  lines_to_send.append("system.%s %s %d" % (shard[1], value, timestamp))

# all lines must end in a newline
message = '\n'.join(lines_to_send) + '\n'
print "sending message\n"
print '-' * 80
print message
print
sock.sendall(message)
time.sleep(delay)
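
Carbon's plaintext protocol is simply "metric_path value timestamp", one data point per line, so you can also push a single test point by hand (assuming netcat is installed) to verify that carbon is listening:

echo "system.SHARD_NAME 32 `date +%s`" | nc -q0 127.0.0.1 2003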

After the data has been pushed you can view it in the graphite web UI. The good thing about graphite, compared to jconsole or jvisualvm, is that it persists the data points, so you can view and analyze them later.
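
Besides browsing the UI, graphite can also render a graph straight from a URL; a minimal sketch, assuming graphite-web is served on localhost, could look like this:

# fetch a PNG with the last day of the shard's RAM usage
curl -o shard_ram.png 'http://localhost/render?target=system.SHARD_NAME&from=-1days'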




For Amazon users, an alternative way of viewing the RAM usage graphs is CloudWatch, although at the time of this writing it only stores two weeks' worth of data.