Finland transmits, that... (blog by Dmitry Kan)<br />
Conferences in 2019 (Dmitry Kan, 2019-12-15)<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="font-family: "arial" , "helvetica" , sans-serif;">As the Christmas and New Year holidays are coming up, I wanted to reflect on two conferences in two different countries, Estonia and Ukraine, that I had the pleasure of participating in and / or organizing this year.</span><br />
<h2 style="text-align: left;">
<span style="font-family: "arial" , "helvetica" , sans-serif; font-size: large;">AINL</span></h2>
<div style="text-align: justify;">
AINL Conference<span style="background-color: white; color: #1c1e21; font-family: , , , ".sfnstext-regular" , sans-serif; font-size: 14px;"> (</span><a href="https://ainlconf.ru/2019/program" rel="noopener nofollow" style="background-color: white; color: #385898; cursor: pointer; font-family: system-ui, -apple-system, system-ui, ".SFNSText-Regular", sans-serif; font-size: 14px; text-decoration-line: none;" target="_blank">https://ainlconf.ru/2019/program</a><span style="background-color: white; color: #1c1e21; font-family: , , , ".sfnstext-regular" , sans-serif; font-size: 14px;">), </span><span style="background-color: white; color: #1c1e21; font-size: 14px;"><span style="font-family: "arial" , "helvetica" , sans-serif;">held in Tartu in November, focused a lot on applying deep learning to NLProc, with two tutorials by <a href="https://github.com/dustalov" target="_blank">Dmitry Ustalov</a> (Yandex) on <a href="https://zenodo.org/record/3510160#.XfZaCdYzZmA" target="_blank">Crowdsourcing on Language Resources and Evaluation</a> and by <a href="https://scholar.google.com/citations?user=Jq4Wq7AAAAAJ" target="_blank">Andrey Kutuzov</a> (University of Oslo) on Diachronic 
Word Embeddings for Semantic Shifts Modelling. Andrey Kutuzov's tutorial was practical and involved some Python coding, resulting in a pull </span></span><span style="background-color: white; color: #1c1e21; font-family: , , , ".sfnstext-regular" , sans-serif; font-size: 14px;">request: </span><a href="https://github.com/wadimiusz/diachrony_for_russian/pull/5">https://github.com/wadimiusz/diachrony_for_russian/pull/5</a> <span style="background-color: white; color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif; font-size: 14px;">that I submitted for the task of comparing semantic shifts in meaning between the Soviet and post-Soviet eras. This code uses Jaccard similarity as a local method for detecting shifts in meaning. There are also global methods, such as Procrustes alignment; its only downside is that it is slower than Jaccard. You can read more about the task in Andrey's AINL <a href="https://github.com/wadimiusz/diachrony_for_russian/blob/master/ainl_slides.pdf" target="_blank">slides</a>.</span></div>
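<div style="text-align: justify;">
To make the local method more concrete, here is a minimal sketch of how Jaccard similarity over nearest-neighbour sets could be computed. It is not the code from the pull request: the toy two-dimensional vectors, the word lists and the function names (<code>nearest_neighbours</code>, <code>neighbour_jaccard</code>) are all made up for illustration, and a real setup would use embeddings trained separately on each period's corpus. A score close to 0 hints that a word's neighbourhood, and so possibly its meaning, has changed.</div>

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(word, vectors, k):
    """Return the set of the k words closest to `word` by cosine similarity."""
    target = vectors[word]
    scored = [(cosine(target, vec), w) for w, vec in vectors.items() if w != word]
    scored.sort(reverse=True)
    return {w for _, w in scored[:k]}


def neighbour_jaccard(word, vectors_old, vectors_new, k=2):
    """Jaccard similarity of the word's neighbour sets in two periods.

    1.0 means identical neighbours, 0.0 means no neighbours in common.
    """
    old = nearest_neighbours(word, vectors_old, k)
    new = nearest_neighbours(word, vectors_new, k)
    union = old | new
    return len(old & new) / len(union) if union else 1.0


# Hypothetical toy embeddings for two periods (illustration only).
soviet = {
    "pioneer": [1.0, 0.0],
    "scout": [0.9, 0.1],
    "camp": [0.8, 0.2],
    "explorer": [0.0, 1.0],
    "space": [0.1, 0.9],
}
post_soviet = {
    "pioneer": [0.0, 1.0],  # moved towards "explorer" and "space"
    "scout": [0.9, 0.1],
    "camp": [0.8, 0.2],
    "explorer": [0.1, 0.9],
    "space": [0.2, 0.8],
}

for word in ("pioneer", "scout"):
    print(word, neighbour_jaccard(word, soviet, post_soviet, k=2))
```

<div style="text-align: justify;">
Because only neighbour sets are compared, the two embedding spaces never need to be aligned with each other, which is one reason a local method like this tends to be cheaper than a global one such as Procrustes alignment.</div>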
<span style="background-color: white; color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif; font-size: 14px;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/--0Tkb_fVjPw/XfZgEn9le9I/AAAAAAAA6l0/V0TZMiQFpnIP-Kn_Ny7FEv-mROegMIrzQCLcBGAsYHQ/s1600/AINL_Dmitry.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://1.bp.blogspot.com/--0Tkb_fVjPw/XfZgEn9le9I/AAAAAAAA6l0/V0TZMiQFpnIP-Kn_Ny7FEv-mROegMIrzQCLcBGAsYHQ/s400/AINL_Dmitry.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Credit: Dmitry Kan</td></tr>
</tbody></table>
<span style="background-color: white; color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif; font-size: 14px;"><br /></span>
<span style="background-color: white; color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif; font-size: 14px;"><br /></span>
<br />
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;">As for the submitted <a href="https://link.springer.com/book/10.1007/978-3-030-34518-1" target="_blank">papers</a>, the review process was double-blind and involved at least 3 reviewers per paper. The result was a 30% acceptance rate: the 12 out of 40 papers that did make it focused on data acquisition and annotation, human-computer interaction, statistical NLProc (including a paper by </span></span><span style="background-color: #f9f9f9; color: #0d0d0d; font-family: "roboto" , "arial" , sans-serif; font-size: 14px; white-space: pre-wrap;"><a href="http://ansis.lv/about.en.php">Ansis Bērziņš</a></span><span style="background-color: white; color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif; font-size: 14px;"> on the use of speech recognition for determining language similarity; see the </span><a href="https://www.youtube.com/watch?v=ZWITA1pdHX4" style="font-family: Arial, Helvetica, sans-serif; font-size: 14px;" target="_blank">video</a><span style="background-color: white; color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif; font-size: 14px;">) and neural language models (one of the works, on morpheme segmentation using a Bi-LSTM model, cited the work of <a href="https://researchportal.helsinki.fi/en/persons/mathias-creutz" target="_blank">Mathias Creutz</a>, with whom we worked at AlphaSense in 2010-2016).</span></div>
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;">The last day of the conference focused on industrial applications of AI in NLProc. At the invitation of <a href="https://scholar.google.com/citations?user=j3PoGH4AAAAJ&hl=en" target="_blank">Lidia Pivovarova</a> (University of Helsinki), I presented the search engine and NLProc work we've done at <a href="https://www.alpha-sense.com/" target="_blank">AlphaSense</a>, including smart synonyms, sentiment analysis, named entity recognition and salience resolution, theme modelling and high-precision search.</span></span></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;">One of the challenges of the industrial presentation was that it had to last for 1.5 hours. If you consider that your audience can stay focused for only about 40 minutes, you have to do something other than 65 slides. I decided to make about 30 slides and then handle the rest of my talk with Q&A. The outcome surprised me: the audience did want to learn the details of the AlphaSense product, and the Q&A lasted for 50 minutes. I managed to answer quite a few questions by showing the product itself, which sparked genuine interest in understanding the UI of an AI product powering the financial industry. I hope it was beneficial for the audience to dive into the workflows of financial knowledge workers and to see how NLProc can help them handle their daily routine tasks better.</span></span></div>
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;"><br /></span></span>
<br />
<h2 style="text-align: left;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif; font-size: large;"><span style="background-color: white;">Customer Development Marathon</span></span></h2>
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;">Customer development is a topic that interests me from the point of view of product development. Just recently I learnt about the <a href="http://jobstobedone.org/">jobs-to-be-done</a> approach to discovering the real jobs that your customers hire your product for. One example with which Clayton Christensen of Harvard Business School <a href="https://youtu.be/Ei57yFEljrI?t=2118" target="_blank">motivates</a> this approach is the job that male consumers of milkshakes had on their way to work every day: stay engaged in life during monotonous driving and stay full until 10 a.m.</span></span></div>
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;">The conference (or <a href="https://www.facebook.com/events/457065744917151/" target="_blank">marathon</a>, as we called it) on customer development attracted 70 participants to the <a href="https://www.facebook.com/ihubworld/" target="_blank">iHUB</a> co-working center in Kyiv, Ukraine. Speakers from various established companies (<a href="https://www.facebook.com/YouScan/" target="_blank">YouScan</a>, <a href="https://www.facebook.com/macpaw/" target="_blank">MacPaw</a>, <a href="https://www.facebook.com/PromoRepublic/" target="_blank">PromoRepublic</a>, <a href="https://www.facebook.com/competera/" target="_blank">Competera</a>, <a href="https://www.facebook.com/AlphaSenseInc/" target="_blank">AlphaSense</a>, <a href="https://www.facebook.com/kyivstar/" target="_blank">Kyivstar</a>, <a href="https://www.facebook.com/terrasoft.ua/" target="_blank">Terrasoft</a>, <a href="https://www.facebook.com/pmlab.ua/" target="_blank">PMLab</a>, <a href="https://www.facebook.com/Portmone.com.ua/">Portmone.com</a>, <a href="https://www.facebook.com/weblium/">Weblium</a>, <a href="https://www.facebook.com/varusua/">VARUS</a>, <a href="https://www.facebook.com/SendPulseRu/">SendPulse</a>, <a href="https://www.facebook.com/pages/EVOcompany/2173722432886699?nr">EVO.company</a>) presented 5-minute talks about specific cases of engaging with their customers to grow conversion, retention and happiness with their products. Following the presentations, the discussion panels dug deeper into how to implement a customer-centric business. </span></span></div>
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;"><br /></span></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-3uIvS0Z0XXc/XfZfl2N7z2I/AAAAAAAA6ls/k803zHjKL7wiwviCfgGny2N31-pHvNn6gCLcBGAsYHQ/s1600/CustDev_Maria.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1200" data-original-width="1600" height="300" src="https://1.bp.blogspot.com/-3uIvS0Z0XXc/XfZfl2N7z2I/AAAAAAAA6ls/k803zHjKL7wiwviCfgGny2N31-pHvNn6gCLcBGAsYHQ/s400/CustDev_Maria.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Credit: Maria Kudinova</td></tr>
</tbody></table>
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;"><br /></span></span>
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;"><br /></span></span>
<br />
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;">We organized the marathon into 3 panels: </span></span></div>
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;"><br /></span></span></div>
<div style="text-align: justify;">
</div>
<ol>
<li><span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;">Idea. Analysis. Validation </span></span></li>
<li><span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;">Creation. Delivery. Launch.</span></span></li>
<li><span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;">Sales. Feedback. Innovation. </span></span></li>
</ol>
<br />
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;">Each of these panels focused on a particular stage of product development, from the idea to the post-sale feedback and innovation loop. The audience learnt how to conduct an efficient user interview, what tools help reach out to new or existing clients, how to avoid turning a product business into consulting or outsourcing, and how to establish internal company-wide communication to stay on the same page when shaping the product, marketing and sales around customer needs.</span></span></div>
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="color: #1c1e21; font-family: "arial" , "helvetica" , sans-serif;"><span style="background-color: white; font-size: 14px;">Both events were full of networking, meeting new and familiar faces from industry and academia, and learning a lot. Whatever you aspire to build next year, focusing on the real value and ease of use of your NLP / AI / search products, and thinking about what job your users hire your products for, will help you serve them better.</span></span></div>
</div>
Eight thoughts on revolutionary changes (Dmitry Kan, 2019-10-24)<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: center;">
<img height="216" src="https://miro.medium.com/max/1766/1*D2mb-Ap82yy3DVacjD42nA.png" width="400" /></div>
<br />
<div style="text-align: justify;">
<em class="mj"><a class="bn da mf mg mh mi" href="https://en.wikipedia.org/wiki/The_Martian_(film)" rel="noopener" target="_blank">The Martian</a></em><span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;"> is my favourite movie (and book), showing in action the excitement around engineering professions. Mark Watney, left alone on Mars, fights for his life with all the knowledge and skills he has, from botany to chemistry and physics. Of all engineering professions, software engineering is probably the most booming right now in light of Artificial Intelligence breakthroughs. But does this profession have ethical aspects that we as engineers and humans need to be continuously thinking about?</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;"><br />
</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;">I began to follow the work of </span><a class="bn da mf mg mh mi" href="https://www.ynharari.com/" rel="noopener" target="_blank">Yuval Noah Harari</a><span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;"> and his call to humanity about a potentially big issue we are facing. It has not yet dawned on many of us that we should start thinking about potential threats to how we operate today. Many of us focus on day-to-day activities and may not have enough time to look beyond.</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;"><br />
</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;">Right now I see two figures on the global scene who publicly speak with some urgency about AI, its potentially disruptive impact and the change it brings for civilisation. Yuval Noah Harari claims we are at a crossroads, about to get new types of human beings who will have supporting AI and biological improvements made to them. Elon Musk </span><a class="bn da mf mg mh mi" href="https://youtu.be/f3lUEnMaiAU?t=1840" rel="noopener" target="_blank">claims</a><span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;"> that AI is already way smarter than humans (take Go) and that we need to start thinking about how to control it. And AI keeps improving at tasks with higher and higher degrees of freedom (from checkers, through chess, to Go, each game allows a few orders of magnitude more degrees of freedom than the previous one). And so eventually AI will beat human beings in what is possible. One of the contemporary examples that touches me personally is robots ironing clothes or wiping up coffee spills with a high level of movement precision, similar to that of humans:</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;"><br />
</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;"><a href="https://twitter.com/pabbeel/status/1115642512603897856" target="_blank">https://twitter.com/pabbeel/status/1115642512603897856</a></span></div>
<div style="text-align: justify;">
<br /></div>
<br />
<blockquote class="twitter-tweet">
<div dir="ltr" lang="en">
Very excited to announce a project 3 years in the making: BLUE, a low-cost, safe, capable robot designed from the ground up with AI in mind. <br />
<br />
Want BLUE? You can now sign up for priority access here: <a href="https://t.co/bdK7ouRABK">https://t.co/bdK7ouRABK</a><br />
<br />
Learn more here: <a href="https://t.co/vB0i39aHTE">https://t.co/vB0i39aHTE</a><br />
<br />
w/<a href="https://twitter.com/Cthephen?ref_src=twsrc%5Etfw">@Cthephen</a> <a href="https://t.co/Ucxqyylzue">pic.twitter.com/Ucxqyylzue</a></div>
— Pieter Abbeel (@pabbeel) <a href="https://twitter.com/pabbeel/status/1115642512603897856?ref_src=twsrc%5Etfw">April 9, 2019</a></blockquote>
<script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script><br />
<br />
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;">But I’m sure Musk and Harari mean more than that.</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;"><br />
</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;">A simple example Harari gives: IoT devices will record your pulse / level of endorphins as you see your political leader, and so the government will know how happy and loyal you are towards your leadership. Another example is choosing what ads to show you depending on your sexual orientation, inferred from what you have written / read / watched (even well before you understand your orientation yourself).</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;"><br />
</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;">As more and more AI-powered robots take over routine tasks, we as humanity will have two development paths: grow complacent and become lazier, or seek creativity. The first path is always within reach, especially at a time when we would work 3 days a week, 4 hours a day (by Jack Ma’s </span><a class="bn da mf mg mh mi" href="https://youtu.be/f3lUEnMaiAU?t=1107" rel="noopener" target="_blank">prediction</a><span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;">). Will AI thus become even more dominant and take the lead over humanity? According to Musk, AI will eventually write its own software and will be far more efficient at it than modern AI engineers. And at some point humans, having slower interfaces to produce / consume data and knowledge, will be left behind by AI, and it can turn into a catch-up game.</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;"><br />
</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;">Given these potential issues that automation with AI poses, those of us who work on automation and AI are touching the ethical boundaries of our work. If you will, we are participating in the launch and acceleration of an AI revolution that might not yet be visible to everyone on the Planet. But we need to be aware that we are changing the fabric of society by rewriting job markets and the work skills in demand, and by allowing new types of human / robot elites to control the people around them with AI.</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: rgba(0 , 0 , 0 , 0.84); font-family: , "georgia" , "cambria" , "times new roman" , "times" , serif; font-size: 21px; letter-spacing: -0.084px;"><br />
</span></div>
<div class="lr ls ck aq lt b lu lv lw lx ly lz ma mb mc md me" data-selectable-paragraph="" id="f96c" style="background-color: white; box-sizing: inherit; color: rgba(0, 0, 0, 0.84); font-family: medium-content-serif-font, Georgia, Cambria, "Times New Roman", Times, serif; font-size: 21px; letter-spacing: -0.004em; line-height: 1.58; margin-bottom: -0.46em; margin-top: 2em;">
I would like to share a different, Planet-agnostic perspective on revolutions:</div>
<ol class="" style="background-color: white; box-sizing: inherit; color: rgba(0, 0, 0, 0.8); font-family: medium-content-sans-serif-font, -apple-system, system-ui, "Segoe UI", Roboto, Oxygen, Ubuntu, Cantarell, "Open Sans", "Helvetica Neue", sans-serif; list-style: none none; margin: 0px; padding: 0px;">
<li class="lr ls ck aq lt b lu lv lw lx ly lz ma mb mc md me ml mm mn" data-selectable-paragraph="" id="1a7b" style="box-sizing: inherit; color: rgba(0, 0, 0, 0.84); font-family: medium-content-serif-font, Georgia, Cambria, "Times New Roman", Times, serif; font-size: 21px; letter-spacing: -0.004em; line-height: 1.58; list-style-type: decimal; margin-bottom: -0.46em; margin-left: 30px; margin-top: 2em; padding-left: 0px;">Making an industry-level revolution is hard. There are many reasons, one of which is simply human laziness. Who in their right mind would want to change anything in the production process when it is comfortable as is?</li>
<li class="lr ls ck aq lt b lu mo lw mp ly mq ma mr mc ms me ml mm mn" data-selectable-paragraph="" id="75c1" style="box-sizing: inherit; color: rgba(0, 0, 0, 0.84); font-family: medium-content-serif-font, Georgia, Cambria, "Times New Roman", Times, serif; font-size: 21px; letter-spacing: -0.004em; line-height: 1.58; list-style-type: decimal; margin-bottom: -0.46em; margin-left: 30px; margin-top: 1.05em; padding-left: 0px;">The force of a revolution should be so strong that it overcomes individual and collective laziness / resistance, while it remains obvious to everyone involved that it is a change for the better.</li>
<li class="lr ls ck aq lt b lu mo lw mp ly mq ma mr mc ms me ml mm mn" data-selectable-paragraph="" id="2610" style="box-sizing: inherit; color: rgba(0, 0, 0, 0.84); font-family: medium-content-serif-font, Georgia, Cambria, "Times New Roman", Times, serif; font-size: 21px; letter-spacing: -0.004em; line-height: 1.58; list-style-type: decimal; margin-bottom: -0.46em; margin-left: 30px; margin-top: 1.05em; padding-left: 0px;">As we progress into the future, many of these (small and big) revolutions will make life easier, and hence more lazy participants will emerge.</li>
<li class="lr ls ck aq lt b lu mo lw mp ly mq ma mr mc ms me ml mm mn" data-selectable-paragraph="" id="4281" style="box-sizing: inherit; color: rgba(0, 0, 0, 0.84); font-family: medium-content-serif-font, Georgia, Cambria, "Times New Roman", Times, serif; font-size: 21px; letter-spacing: -0.004em; line-height: 1.58; list-style-type: decimal; margin-bottom: -0.46em; margin-left: 30px; margin-top: 1.05em; padding-left: 0px;">When laziness saturates, what direction will the next revolution take, and will it really optimise for making things globally sustainable (like climate or flying to Mars), or locally, catering to individuals’ needs and making us even more complacent?</li>
<li class="lr ls ck aq lt b lu mo lw mp ly mq ma mr mc ms me ml mm mn" data-selectable-paragraph="" id="f55d" style="box-sizing: inherit; color: rgba(0, 0, 0, 0.84); font-family: medium-content-serif-font, Georgia, Cambria, "Times New Roman", Times, serif; font-size: 21px; letter-spacing: -0.004em; line-height: 1.58; list-style-type: decimal; margin-bottom: -0.46em; margin-left: 30px; margin-top: 1.05em; padding-left: 0px;">This leaves true revolutionary breakthroughs to unsettled minds who challenge everything they see, which makes such people highly uncomfortable for the lazy ones to be around.</li>
<li class="lr ls ck aq lt b lu mo lw mp ly mq ma mr mc ms me ml mm mn" data-selectable-paragraph="" id="95b0" style="box-sizing: inherit; color: rgba(0, 0, 0, 0.84); font-family: medium-content-serif-font, Georgia, Cambria, "Times New Roman", Times, serif; font-size: 21px; letter-spacing: -0.004em; line-height: 1.58; list-style-type: decimal; margin-bottom: -0.46em; margin-left: 30px; margin-top: 1.05em; padding-left: 0px;">And naturally, the unsettled minds don’t have much time to enjoy the results of their doing (assuming the time span of a revolution is less than their lifespan).</li>
<li class="lr ls ck aq lt b lu mo lw mp ly mq ma mr mc ms me ml mm mn" data-selectable-paragraph="" id="2d38" style="box-sizing: inherit; color: rgba(0, 0, 0, 0.84); font-family: medium-content-serif-font, Georgia, Cambria, "Times New Roman", Times, serif; font-size: 21px; letter-spacing: -0.004em; line-height: 1.58; list-style-type: decimal; margin-bottom: -0.46em; margin-left: 30px; margin-top: 1.05em; padding-left: 0px;">Yet the lazy will eventually benefit from these revolutions and become lazier.</li>
<li class="lr ls ck aq lt b lu mo lw mp ly mq ma mr mc ms me ml mm mn" data-selectable-paragraph="" id="ff44" style="box-sizing: inherit; color: rgba(0, 0, 0, 0.84); font-family: medium-content-serif-font, Georgia, Cambria, "Times New Roman", Times, serif; font-size: 21px; letter-spacing: -0.004em; line-height: 1.58; list-style-type: decimal; margin-bottom: -0.46em; margin-left: 30px; margin-top: 1.05em; padding-left: 0px;">The question then is: how do we optimise for a global goal while involving as many people on the Planet as possible, so that knowledge and the results of revolutions are more evenly distributed?</li>
</ol>
<div class="lr ls ck aq lt b lu lv lw lx ly lz ma mb mc md me" data-selectable-paragraph="" id="e196" style="background-color: white; box-sizing: inherit; color: rgba(0, 0, 0, 0.84); font-family: medium-content-serif-font, Georgia, Cambria, "Times New Roman", Times, serif; font-size: 21px; letter-spacing: -0.004em; line-height: 1.58; margin-bottom: -0.46em; margin-top: 2em;">
I thank Derek Kannenberg and Tatiana Batanina for reading drafts of this essay and providing constructive feedback and thoughts.</div>
<div class="lr ls ck aq lt b lu lv lw lx ly lz ma mb mc md me" data-selectable-paragraph="" id="e196" style="background-color: white; box-sizing: inherit; color: rgba(0, 0, 0, 0.84); font-family: medium-content-serif-font, Georgia, Cambria, "Times New Roman", Times, serif; font-size: 21px; letter-spacing: -0.004em; line-height: 1.58; margin-bottom: -0.46em; margin-top: 2em;">
<em class="mj" style="box-sizing: inherit; letter-spacing: -0.084px;">Originally published at </em><a class="bn da mf mg mh mi" href="https://www.linkedin.com/pulse/eight-thoughts-revolutionary-changes-dmitry-kan/" rel="noopener" style="-webkit-tap-highlight-color: transparent; box-sizing: inherit; letter-spacing: -0.084px; text-decoration-line: none;" target="_blank"><em class="mj" style="box-sizing: inherit;">https://www.linkedin.com</em></a><em class="mj" style="box-sizing: inherit; letter-spacing: -0.084px;">.</em></div>
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-50287394622928738082019-04-29T16:10:00.000+03:002019-05-06T17:29:16.092+03:00Company culture<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Long gone are the days when company culture did not matter or was a second-class citizen. Today, when choosing a company to work for, above all you choose the culture (maybe without even realizing it, thinking that you are after <i>technology </i>or <i>product</i>). When you look at job openings or office photos with smiling employees and a generally cheerful atmosphere, you will likely not see the culture of the company. You may get a glimpse of it during the interview process, but it is not enough.</div>
<div style="text-align: justify;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://www.incimages.com/uploaded_files/image/970x450/getty_485778829_9705939704500128_57914.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="371" data-original-width="800" height="148" src="https://www.incimages.com/uploaded_files/image/970x450/getty_485778829_9705939704500128_57914.jpg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Credit: <a href="https://www.inc.com/marla-tabaka/7-elements-of-a-great-company-culture.html">https://www.inc.com/marla-tabaka/7-elements-of-a-great-company-culture.html</a></td></tr>
</tbody></table>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
Why is culture important?</h3>
<div>
What is culture? Citing <a href="https://en.wikipedia.org/wiki/Culture" target="_blank">Wikipedia</a>:</div>
<div>
<br /></div>
<blockquote class="tr_bq">
Culture (<span class="nowrap" style="background-color: white; color: #222222; font-family: sans-serif; font-size: 14px; white-space: nowrap;"><span class="IPA nopopups noexcerpt"><a href="https://en.wikipedia.org/wiki/Help:IPA/English" style="background: none; color: #0b0080; text-decoration-line: none !important;" title="Help:IPA/English">/<span style="border-bottom: 1px dotted;"><span title="/ˈ/: primary stress follows">ˈ</span><span title="'k' in 'kind'">k</span><span title="/ʌ/: 'u' in 'cut'">ʌ</span><span title="'l' in 'lie'">l</span><span title="/tʃ/: 'ch' in 'China'">tʃ</span><span title="/ər/: 'er' in 'letter'">ər</span></span>/</a>) is the social behavior and norms found in human societies.</span></span></blockquote>
<br />
<div style="text-align: justify;">
To me, company culture boils down to everyday activities, like running projects, exchanging information and planning. I do not think that culture can be imposed. Observing it and declaring core values, however, makes sense.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
You can make a simple test to see the edges of your culture: if two employees who share the same language are talking in the kitchen and a third one -- of a different nationality -- enters, will the first two switch to a language common to all three? You can argue they don't have to. And this is where the culture begins: is it inclusive? Is it about socializing together or in smaller groups? This in turn will most likely affect the level of collaboration among these groups during real projects.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Beyond language, there are many aspects of culture that directly impact the results of a company. Take decision making for one. When conflicting parties meet to discuss a pressing matter -- how will they exchange ideas? In what fashion will they criticize each other's ideas?<br />
<br />
<blockquote class="twitter-tweet" data-lang="en">
<div dir="ltr" lang="en">
How to criticize successfully: <a href="https://t.co/m91LVIReP5">pic.twitter.com/m91LVIReP5</a></div>
— Nat Friedman (@natfriedman) <a href="https://twitter.com/natfriedman/status/1117155734611824640?ref_src=twsrc%5Etfw">April 13, 2019</a></blockquote>
<script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script><br />
<br />
Why is all this of such importance? Well, it depends. Some will say -- "we don't care about the internal kitchen of how a result was achieved". But you can also ask yourself: what kind of place would you like to work at? Is it somewhere everyone contributes their share and wants to be heard? Or is it the place where everyone (100% inclusively) knows that sharing ideas will be supported no matter how smart or stupid an idea is? "What does it matter, if the result is what it is" -- you may argue. If you are building a great place to work, you would like to make it great for everyone, not just a few. And the hardest part is to re-evaluate your culture as if you had just joined the company (really hard, I know).<br />
<br />
<h3>
Ingredients of a solid culture</h3>
<div>
Building a good culture is a <i>process</i> that should evolve. There will be new people joining and new teams forming around these people. Having worked in several small and mid-size companies (with hundreds of employees), I have found that the following ingredients of company culture really make a difference, specifically for IT product companies.<br />
<br />
1. <b>Readiness to support</b>. Do your engineers willingly go the extra mile to support each other, beyond sprint meetings, calls with product managers and status updates? Do they walk around the office / scan the work chat and ask in passing if they can help? A culture of support prompts a lot of things to happen, like idea formation, knowledge sharing and a generally positive vibe in the office / chat. Combined, this makes work not only fun, but also more efficient. What to watch for here is "abuse" of helpers. Support is best measured and rewarded in some way, to see whether everyone tunes in to the same way of collaborating.<br />
<br />
2. <b>Ego and the ability to acknowledge your own mistakes</b>. It is not uncommon for knowledgeable people to speak up / show off just because they carry the knowledge, and they may come across as arrogant. The cultural aspect to watch is ego: the less, the better for <i>internal communication / collaboration. </i>This way you achieve a higher level of inclusiveness -- everyone can learn from each other without mental punishment and is free to influence larger architectures. Connected with low ego is the rare skill of acknowledging one's own mistakes. First, it shows that you value the learning process. Second, it shows that you are human, not a metal robot that only improves: you can make a mistake and communicate that freely. This sets a great example for peers: making mistakes is not going to produce career-impacting drama. Moreover, mistakes can even be rewarded, because each one is a crucial bit of knowledge. Share it at a weekly demo! You may save your team time.<br />
<br />
3. <b>Knowledge sharing sessions</b>. A surprising amount of information flies past engineers who are not involved in a particular project. Knowledge sharing sessions are the key, and not only within the team / adjacent departments, but at the company level too. They are the venue to convey a large message, such as a process update on how tickets are filed, a way to document your component / feature, or a way to break down a release. Building on the first two points, they are also a way to share a painfully earned lesson or a glorious bit of system design that would not be acknowledged by pretty much anybody unless light is shed on it.<br />
<br />
4. <b>Meaningful meetings</b>. Meetings without prep are time eaters. Save the 30 minutes a prep-less meeting would take and ask the relevant parties to prepare for the next one. If you can avoid a meeting, do! It is way better for an engineer to go read a blog post on some tech / algorithm / system or spend extra time figuring out more test cases for their code. Don't waste their time by asking for their statuses, unless it leads to a good discussion. There are other ways to share statuses, over work chat for instance. How can making meetings king kill your culture? People will evade them or use smartphones to mentally "fly away" while the other dude on the team shares a status. If you do hold such meetings, cap them at 15 min and ask everyone to put down their phones, listen and participate.<br />
<br />
5. <b>Culture of retros</b>. Retrospectives (after a milestone or project completion) are a great way to achieve two things: a. Understand what went wrong and plan how to improve it. b. Release stress after a tough milestone / project. Mentioning in passing, over chat / email / call, that something needs improvement will lead to a 0% positive outcome.<br />
<br />
6. <b>Have all folks in the company equally accessible</b>. The early days of a startup enjoy full connectedness. It is so easy for everybody to lean over to the next desk and ask a question. The bigger the company, the harder it becomes: vastly different time zones, narrow focus in teams, "busyness" syndromes. If you are a top manager, make every effort to be available for a chat. It will help to retain a good level of bonding and will help information circulate in the company. Consider it a constant retraining of your staff. Sharing info in memos? It might be the only way in a company with a triple-digit headcount, but personal 1-1's are better.<br />
<br />
7. <b>Maintain wide focus</b>. The issue I've seen in IT product companies (it can apply to other industries too) is a narrowing of focus over time. The winning argument is: it helps with velocity of development. But there is a downside too: engineers will tend to narrow down their view of the product and eventually degrade as professionals. Generating ideas on how to improve the product and practicing dogfooding are the gateways to keeping engineers motivated, learning and contributing. Roles are great because they identify the responsible drivers of a particular functionality. But input and a feedback loop from thinkers and tinkerers (engineers) can push the product to new frontiers.<br />
<br />
<br />
I hope these culture ingredients are useful to consider in your company. What other large cultural aspects do you maintain in your company? Feel free to share!</div>
</div>
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com2tag:blogger.com,1999:blog-7093676993132865793.post-17231240374953070822018-12-12T15:46:00.003+02:002019-01-18T13:59:30.147+02:00Automatic writing with Deep Learning: Progress<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
This is a continuation of the post <a href="https://dmitrykan.blogspot.com/2018/05/automatic-writing-with-deep-learning.html">https://dmitrykan.blogspot.com/2018/05/automatic-writing-with-deep-learning.html</a>. This item was reblogged at Writer's DZone: <a href="https://dzone.com/articles/automatic-writing-with-deep-learning-progress">https://dzone.com/articles/automatic-writing-with-deep-learning-progress</a><br />
<br />
Fast forward a few months (apologies for the delay), I can share some findings.</div>
<div style="text-align: justify;">
Again, I think we should take AI co-writer exercises with a grain of salt. However, during this time I have come across practical use cases for such systems.</div>
<br />
<div style="text-align: justify;">
One of them is augmentation of a news article writer. More specifically, when writing a news item, one of the most challenging tasks is to coin a catchy title. Does the title have some trendy phrases in it? Or perhaps it mentions an emerging topic, that captures attention at this given moment? Or reuses a pattern that worked well for this given author? Or just spurs an idea in the author's head?</div>
<div style="text-align: justify;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://www.rogerwilco.co.za/sites/default/files/inline-images/robot%20writer.jpeg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="450" data-original-width="800" height="180" src="https://www.rogerwilco.co.za/sites/default/files/inline-images/robot%20writer.jpeg" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Copyright: <a href="https://www.rogerwilco.co.za/blog/robot-writers-how-ai-will-affect-copywriting">https://www.rogerwilco.co.za/blog/robot-writers-how-ai-will-affect-copywriting</a></td></tr>
</tbody></table>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In the following exercise I have set a very modest goal: train a co-writer on previously written texts and see whether it can suggest something useful from them. I can imagine this being extended to texts that are trending or to a collection of particularly interesting titles. What have you.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
To train such a model I have used Robin Sloan's RNN writer: <a href="https://github.com/robinsloan/rnn-writer">https://github.com/robinsloan/rnn-writer</a>. The goodies of the project are:</div>
<div style="text-align: justify;">
</div>
<ul>
<li>Trained on <a href="http://torch.ch/" target="_blank">Torch</a>. Nowadays, Torch is leveraged via <a href="https://pytorch.org/" target="_blank">PyTorch</a>, a deep learning Python library that is nearing production readiness.</li>
<li>The trained model is exposed in <a href="https://atom.io/" target="_blank">Atom</a>, a pluggable editor (I'd imagine real writers would want to have the model integrated into their favourite editor, like Word).</li>
<li>An API is available too, for integrating into custom apps (and this is exactly how it is integrated with Atom).</li>
</ul>
<div>
<br /></div>
<div>
I will skip the installation of Torch and training the network and proceed to examples. The rnn-writer github repository has a good set of instructions to proceed with. I have installed Torch and trained the model on a Mac.<br />
<br />
First things first: RNN trained on my Master's Thesis "Design and Implementation of Peer-to-Peer Network" (University of Kuopio, 2007).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i9.ytimg.com/vi/XvFMIIjBHa0/default.jpg?sqp=COyNu-AF&rs=AOn4CLAaP9Ens33IEHWmWbNVi-DQ31Z0jg" frameborder="0" height="266" src="https://www.youtube.com/embed/XvFMIIjBHa0?feature=player_embedded" width="320"></iframe></div>
<br />
<div style="text-align: justify;">
The text of the Master's Thesis is about 50 pages in English with diagrams and formulas. On the one hand, more data lets a neural network learn more word representations and gives it a larger probability space from which to predict the next word given the current word or phrase. On the other hand, limiting the input corpus to phrases with a specific domain goal, like writing an email, could yield a clean set of phrases that a user employs in many typical email passages.</div>
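<div style="text-align: justify;">
To make "predict the next word given the current word" concrete, here is a minimal sketch. It is not the RNN from rnn-writer: it swaps in a plain bigram count model as a stand-in, and the function names and the tiny corpus are made up for illustration only.</div>

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for every word, which words follow it in the corpus."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def suggest_next(followers, word, k=3):
    """Return up to k most frequent continuations with estimated probabilities."""
    counts = followers.get(word.lower())
    if not counts:
        return []  # the word was never seen with a continuation
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(k)]


# Hypothetical toy corpus; a real co-writer would be trained on far more text.
corpus = "the peer sends a query . the peer receives a reply . the network routes a query"
model = train_bigrams(corpus)
print(suggest_next(model, "the"))
```

<div style="text-align: justify;">
A larger corpus fills in more rows of this table, while a narrow domain corpus keeps each row short and focused, which is the trade-off described above.</div>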
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
As I got access to Fox articles, I thought this could warrant another RNN model and a test. Something to share next time.</div>
<div style="text-align: justify;">
<br /></div>
</div>
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-2109373497428017022018-05-06T19:54:00.000+03:002018-05-27T18:42:21.846+03:00Automatic writing with Deep Learning: Preface<div dir="ltr" style="text-align: left;" trbidi="on">
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 12.0px 'Helvetica Neue'; color: #454545}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; text-align: justify; font: 12.0px 'Helvetica Neue'; color: #454545}
</style>
<br />
<div class="p1">
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;">This article was also reblogged at: <a href="https://dzone.com/articles/automatic-writing-with-deep-learning-preface">https://dzone.com/articles/automatic-writing-with-deep-learning-preface</a></span><br />
<br />
<br />
<span style="font-family: "verdana" , sans-serif;">Quite a few machine and deep learning problems are directed at building a mapping function of roughly the following form:</span></div>
</div>
<div class="p1">
<br /></div>
<div class="p1">
<br /></div>
<div class="p1">
<span style="font-family: "verdana" , sans-serif;">Input <b>X</b> ---> Output <b>Y</b>,</span></div>
<div class="p1">
<br /></div>
<div class="p1">
<br /></div>
<div class="p2">
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;">where:</span></div>
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;"><b>X</b> is some sort of an object: an email text, an image, a document; </span></div>
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;"><b>Y</b> is either a single class label from a finite set of labels (like spam / no spam, a detected object or a cluster name for this document) or some number, like next month's salary or a stock price.</span></div>
</div>
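<div class="p2">
As a rough illustration of the X ---> Y mapping for the spam / no spam case, here is a minimal sketch. The training examples, function names and the word-overlap scoring are assumptions made up for this example; a real system would use a proper learning algorithm and much more data.</div>

```python
from collections import Counter


def train(examples):
    """Collect per-label word counts from (text, label) pairs."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def predict(counts, text):
    """Map input X (a text) to output Y: the label whose training words overlap most."""
    words = text.lower().split()
    return max(counts, key=lambda label: sum(counts[label][w] for w in words))


# Hypothetical training data: X is an email text, Y is its label.
examples = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "no spam"),
    ("project status and agenda", "no spam"),
]
model = train(examples)
print(predict(model, "free prize inside"))
print(predict(model, "agenda for the meeting"))
```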
<div class="p1">
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
</div>
<div class="p2">
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;">While such tasks can be daunting to solve (like sentiment analysis or predicting stock prices in realtime), they require rather clear steps to achieve good levels of mapping accuracy. Here I'm not discussing situations with a lack of training data to cover the modelled phenomenon or poor feature selection.</span></div>
</div>
<div class="p2">
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
</div>
<div class="p2">
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;">In contrast, somewhat less straightforward areas of AI are the tasks that challenge you to predict structures as fuzzy as words, sentences or complete texts. What are the examples? Machine translation for one, natural language generation for another. One may argue that transcribing audio to text is also such a type of mapping, but I'd argue it is not. Audio is a "wave" and speech recognition is a reasonably well solved task (with state-of-the-art accuracy above 90%); however, such an algorithm does not capture the meaning of the produced text, except where it is necessary to disambiguate what was said. Again, I have to make it clear that the audio->text problem is not at all easy and has its own intricacies, like handling speaker self-corrections, noise and so on.</span></div>
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-gRDU-NZtsk4/Wu8zA33LXgI/AAAAAAAAsPY/gJGRAR-8lZEQBHLDikMp7DGdJfJtkLvIwCLcBGAs/s1600/ae10a4e8340382b878a6a534f105f2fcxxl.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1073" data-original-width="1280" height="268" src="https://3.bp.blogspot.com/-gRDU-NZtsk4/Wu8zA33LXgI/AAAAAAAAsPY/gJGRAR-8lZEQBHLDikMp7DGdJfJtkLvIwCLcBGAs/s320/ae10a4e8340382b878a6a534f105f2fcxxl.jpg" width="320" /></a></div>
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
</div>
<div class="p2">
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
</div>
<div class="p2">
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;">Lately, the task of writing texts with a machine (e.g. <a href="https://medium.com/deep-writing/how-to-write-with-artificial-intelligence-45747ed073c" target="_blank">here</a></span><span style="text-align: left;"><span style="font-family: "verdana" , sans-serif;">)</span></span><span style="font-family: "verdana" , sans-serif;"> caught my eye on Twitter. Previously, papers from Google on writing <a href="https://research.google.com/pubs/pub36745.html" target="_blank">poetry</a> or other text-producing software gave me creepy feelings. I somehow underestimated the role of such algorithms in the space of natural language processing and language understanding and saw only diminishing value of such systems to users. Again, any challenging task might be solved and even bring value to solving other challenging tasks. But who would use an automatic poetry writing system? Why would somebody, I thought, use these systems -- just for fun? My practical mind battled against such "fun" algorithms. Again, making an AI/NLProc system capable of producing anything sensible is hard. Take the task of sentiment analysis, where it is quite unclear what the agreement between experts is, not to mention non-experts.</span></div>
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: "verdana" , sans-serif;">I think this post has poured enough text onto the heads of my readers. I will use this post as a self-motivating mechanism to continue the research with systems producing text. My target is to complete the neural network training on the text of my Master's thesis and show you some examples so you can judge the usefulness of such systems.</span></div>
</div>
<br /></div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-25728611034053561622018-05-05T21:44:00.000+03:002018-05-05T21:44:38.624+03:00AI for lip reading<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<span style="font-family: Verdana, sans-serif;">It is exciting to push your imagination about where else you can apply AI, machine learning and most certainly deep learning, which is so popular these days. I came across <a href="https://www.quora.com/How-would-you-train-a-convolutional-LSTM-network-to-lip-read/answer/Dmitry-Kan" target="_blank">this question on Quora</a> that provoked me to think a bit about how one would go about training a neural network to lip read. I don't actually know what made me answer this question more: that I found myself in an unusual context, sitting at an AngularJS meetup at Google offices in New York City (after work, the usual level of tired), or the question itself. Whatever the reason, here is my answer:</span></div>
<div style="text-align: justify;">
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://images.theconversation.com/files/89910/original/image-20150728-11549-1xw04nt.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=926&fit=clip" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="461" data-original-width="800" height="230" src="https://images.theconversation.com/files/89910/original/image-20150728-11549-1xw04nt.jpg?ixlib=rb-1.1.0&q=45&auto=format&w=926&fit=clip" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Source: <a href="http://theconversation.com/our-lip-reading-technology-promises-to-make-hearing-aids-more-human-45166">http://theconversation.com/our-lip-reading-technology-promises-to-make-hearing-aids-more-human-45166</a></td></tr>
</tbody></table>
<div style="text-align: justify;">
<span style="font-family: Verdana, sans-serif;"><br /></span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: #333333; font-family: Verdana, sans-serif; font-size: 15px;">I would probably first start with formalizing what the lip reading process is, as an algorithm a human can understand. Maybe it is worth talking to a professional, like a spy or something. Obviously you need training data. Understanding what lip reading is from the algorithm perspective will affect what data you need.</span></div>
<div style="text-align: justify;">
<span style="background-color: white; color: #333333; font-size: 15px;"><span style="font-family: Verdana, sans-serif;"><br /></span></span></div>
<div style="text-align: justify;">
</div>
<br />
<ol style="text-align: left;">
<li><span style="font-family: Verdana, sans-serif;">To read a word of several syllables you’d need a sequence of anchor lip positions, that represent syllables. Or probably vowels / consonants. See, I don’t know, which one is best. But you’d need to start with the lowest level possible out of which you can compose larger sequences, like letters -> syllables -> words. Let’s call these states.</span></li>
<li><span style="font-family: Verdana, sans-serif;">A particular lip posture (is that the right word?) will most probably map to ambiguous states.</span></li>
<li><span style="font-family: Verdana, sans-serif;">Now the interesting part is how to resolve the ambiguities. Number 2 produces several options. Out of these you can produce a multitude of words that we can call candidates.</span></li>
<li><span style="font-family: Verdana, sans-serif;">Then you need to score candidates based on some local context information. Here it turns into a natural language understanding problem.</span></li>
<li><span style="font-family: Verdana, sans-serif;">I'd start with seq2seq.</span></li>
</ol>
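<div style="text-align: justify;">
<span style="font-family: Verdana, sans-serif;">Steps 1-4 above can be sketched in a few lines. This is only a toy stand-in, not a seq2seq model: the posture names, the posture-to-letter table, the vocabulary and the context counts are all hypothetical and heavily simplified.</span></div>

```python
from itertools import product

# Hypothetical mapping: one lip posture (state) -> letters that look alike on the lips.
POSTURE_TO_LETTERS = {
    "closed_lips": "pbm",
    "open_wide": "a",
    "tongue_ridge": "tdn",
}

# Hypothetical vocabulary used to filter letter sequences down to real words.
VOCABULARY = {"pat", "bat", "mat", "pad", "bad", "mad", "pan", "ban", "man"}

# Hypothetical context scores: how often a word follows the previous word.
CONTEXT_COUNTS = {("baseball", "bat"): 5, ("door", "mat"): 4, ("old", "man"): 6}


def candidates(postures):
    """Steps 2-3: expand ambiguous postures into every matching vocabulary word."""
    letter_options = [POSTURE_TO_LETTERS[p] for p in postures]
    words = ("".join(letters) for letters in product(*letter_options))
    return [w for w in words if w in VOCABULARY]


def best_word(postures, previous_word):
    """Step 4: score the candidates using local context (the previous word)."""
    return max(
        candidates(postures),
        key=lambda w: CONTEXT_COUNTS.get((previous_word, w), 0),
    )


observed = ["closed_lips", "open_wide", "tongue_ridge"]
print(candidates(observed))
print(best_word(observed, "baseball"))
```

<div style="text-align: justify;">
<span style="font-family: Verdana, sans-serif;">A single three-posture sequence already yields nine candidate words here, which is why the context scoring in step 4 carries most of the weight.</span></div>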
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-63231408992627050422018-01-16T13:43:00.000+02:002018-01-17T15:48:29.830+02:00New Luke on JavaFX<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Hello and Happy New Year to my readers!</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I'm happy to announce the release of a completely reimplemented Luke, built with JavaFX. Luke is the toolbox for analyzing and maintaining your Lucene / Solr / Elasticsearch index at a low level. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The implementation was contributed by <a href="https://twitter.com/moco_beta" target="_blank">Tomoko Uchida</a>, who also did the honors of <a href="https://github.com/DmitryKey/luke/releases/tag/luke-javafx-7.2.0" target="_blank">releasing</a> it.</div>
<br />
<div style="text-align: justify;">
The excitement of this release is supported by the fact that in this version Luke becomes fully compliant with the <b>ALv2 license</b>! And it gets very close to being contributed to the Lucene project. At this point we need lots of testing to make sure the JavaFX version is on par with the original Thinlet-based one.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Here is what the load index screen looks like in the new JavaFX Luke:</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-z8U5AVihP2Y/WlMraMzYmeI/AAAAAAAAp0o/P4Q-0lhTjjgaE8uJLWnRzAEvby0pAKWkgCLcBGAs/s1600/Screen%2BShot%2B2018-01-08%2Bat%2B10.26.10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="671" data-original-width="900" height="238" src="https://2.bp.blogspot.com/-z8U5AVihP2Y/WlMraMzYmeI/AAAAAAAAp0o/P4Q-0lhTjjgaE8uJLWnRzAEvby0pAKWkgCLcBGAs/s320/Screen%2BShot%2B2018-01-08%2Bat%2B10.26.10.png" width="320" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After navigating to the Solr 7.1 index and pressing OK, here is what luke shows:</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-JqlGbqkXncI/WlMrjdjCGHI/AAAAAAAAp0s/MRqsdmFNOXIlbMPPxddv2wms2cJFBrZLQCLcBGAs/s1600/Screen%2BShot%2B2018-01-08%2Bat%2B10.26.38.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="670" data-original-width="898" height="238" src="https://1.bp.blogspot.com/-JqlGbqkXncI/WlMrjdjCGHI/AAAAAAAAp0s/MRqsdmFNOXIlbMPPxddv2wms2cJFBrZLQCLcBGAs/s320/Screen%2BShot%2B2018-01-08%2Bat%2B10.26.38.png" width="320" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I have loaded an index of Finnish wikipedia with 1,069,778 documents, and luke tells me that the index does not have deletions and was not optimized. Let's go ahead and optimize it:</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-E9H7qEzfgi8/WlMr5STQ6aI/AAAAAAAAp0w/3yApiSN89yw-tbPmB91NkvlRqZ75_vjEgCLcBGAs/s1600/Screen%2BShot%2B2018-01-08%2Bat%2B10.29.11.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="670" data-original-width="899" height="238" src="https://1.bp.blogspot.com/-E9H7qEzfgi8/WlMr5STQ6aI/AAAAAAAAp0w/3yApiSN89yw-tbPmB91NkvlRqZ75_vjEgCLcBGAs/s320/Screen%2BShot%2B2018-01-08%2Bat%2B10.29.11.png" width="320" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-xuN-Hcwltl0/WlMsGr4TcuI/AAAAAAAAp08/sdi12QW5UTUJXW0_lCTOmr6ffPpi5cLTACLcBGAs/s1600/Screen%2BShot%2B2018-01-08%2Bat%2B10.29.48.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="618" data-original-width="703" height="281" src="https://1.bp.blogspot.com/-xuN-Hcwltl0/WlMsGr4TcuI/AAAAAAAAp08/sdi12QW5UTUJXW0_lCTOmr6ffPpi5cLTACLcBGAs/s320/Screen%2BShot%2B2018-01-08%2Bat%2B10.29.48.png" width="320" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Notice that in this dialog you can request only the expunging of deleted docs, without merging (the costly part for large indices). Once the optimization is complete, you'll have a full log of actions in front of you to confirm the operation was successful:</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-3dpIVHt4iCU/WlMsfVhmAlI/AAAAAAAAp1A/mhyDOX25ihgxi5m_O0NuEqpnsC6ViJWrACLcBGAs/s1600/Screen%2BShot%2B2018-01-08%2Bat%2B10.31.18.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="645" data-original-width="834" height="247" src="https://3.bp.blogspot.com/-3dpIVHt4iCU/WlMsfVhmAlI/AAAAAAAAp1A/mhyDOX25ihgxi5m_O0NuEqpnsC6ViJWrACLcBGAs/s320/Screen%2BShot%2B2018-01-08%2Bat%2B10.31.18.png" width="320" /></a></div>
<div style="text-align: justify;">
<br /></div>
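<div style="text-align: justify;">
To illustrate the difference between the two options in the optimize dialog, here is a toy model in Python. It is only a sketch under my own assumptions: an index is modelled as a list of segments, each segment as a dict from doc id to a "deleted" flag, and the function names are made up for the illustration. As far as I know, the corresponding Lucene calls are IndexWriter.forceMergeDeletes() and IndexWriter.forceMerge(int), and their real behaviour also depends on the merge policy.</div>

```python
# Toy model, not Lucene code: each segment maps doc id -> deleted flag.

def expunge_deletes(segments):
    """Rewrite only the segments that contain deleted docs; leave the rest as-is."""
    result = []
    for seg in segments:
        if any(seg.values()):
            # Drop deleted docs from this segment only.
            seg = {doc: False for doc, deleted in seg.items() if not deleted}
        if seg:  # a segment that held only deleted docs disappears
            result.append(seg)
    return result


def force_merge(segments):
    """Rewrite every live doc into one single segment (the costly part)."""
    merged = {doc: False
              for seg in segments
              for doc, deleted in seg.items()
              if not deleted}
    return [merged] if merged else []


index = [{1: False, 2: True}, {3: False, 4: False}, {5: True}]
print(expunge_deletes(index))  # segments without deletions are untouched
print(force_merge(index))      # everything ends up in one segment
```

<div style="text-align: justify;">
In this model expunging touches only the first and the last segment, while the full merge rewrites all live documents, which is why it is the expensive operation on large indices.</div>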
<div style="text-align: justify;">
You can also check the health of your index via the Tools -> Check index menu item:</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-EapW6H_wQ0s/WlNbN8p9_fI/AAAAAAAAp1Y/oK1Md4DQMqc9C1atCcE-TtnU6HZ5DLEhgCLcBGAs/s1600/Screen%2BShot%2B2018-01-08%2Bat%2B13.50.31.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="747" data-original-width="708" height="320" src="https://1.bp.blogspot.com/-EapW6H_wQ0s/WlNbN8p9_fI/AAAAAAAAp1Y/oK1Md4DQMqc9C1atCcE-TtnU6HZ5DLEhgCLcBGAs/s320/Screen%2BShot%2B2018-01-08%2Bat%2B13.50.31.png" width="303" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Let's move to the Search tab. It has changed slightly: the search box has moved to the right, while the search settings and other knobs have moved to the left.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Thinlet version:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-q7wyeIL-Bak/WlSE8e16pHI/AAAAAAAAp2M/fxpmI__WxsQpKhE4_huIC_s3dQ4rosOygCLcBGAs/s1600/Screen%2BShot%2B2018-01-09%2Bat%2B11.00.08.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="669" data-original-width="849" height="252" src="https://4.bp.blogspot.com/-q7wyeIL-Bak/WlSE8e16pHI/AAAAAAAAp2M/fxpmI__WxsQpKhE4_huIC_s3dQ4rosOygCLcBGAs/s320/Screen%2BShot%2B2018-01-09%2Bat%2B11.00.08.png" width="320" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
JavaFX version:</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-lNM4XXym2VQ/WlMtf3Gxl3I/AAAAAAAAp1E/4ujRhnhv5a8qi1cojPIScIssiaY1GZL2ACLcBGAs/s1600/Screen%2BShot%2B2018-01-08%2Bat%2B10.32.29.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="811" data-original-width="919" height="282" src="https://2.bp.blogspot.com/-lNM4XXym2VQ/WlMtf3Gxl3I/AAAAAAAAp1E/4ujRhnhv5a8qi1cojPIScIssiaY1GZL2ACLcBGAs/s320/Screen%2BShot%2B2018-01-08%2Bat%2B10.32.29.png" width="320" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The UI is now more intuitive in terms of access to various tools like Analyzer, Similarity (now with access to the parameters of the BM25 ranking model, which became the default in Lucene and is the default in luke) and More Like This. There is a new Sort sub-tab that lets you choose a primary and a secondary field to sort on. The Collectors tab, however, is gone: please let us know if you used it for some task, we would love to learn about it.</div>
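<div style="text-align: justify;">
To show what the BM25 parameters do, here is a minimal sketch of the textbook BM25 term score in Python. The function and argument names are my own; k1=1.2 and b=0.75 are the commonly used defaults (and, as far as I know, Lucene's as well). Lucene's actual implementation differs in details such as how the document length is encoded, so treat this as an illustration of the formula, not of luke's internals.</div>

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of a single term in a single document.

    k1 controls how quickly repeated occurrences of the term saturate,
    b controls how strongly long documents are penalised (0 = not at all).
    """
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


# Same term, same frequency, but the second document is twice as long.
short_doc = bm25_term_score(tf=2, doc_len=100, avg_doc_len=100,
                            doc_count=1000, doc_freq=10)
long_doc = bm25_term_score(tf=2, doc_len=200, avg_doc_len=100,
                           doc_count=1000, doc_freq=10)
print(short_doc, long_doc)
```

<div style="text-align: justify;">
With b set to 0 both documents would get the same score, and raising k1 lets repeated occurrences of a term keep adding to the score for longer.</div>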
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Moving on to the Analysis tab, I'd like to draw your attention to the really cool functionality of loading custom jars with your implementation of a character filter, tokenizer or token filter to form your custom analyzer. Test these right in the luke UI without the need to reload shards in your Solr / Elasticsearch installation:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-v_SUYOm5F04/Wl3juJ3HudI/AAAAAAAAp40/Qr2v5av_v_AXrghu1M0_VAmkbJ_5B9mxwCLcBGAs/s1600/Screen%2BShot%2B2018-01-16%2Bat%2B13.35.21.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="672" data-original-width="895" height="240" src="https://3.bp.blogspot.com/-v_SUYOm5F04/Wl3juJ3HudI/AAAAAAAAp40/Qr2v5av_v_AXrghu1M0_VAmkbJ_5B9mxwCLcBGAs/s320/Screen%2BShot%2B2018-01-16%2Bat%2B13.35.21.png" width="320" /></a></div>
<br /></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Last but not least is the Logs tab. You have probably been missing it for as long as luke has existed: it gives you a handle on what's happening behind the scenes, both in error cases and during normal operation.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In addition, this version of Luke supports the recently released Lucene 7.2.0.</div>
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com4tag:blogger.com,1999:blog-7093676993132865793.post-89657234991105762062017-11-01T21:30:00.000+02:002017-11-01T21:30:23.082+02:00Will deep learning make other machine learning algorithms obsolete?<div dir="ltr" style="text-align: left;" trbidi="on">
The fourth (fifth?) <a href="https://dmitrykan.blogspot.fi/search/label/quora-answers" target="_blank">quoranswer</a> is here! This time we'll talk a bit about deep learning and its role in making other state of the art machine learning methods obsolete.<br />
<br />
<br />
<h3 style="text-align: left;">
Will deep learning make other machine learning algorithms obsolete?</h3>
<br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">I will try to take a look at the question from the natural language processing perspective.</span><br />
<br />
<div style="text-align: justify;">
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">There is a class of problems in NLProc that might not benefit from deep learning (DL), at least not directly. For the same reasons, classical machine learning (ML) cannot help so easily either. I will give three examples, which share more or less the same property of being hard to model with ML or DL alone:</span></div>
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<div style="text-align: justify;">
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">1. Identifying and analyzing sentiment polarity directed at a particular object: a person, a brand etc. Example: I like phoneX, but dislike phoneY. If you monitor sentiment for phoneX, you'll expect this message to count as positive, while for phoneY you'll expect it to count as negative. One can argue that this is easy / doable with ML / DL, but I doubt you can stay solely within that framework. Most probably you'll need a hybrid with a rule-based system, syntactic parsing etc, which somewhat defeats the purpose of DL: being able to train a neural network on a large amount of data without domain (linguistic) knowledge.</span></div>
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<div style="text-align: justify;">
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">2. Anaphora resolution. There are systems that use ML (and hence DL can be tried?), like </span><span class="qlink_container" style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;"><a class="external_link" data-qt-tooltip="bart-coref.org" href="http://www.bart-coref.org/index.html" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("//qsf.ec.quoracdn.net/-3-images.new_grid.external_link.svg-26-aef78ead48f1f1e2.svg"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration-line: none;" target="_blank">BART coreference system</a></span><span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;"> , but most of the research I have seen so far is based around some sort of rules / syntactic parsing (this presentation is quite useful: </span><span class="qlink_container" style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;"><a class="external_link" data-qt-tooltip="slideshare.net" href="http://www.slideshare.net/kancho/anaphora-resolution" rel="noopener" style="background-attachment: initial; background-clip: initial; background-image: url("//qsf.ec.quoracdn.net/-3-images.new_grid.external_link.svg-26-aef78ead48f1f1e2.svg"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration-line: none;" target="_blank">Anaphora 
resolution</a></span><span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">). There is a vast application area for AR, including sentiment analysis and machine translation (also fact extraction, question-answering etc).</span></div>
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<div style="text-align: justify;">
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">3. Machine translation. Disambiguation, anaphora, object relations, syntax, semantics and more in a single soup. Surely, you can try to model all of these with ML, but commercial systems in MT are more or less done with rules (+ml recently). I'm expecting DL can produce advancements in MT. I'll cite one paper here that uses DL and improves on phrase-based SMT: </span><span class="qlink_container" style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;"><a class="external_link" data-qt-tooltip="arxiv.org" data-tooltip="attached" href="http://arxiv.org/abs/1409.3215" rel="noopener nofollow" style="background-attachment: initial; background-clip: initial; background-image: url("//qsf.ec.quoracdn.net/-3-images.new_grid.external_link.svg-26-aef78ead48f1f1e2.svg"); background-origin: initial; background-position: right 0.3em; background-repeat: no-repeat; background-size: 10.5px; color: #2b6dad; padding-right: 15px; text-decoration-line: none;" target="_blank">[1409.3215] Sequence to Sequence Learning with Neural Networks</a> Update: some recent <a href="https://dmitrykan.blogspot.fi/2017/10/more-fun-with-google-machine-translation.html" target="_blank">fun experiment</a> with DL based machine translation.</span></div>
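<div style="text-align: justify;">
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">To make the first example more concrete, here is a toy rule-based sketch in Python. It is only an illustration under my own assumptions: the four-word lexicon, the clause-splitting rule and the function name are all made up, and a real system would need proper syntactic analysis on top of this.</span></div>

```python
import re

# Tiny hand-made polarity lexicon, purely for illustration.
POLARITY = {"like": 1, "love": 1, "dislike": -1, "hate": -1}


def target_sentiment(text, targets):
    """Give each target the polarity of the closest preceding sentiment
    word inside the same clause (clauses are split on ',', ';' and 'but')."""
    scores = {}
    for clause in re.split(r",|;|\bbut\b", text):
        polarity = 0
        for word in re.findall(r"\w+", clause):
            polarity = POLARITY.get(word.lower(), polarity)
            if word in targets:
                scores[word] = polarity
    return scores


print(target_sentiment("I like phoneX, but dislike phoneY.",
                       {"phoneX", "phoneY"}))
```

<div style="text-align: justify;">
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Even this tiny case needs hand-written rules about clauses and word order, which is the kind of linguistic knowledge a pure DL approach hopes to avoid.</span></div>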
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px; margin-bottom: 0px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">The list can be extended to knowledge bases etc, but I hope I made my point.</span></div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com4tag:blogger.com,1999:blog-7093676993132865793.post-86431589321599962572017-10-29T10:29:00.000+02:002017-10-29T10:33:21.292+02:00More fun with Google machine translation<div dir="ltr" style="text-align: left;" trbidi="on">
Having posted under the <a href="https://dmitrykan.blogspot.fi/search/label/quora-answers" target="_blank">quoranswer tag </a>specifically on machine translation tricks and challenges, and having looked at some fun with Mongolian->Russian translation with Google, I decided to experiment with the Mongolian->English pair. To make this work, you need a Cyrillic keyboard: type only the Russian letter 'а' as input on the Mongolian side. Throughout the text I'll refer to Google Translate as "neural network" or "network", as it is <a href="https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html" target="_blank">known</a> that Google has switched its translation system over to a neural network implementation.<br />
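<br />
If you don't have a Cyrillic keyboard at hand, the inputs can also be generated. Below is a small Python sketch (the function name is my own) that builds strings of the Cyrillic letter 'а' of increasing length, ready to be pasted into the translation box one at a time:<br />

```python
# Cyrillic small letter 'а' (U+0430); it looks like the Latin 'a' but is a
# different character.
CYRILLIC_A = "\u0430"


def probe_inputs(max_len):
    """Return the strings 'а', 'аа', ... up to max_len letters."""
    return [CYRILLIC_A * n for n in range(1, max_len + 1)]


for text in probe_inputs(5):
    print(len(text), text)
```
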
<br />
So let's get going. It all starts rather sane:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-ZhJkLMy_xck/WfWRG8-lktI/AAAAAAAAoGI/z9FL09yb8Bo7VmvKg4XGPhD2PFWLkHbDwCLcBGAs/s1600/Screen%2BShot%2B2017-10-29%2Bat%2B10.27.33.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="398" data-original-width="1600" height="79" src="https://4.bp.blogspot.com/-ZhJkLMy_xck/WfWRG8-lktI/AAAAAAAAoGI/z9FL09yb8Bo7VmvKg4XGPhD2PFWLkHbDwCLcBGAs/s320/Screen%2BShot%2B2017-10-29%2Bat%2B10.27.33.png" width="320" /></a></div>
<br />
<br />
а -> a<br />
аа -> ah<br />
<br />
And as we stack up more letters on the left, we start getting more interesting translations:<br />
<br />
ааа -> Well<br />
аааа -> ahaha<br />
ааааа -> sya<br />
аааааа -> Well<br />
ааааааа -> uh<br />
<br />
and skipping a bit:<br />
<br />
ааааааааа -> that's all<br />
<br />
(at this point you'd imagine that the deep neural network has had enough of you teasing it and wants you to stop. But no).<br />
<br />
аааааааааа -> that's ok<br />
аааааааааааааа -> that's fine<br />
<br />
ааааааааааааааааа -> everything is fine<br />
<br />
ааааааааааааааааааа -> it's a good thing<br />
<br />
<br />
With a few more letters stacked up, the network begs you to stop again, this time threateningly:<br />
<br />
ааааааааааааааааааааааааааааааааааааа -> it's all over<br />
<br />
Then, after having enough of statements, the network starts asking questions.<br />
<br />
ааааааааааааааааааааааааааааааааааааааааа -> is it a good thing?<br />
<br />
and answers its own question:<br />
<br />
аааааааааааааааааааааааааааааааааааааааааа -> it's a good thing<br />
<br />
with a few comments here and there:<br />
<br />
ааааааааааааааааааааааааааааааааааааааааааааааааааа -> a good time<br />
<br />
аааааааааааааааааааааааааааааааааааааааааааааааааааа-> to have a good time<br />
<br />
Eventually, more dictionary entries creep in:<br />
<br />
аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> to a whirlwind<br />
<br />
And, unexpectedly:<br />
<br />
ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> to make a date<br />
аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> to make a living<br />
<br />
Then, the network starts to output:<br />
<br />
ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> to make a dicision<br />
<br />
And begs me to put in some sane words instead of the letter nonsense:<br />
<br />
ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> put your own word<br />
<br />
аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a whistle-blower<br />
<br />
The latter is probably meant as an insult, to add colour to the network's request.<br />
<br />
ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> have a private time in the world<br />
<br />
Notice how general the words are: "private", "time", "world". The outputs are still grammatical and make sense, they are just unlikely to be translations.<br />
<br />
аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a mortal year<br />
<br />
And back to begging again:<br />
<br />
ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> have a kindness in the world<br />
<br />
Again, all my commentary is meant as fun, I'm not intending to (mis)lead you to something here.<br />
<br />
аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a dead dog<br />
<br />
ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> put ā € |<br />
<br />
аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> have a deadline<br />
<br />
And more threats, again:<br />
<br />
аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a hash of you<br />
<br />
аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a mortal beefed up<br />
<br />
ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> have a heartbroker<br />
<br />
A heartbroker? Really? Something new.<br />
<br />
аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a hash of a tree<br />
<br />
ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> to put a lot of light on it<br />
<br />
And finally, the network gets hungry:<br />
<br />
ааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> to have a meal<br />
<br />
And positively concludes:<br />
<br />
аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a date auspicious<br />
<br />
аааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааааа -> a friend of a thousand years<br />
<br />
Hope you had fun reading these, and please try some for yourselves.</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-42879659109951806222017-10-28T09:50:00.001+03:002017-10-28T09:50:57.159+03:00What are some funny Google Translate tricks?<div dir="ltr" style="text-align: left;" trbidi="on">
This is the third quoranswer blog post, answering the <a href="https://www.quora.com/What-are-some-funny-Google-Translate-tricks">question</a> What are some funny Google Translate tricks? I have decided to update the Google translations based on the current situation. I think they are still a lot of fun. Let me know in comments, if you came across some funny translations!<br />
<br />
<div class="qtext_para" style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px; margin-bottom: 1em; padding: 0px;">
<br /></div>
<div class="qtext_para" style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px; margin-bottom: 1em; padding: 0px;">
There used to be a funny, politically coloured trick for Russian->English, where the sense was inverted in translation depending on which presidents' names were used in a positive vs negative context. I can’t reproduce it right now, but GT produces this at the moment:</div>
<div class="qtext_para" style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px; margin-bottom: 1em; padding: 0px;">
Обама не при чём, виноват Путин.</div>
<div class="qtext_para" style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px; margin-bottom: 1em; padding: 0px;">
human: Obama is innocent, Putin is to blame.</div>
<div class="qtext_para" style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px; margin-bottom: 1em; padding: 0px;">
GT: Obama has nothing to do with Putin. (Previously in Aug 4, 2016: "Obama is not to blame, blame Putin.")</div>
<div class="qtext_para" style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px; margin-bottom: 1em; padding: 0px;">
Путин не при чём, виноват Обама</div>
<div class="qtext_para" style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px; margin-bottom: 1em; padding: 0px;">
human: Putin is innocent, Obama is to blame.</div>
<div class="qtext_para" style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px; padding: 0px;">
GT: Putin has nothing to do with Obama's fault. (Previously in Aug 4, 2016: "Putin is not being Obama's fault.")</div>
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-57551659582380454142017-10-24T22:11:00.000+03:002017-10-25T14:07:32.903+03:00What grammatical challenges prevent Google Translate from being more effective?<div dir="ltr" style="text-align: left;" trbidi="on">
Here is one more Quora question on the exciting topic of machine translation and my <a href="https://www.quora.com/What-grammatical-challenges-prevent-Google-Translate-from-being-more-effective-Is-there-a-set-of-broad-grammatical-rules-which-decreases-its-efficacy-How-can-these-challenges-be-overcome-Is-it-possible-to-fully-automate-good-quality-translation/answer/Dmitry-Kan?share=1&srid=CuD">answer</a> to it.<br />
<br />
The question had some sub-questions:<br />
<br />
<ul style="text-align: left;">
<li>Is there a set of broad grammatical rules which decreases its efficacy?</li>
<li>How can these challenges be overcome? Is it possible to fully automate good quality translation?</li>
</ul>
<div>
<br /></div>
<div>
Below is my answer; I hope you find it interesting to learn about machine translation and different language pairs. Note that the translations currently given by Google Translate might differ from the ones below, as these were obtained in 2013. UPD: and they do! See the comments to this post.</div>
<div>
<br /></div>
<div>
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Google is pretty good at modeling close enough language pairs. By close enough I mean languages that share multiple vocabulary units, have similar word order, morphological richness level and other grammatical features.</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Let's pick an example of a pair where Google Translate (GT) is good. The round-trip method (translate a sentence into the other language and then back again) is one way to verify whether the languages are close enough, at least statistically, for GT:</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">(these examples are using GT only, no human interpretation involved)</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">English: I am in a shop.</span><br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Dutch: Ik ben in een winkel.</span><br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">back to English: I'm in a store. (quite ok)</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">English: I danced into the room.</span><br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Dutch: Ik danste in de kamer.</span><br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">back to English: I danced in the room. (preposition issues)</span><br />
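<br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">The round-trip check itself is easy to sketch in Python. The sketch below does not call any real translation service: the translate function is a stand-in of my own that simply looks up the GT outputs quoted above, so for a real experiment you would swap in an actual MT client.</span><br />

```python
# Canned GT outputs copied from the examples above. A real experiment would
# replace `translate` with a call to an actual MT service (not shown here).
CANNED = {
    ("en", "nl", "I am in a shop."): "Ik ben in een winkel.",
    ("nl", "en", "Ik ben in een winkel."): "I'm in a store.",
    ("en", "nl", "I danced into the room."): "Ik danste in de kamer.",
    ("nl", "en", "Ik danste in de kamer."): "I danced in the room.",
}


def translate(text, src, dst):
    """Stand-in translator: look up a canned translation."""
    return CANNED[(src, dst, text)]


def round_trip(text, src, pivot, translate_fn=translate):
    """Translate src -> pivot -> src and return both intermediate results."""
    forward = translate_fn(text, src, pivot)
    back = translate_fn(forward, pivot, src)
    return forward, back


for sentence in ("I am in a shop.", "I danced into the room."):
    forward, back = round_trip(sentence, "en", "nl")
    print(sentence, "|", forward, "|", back)
```

<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Judging how close the result is to the original is deliberately left to a human reader here, as in the examples.</span><br />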
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Let's pick a pair of more unrelated languages (by the way, when we claim the languages are unrelated grammatically, they may also be unrelated semantically or even pragmatically: different languages were created by people to suit their needs at particular moments of history). One such pair is English and Finnish:</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Finnish: Hän on kaupassa.</span><br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">English: He is in the shop.</span><br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Finnish: Hän on myymälä. (roughly the original Finnish sentence)</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">This example has the pronoun hän, which in Finnish is not gender-specific. It has to be resolved based on a larger context than just one sentence: somewhere earlier in the text there should have been a mention of who hän refers to.</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">To conclude this particular example: Google Translate translates at the sentence level, and that is a limitation in itself, one that makes correct pronoun resolution impossible. Resolving pronouns matters if we want to understand how the objects in a text interact.</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Let's pick another example of unrelated languages: English and Russian.</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Russian: Маска бывает правдивее и выразительнее лица.</span><br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">English: The mask is truthful and expressive face. (should have been: The mask can be more truthful and expressive than the face)</span><br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">back to Russian: Маска правдивым и выразительным лицом. (hard to translate, but the meaning roughly: The mask being a truthful and expressive face).</span><br />
<br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">To conclude this example: languages with rich morphology, such as Russian, convey grammatical case through word inflection alone and thus require deeper grammatical analysis, which purely statistical machine translation methods lack no matter how much data has been acquired. There exist methods of combining rules and statistics.</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Another pair and different example:</span><br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">English: Reporters said that IBM has bought Lotus.</span><br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Japanese: 記者は、IBMがロータスを買っていると述べた。</span><br />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">back to English: The reporter said that IBM Lotus are buying.</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Japanese has a "recursive syntax" that renders this English sentence roughly as:</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">Reporters (IBM Lotus has bought) said that.</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">i.e. the verb is syntactically placed after the subject-object pair of a sentence or a sub-sentence (direct / indirect object).</span><br />
<br style="color: #333333; font-family: q_serif, Georgia, Times, "Times New Roman", "Hiragino Kaku Gothic Pro", Meiryo, serif; font-size: 15px; margin-bottom: 0px;" />
<span style="color: #333333; font-family: "q_serif" , "georgia" , "times" , "times new roman" , "hiragino kaku gothic pro" , "meiryo" , serif; font-size: 15px;">To conclude this example: there should be a method of mapping syntactic structures as larger units of the language, and that mapping should be done in a more controlled fashion (i.e. it is hard to derive from pure statistics).</span></div>
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com2tag:blogger.com,1999:blog-7093676993132865793.post-41860523626497235942017-09-23T12:03:00.000+03:002017-09-23T12:05:57.915+03:00What's a good topic for a bachelor's thesis in Sentiment Analysis?<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 style="text-align: left;">
Preamble</h3>
<div style="text-align: justify;">
Over the past few months (soon close to a year) you, my readers, might have noticed a decline in the frequency of my blogging. There are a few reasons, including practical ones (lack of time), but the two most important are:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
1. Blogger has not developed much as a tool over time. It probably remains relatively popular and brings in some ad money, so Google has not shut it down. Moving over to medium.com might be a better idea in order to produce visually "shinier" posts and actually enjoy writing.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
2. There are other interesting and more interactive ways to share one's knowledge. One that I personally like is quora.com. The site offers a reverse model compared to blogging: you answer questions. This way you ensure that at least the questioner will read your answer, and other respondents might as well. The rating of your answers is another component: it contributes to your statistics and earns you something analogous to payment, namely credits that you can later use, for instance, to boost your answers to a larger audience. But I would say the latter is of lesser importance to me.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Since I have never actually figured out whether Quora allows you to read posts without being registered, re-posting my answers here from time to time could also be a good way to keep this blog alive.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
So here we go (slightly edited version):</div>
<br />
<h3 style="text-align: left;">
What's a good topic for a bachelor's thesis in Sentiment Analysis?</h3>
<div>
<div style="text-align: justify;">
Apart from applying deep neural networks to sentiment analysis, which is exciting in itself, another topic that is exciting from both a research and a practical perspective is sarcasm detection. It goes somewhat beyond sentiment analysis per se and into opinion mining. Sentiment analysis precision and recall are affected by sarcastic posts. This is because sarcastic posts tend to look positive on the surface (at least to conventional algorithms, whether ML-based or rule-based) but carry a negative meaning.</div>
<div style="text-align: justify;">
There are interesting situations that arise as a result of failing to recognize sarcasm. Borrowing from [1]:</div>
<div style="text-align: justify;">
<br /></div>
<b>User 1 tweet:</b><br />
<br />
You are doing great! Who could predict heavy travel between #Thanksgiving and #NewYearsEve. And bad cold weather in Dec! Crazy!<br />
<br />
<b>Response from a major U.S. Airline:</b><br />
<br />
We #love the kind words! Thanks so much.<br />
<b><br /></b>
<b>User 1:</b><br />
<br />
wow, just wow, I guess I should have #sarcasm<br />
<b><br /></b>
<b>User 2:</b><br />
<br />
Ahhh..**** reps. Just had a stellar experience w them at Westchester, NY last week. #CustomerSvcFail<br />
<b><br /></b>
<b>Response from a major U.S. Airline:</b><br />
<br />
Thanks for the shout-out Bonnie. We’re happy to hear you had a #stellar experience flying with us. Have a great day.<br />
<b><br /></b>
<b>User 2:</b><br />
<br />
You misinterpreted my dripping sarcasm. My experience at Westchester was 1 of the worst I’ve had with ****. And there are many.<br />
<br />[1] <a href="http://dl.acm.org/author_page.cfm?id=81502803252&coll=DL&dl=ACM&trk=0&cfid=718000135&cftoken=28872379">Rajadesingan</a> A. et al. <a href="http://dl.acm.org/citation.cfm?id=2684822.2685316&coll=DL&dl=GUIDE"><b>Sarcasm Detection on Twitter: A Behavioral Modeling Approach</b></a></div>
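<br />
<div style="text-align: justify;">
To see why surface-level methods get fooled by such tweets, consider a toy lexicon-based scorer. This is only a sketch: the word lists and function names are invented for illustration and do not come from [1] or from any real sentiment library. It counts positive and negative words and prints the difference.</div>
<br />

```shell
#!/bin/sh
# Toy lexicon-based polarity scorer (illustrative only; word lists are made up).
POSITIVE='great|love|stellar|happy|kind'
NEGATIVE='bad|worst|awful|terrible'

# count_matches REGEX TEXT
# Prints the number of case-insensitive whole-word matches of REGEX in TEXT.
count_matches() {
  printf '%s\n' "$2" | grep -o -i -w -E "$1" | wc -l | tr -d ' '
}

# polarity TEXT
# Prints (positive matches - negative matches); > 0 reads as "positive".
polarity() {
  pos=$(count_matches "$POSITIVE" "$1")
  neg=$(count_matches "$NEGATIVE" "$1")
  printf '%s\n' "$((pos - neg))"
}

# User 2's sarcastic tweet and the follow-up from the dialogue above.
polarity "Ahhh..**** reps. Just had a stellar experience w them at Westchester, NY last week. #CustomerSvcFail"
polarity "You misinterpreted my dripping sarcasm. My experience at Westchester was 1 of the worst I've had with ****. And there are many."
```

<div style="text-align: justify;">
The first call prints 1, because "stellar" is the only lexicon word in the tweet and the hashtag is not matched as a whole word, so the sarcastic complaint looks positive. The second call prints -1, which is what the user actually meant.</div>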
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-13115176723172180012016-10-09T11:51:00.000+03:002016-10-09T17:54:05.612+03:00Luke 6.2.1 release and all things open source<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 style="text-align: left;">
Release</h3>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Indeed, <a href="https://github.com/DmitryKey/luke/releases/tag/luke-6.2.1" target="_blank">luke 6.2.1</a> for lucene 6.2.1 is out of the oven. This is a proud moment for <a href="https://twitter.com/moco_beta" target="_blank">Tomoko Uchida</a>, my co-committer, who has been a release manager for the first time. Congrats, Tomoko!</span></div>
<br />
<h3 style="text-align: left;">
Community</h3>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">As luke gets more and more stargazers on github (520 at the time of this writing), I tend to glance over <a href="https://github.com/DmitryKey/luke/stargazers" target="_blank">the list of them</a>, which sometimes makes my day. But beyond that, and more importantly, the list outlines the community of Lucene / Solr / Elasticsearch users and developers who hopefully enjoy using luke too. </span></div>
<br />
<h3 style="text-align: left;">
Big names on user list</h3>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Having access to the stats of the luke repo gives insight into who is talking about luke and when. This time, it is PayPal Engineering. Here is their nice technical writeup on indexing lots of data in Elasticsearch and on using luke in the field to optimize the lucene index data structures: <a href="https://www.paypal-engineering.com/2016/08/10/powering-transactions-search-with-elastic-learnings-from-the-field/">https://www.paypal-engineering.com/2016/08/10/powering-transactions-search-with-elastic-learnings-from-the-field/</a></span></div>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
London Lucene/Solr hackday</h3>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">A hackday is an amazing way to jump out of the routine and think big: what can be improved in the search land of Lucene / Solr technology and tooling? It was great to see that luke was picked up as one of the topics at the Lucene / Solr hackday in London: <a href="https://github.com/flaxsearch/london-hackday-2016">https://github.com/flaxsearch/london-hackday-2016</a>. And here it is: Marple, a browser-driven explorer for lucene indexes: <a href="https://github.com/flaxsearch/marple">https://github.com/flaxsearch/marple</a>. Go check it out.</span></div>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: left;">
New contributors to luke</h3>
<div style="text-align: justify;">
<span style="font-family: "arial" , "helvetica" , sans-serif;">Tomoko and I have been actively promoting luke on various occasions, such as <a href="https://www.youtube.com/watch?v=v_22bdxdDc0">Lucene / Solr Revolution 2015</a> and <a href="https://www.youtube.com/watch?list=PLGeM09tlguZTaS5FNoJGYEohaubtIvErS&v=fQAAzpk4oQ4#t=392">ApacheCon 2015</a>. And of course on twitter. Recently <a href="https://twitter.com/fhopf">Florian Hopf</a> has become active in sending pull requests to improve luke and fix various nagging issues. Welcome!</span></div>
<div>
<br /></div>
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-52724094114032301002016-04-13T11:00:00.000+03:002016-04-13T11:00:36.798+03:00Luke 6.0 has been released<div dir="ltr" style="text-align: left;" trbidi="on">
#luke 6.0 has been released. Major upgrade to #lucene 6.0 api: <a href="https://github.com/DmitryKey/luke/releases/tag/luke-6.0.0">https://github.com/DmitryKey/luke/releases/tag/luke-6.0.0</a><br />
<br />
<br />
There are other interesting features cooking, like access to DocValues: <a href="https://github.com/DmitryKey/luke/pull/53">https://github.com/DmitryKey/luke/pull/53</a><br />
<br />
If you feel like contributing, either by code or documentation, feel free to join the project:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<img border="0" src="https://3.bp.blogspot.com/-gtkzvofERPM/UoM11zM23BI/AAAAAAAAJko/9Nry5SVPLzky05qD8NP6t7XHySbaK9CLA/s1600/luke-big.gif" /><a href="https://github.com/DmitryKey/luke" target="_blank">https://github.com/DmitryKey/luke</a></div>
<br /></div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-48019497416904430482015-12-30T22:10:00.000+02:002015-12-30T22:12:00.556+02:00Apache Solr Enterprise Search Server -- Third edition<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<span style="font-family: verdana, arial, helvetica, sans-serif;"><span style="background-color: white;">This year gave me a chance to be a technical reviewer of a book on search engines. The title is </span></span><span style="text-align: left;"><span style="font-family: verdana, arial, helvetica, sans-serif;"><i><a href="http://www.amazon.com/gp/product/B00YCDWG80?ref_=cm_rdp_product" target="_blank">Apache Solr Enterprise Search Server</a></i> and it has come out in its third edition. The first edition, back in 2010, helped me to start thinking in a NoSQL way, even though SQL was literally everywhere (well, and still is). It does take a bit of mind warping to think beyond relational database lingo and data modelling, and in my opinion doing so is rather useful for your career as a software engineer.</span></span></div>
<div style="text-align: justify;">
<span style="text-align: left;"><span style="font-family: verdana, arial, helvetica, sans-serif; font-size: x-small;"><br /></span></span></div>
<div style="text-align: justify;">
<span style="text-align: left;"><span style="font-family: verdana, arial, helvetica, sans-serif; font-size: x-small;"><a href="http://4.bp.blogspot.com/-OUJPEnnm9gQ/VoQ5JXNwveI/AAAAAAAAfBc/gHnBVoMFUYM/s1600/Screen%2BShot%2B2015-12-30%2Bat%2B22.05.21.png" imageanchor="1"><img border="0" src="http://4.bp.blogspot.com/-OUJPEnnm9gQ/VoQ5JXNwveI/AAAAAAAAfBc/gHnBVoMFUYM/s400/Screen%2BShot%2B2015-12-30%2Bat%2B22.05.21.png" /></a></span></span></div>
<div style="text-align: justify;">
<span style="background-color: white; font-family: verdana, arial, helvetica, sans-serif; font-size: x-small;"><br /></span></div>
<div style="text-align: justify;">
<span style="background-color: white; font-family: verdana, arial, helvetica, sans-serif; font-size: x-small;"><br /></span></div>
<div style="text-align: justify;">
<span style="font-family: verdana, arial, helvetica, sans-serif;"><span style="background-color: white;">Here goes my review on Amazon:</span></span></div>
<div style="text-align: justify;">
<span style="background-color: white; font-family: verdana, arial, helvetica, sans-serif;"><br /></span></div>
<blockquote class="tr_bq" style="text-align: justify;">
<span style="background-color: white; font-family: verdana, arial, helvetica, sans-serif;">This book in its first edition was the first one around back in 2010, that covered Apache Solr in as much detail as I needed to get into the topic quickly. This third edition includes revisions for Apache Solr 5, notoriously covering things like Solr admin page, SolrCloud, scaling the search engine for large amount of documents, text analysis, indexing, search and even map-reducing your Solr index! In particular, throwing a MapReduce task at large-scale indexing task has been hard / unclear in the past and now it is available to any user of Apache Solr out of the box. This makes books like this immensely important to not waste one's time in looking around for useful bits of information scattered here and there. More importantly, authors of the book are directly involved into the project, either as Apache Solr / Lucene committers or active practitioners and developers of the technology. So I recommend this book for an entry-level and mid-level search engineers that look into getting their hands dirty with search problems and / or improving on the previously untapped areas of the search engine world.</span></blockquote>
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-76189885273534569782015-10-11T19:48:00.001+03:002015-10-11T19:48:38.171+03:00[ANNOUNCE] Luke 5.3.0 released: naturally runs on Java 8<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="background-color: white; box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 23.2727px; margin-bottom: 16px;">
This release runs on Java 8 and does not run on Java 7.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-gtkzvofERPM/UoM11zM23BI/AAAAAAAAJko/9Nry5SVPLzk/s1600/luke-big.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-gtkzvofERPM/UoM11zM23BI/AAAAAAAAJko/9Nry5SVPLzk/s1600/luke-big.gif" /></a></div>
<div style="background-color: white; box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 23.2727px; margin-bottom: 16px;">
<br /></div>
<div style="background-color: white; box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 23.2727px; margin-bottom: 16px;">
This release includes a number of pull requests and github issues. Worth mentioning:</div>
<div style="background-color: white; box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 23.2727px; margin-bottom: 16px;">
<a class="issue-link" href="https://github.com/DmitryKey/luke/pull/38" style="background-color: transparent; box-sizing: border-box; color: #4078c0; text-decoration: none;" title="#37 Updating to Lucene 5.3">#38</a> upgrade to 5.3.0 itself</div>
<div style="background-color: white; box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 23.2727px;">
<a class="issue-link" href="https://github.com/DmitryKey/luke/pull/28" style="background-color: transparent; box-sizing: border-box; color: #4078c0; text-decoration: none;" title="Added LUKE_PATH env variable to luke.sh">#28</a> Added LUKE_PATH env variable to luke.sh<br style="box-sizing: border-box;" /><a class="issue-link" href="https://github.com/DmitryKey/luke/pull/35" style="background-color: transparent; box-sizing: border-box; color: #4078c0; text-decoration: none;" title="Added copy, cut, paste etc. shortcuts, using Mac command key.">#35</a> Added copy, cut, paste etc. shortcuts, using Mac command key<br style="box-sizing: border-box;" /><a class="issue-link" href="https://github.com/DmitryKey/luke/pull/34" style="background-color: transparent; box-sizing: border-box; color: #4078c0; text-decoration: none;" title="Fixed lastAnalyzer retrieval">#34</a> Fixed lastAnalyzer retrieval (this feature remembers the last used analyzer on the Search tab)<br style="box-sizing: border-box;" /><a class="issue-link" href="https://github.com/DmitryKey/luke/issues/31" style="background-color: transparent; box-sizing: border-box; color: #4078c0; text-decoration: none;" title="200 stars">#31</a> 200 stargazers on github (by the time of this release the number crossed 260). Luke community is growing.</div>
<div style="background-color: white; box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 23.2727px;">
<br /></div>
<div style="background-color: white; box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 23.2727px;">
Everybody is welcome to contribute. If you feel like you care about search / indexing or would like to get deeper with Apache Lucene, go ahead and pick a ticket: <a href="https://github.com/DmitryKey/luke/issues">https://github.com/DmitryKey/luke/issues</a></div>
<div style="background-color: white; box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 23.2727px;">
And, don't be afraid, we do not have any complaint departments:</div>
<div style="background-color: white; box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 23.2727px;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://habrastorage.org/files/ce0/650/4a6/ce06504a68b2407e8984b93b68609bac.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://habrastorage.org/files/ce0/650/4a6/ce06504a68b2407e8984b93b68609bac.jpg" width="320" /></a></div>
<div style="background-color: white; box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 23.2727px;">
<br /></div>
<div style="background-color: white; box-sizing: border-box; color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 23.2727px;">
All you need is your favourite beverage and a good debugger.</div>
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-16366511477487173112015-07-08T11:42:00.000+03:002015-07-08T11:42:13.703+03:00[ANNOUNCE] Luke 5.2.0 released<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px; margin-bottom: 16px;">
<span style="line-height: 25.6000003814697px;">This is a major release supporting lucene / solr 5.2.0. Download the zip here:</span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://raw.githubusercontent.com/DmitryKey/luke/pivot-luke/src/main/resources/img/luke-big.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://raw.githubusercontent.com/DmitryKey/luke/pivot-luke/src/main/resources/img/luke-big.gif" /></a></div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px; margin-bottom: 16px;">
<br /></div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px; margin-bottom: 16px;">
<a href="https://github.com/DmitryKey/luke/releases/tag/luke-5.2.0" style="line-height: 25.6000003814697px;" target="_blank">https://github.com/DmitryKey/<wbr></wbr>luke/releases/tag/luke-5.2.0</a></div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px; margin-bottom: 16px;">
It supports elasticsearch 1.6.0 (lucene 4.10.4)</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px; margin-bottom: 16px;">
Issues fixed:<br /><a href="https://github.com/DmitryKey/luke/issues/20" style="color: #4078c0; text-decoration: none;" target="_blank" title="Reconstructing non-stored fields">#20</a> Added support for reconstructing field values of indexed and not stored fields, that do not expose positions.</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px;">
Pull requests:<br /><a href="https://github.com/DmitryKey/luke/pull/23" style="color: #4078c0; text-decoration: none;" target="_blank" title="Elasticsearch support and Shade plugin for assembly">#23</a> Elasticsearch support and Shade plugin for assembly<br /><a href="https://github.com/DmitryKey/luke/pull/26" style="color: #4078c0; text-decoration: none;" target="_blank" title="added .gitignore to project">#26</a> added .gitignore to project<br /><a href="https://github.com/DmitryKey/luke/pull/27" style="color: #4078c0; text-decoration: none;" target="_blank" title="Lucene 5x support">#27</a> Lucene 5x support<br /><a href="https://github.com/DmitryKey/luke/pull/28" style="color: #4078c0; text-decoration: none;" target="_blank" title="Added LUKE_PATH env variable to luke.sh">#28</a> Added LUKE_PATH env variable to luke.sh<br /><a href="https://github.com/DmitryKey/luke/pull/30" style="color: #4078c0; text-decoration: none;" target="_blank" title="Luke 5.2">#30</a> Luke 5.2</div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px;">
<br /></div>
<div style="color: #333333; font-family: 'Helvetica Neue', Helvetica, 'Segoe UI', Arial, freesans, sans-serif; font-size: 16px; line-height: 25.6000003814697px;">
<span style="line-height: 25.6000003814697px;">I'd like to highlight the contribution of </span><a href="https://twitter.com/moco_beta" style="line-height: 25.6000003814697px;" target="_blank">Tomoko Uchida</a>, who has recently been very active in sending pull requests, including the upgrade to lucene 5.x and the first version of the <a href="https://github.com/DmitryKey/luke/tree/pivot-luke" target="_blank">Apache Pivot based luke ui</a>.</div>
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com1tag:blogger.com,1999:blog-7093676993132865793.post-58686311383606212702015-04-15T23:57:00.001+03:002015-04-20T15:28:53.128+03:00Luke gets support for Elasticsearch indices<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
That is that, really. The <a href="https://simpsora.wordpress.com/2014/05/06/using-luke-with-elasticsearch/" target="_blank">so</a> long <a href="http://stackoverflow.com/questions/24233193/elasticsearch-and-luke" target="_blank">awaited</a> proper support for elasticsearch indices.<br />
<br />
<blockquote class="twitter-tweet" lang="ru">
<a href="https://twitter.com/hashtag/luke?src=hash">#luke</a> gets support for <a href="https://twitter.com/hashtag/elasticsearch?src=hash">#elasticsearch</a> <a href="https://twitter.com/hashtag/lucene?src=hash">#lucene</a> indices! Just tested with es 1.5.0 <a href="https://t.co/3hsilTEaAR">https://t.co/3hsilTEaAR</a> <a href="https://twitter.com/hashtag/apachecon?src=hash">#apachecon</a> cc .<a href="https://twitter.com/elastic">@elastic</a><br />
— Dmitry Kan (@DmitryKan) <a href="https://twitter.com/DmitryKan/status/588010545372102657">14 апреля 2015</a></blockquote>
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script><br />
<br />
<blockquote class="twitter-tweet" lang="ru">
<a href="https://twitter.com/yevhen">@yevhen</a> <a href="https://twitter.com/elastic">@elastic</a> :) thanks!<br />
— Dmitry Kan (@DmitryKan) <a href="https://twitter.com/DmitryKan/status/588062479046983680">14 апреля 2015</a></blockquote>
<script async="" charset="utf-8" src="//platform.twitter.com/widgets.js"></script><br />
<br />
Luke already supported Apache Solr indices. Why not Elasticsearch? The reason was that ES uses its own SPI for the postings format. If you tried to open an Elasticsearch index with luke before, you'd get something like:<br />
<br />
<blockquote class="tr_bq">
<pre style="background-color: white; border: 1px solid rgb(237, 237, 237); color: #666666; font-family: Consolas, Monaco, 'Lucida Console', monospace; font-size: 0.857142857rem; font-style: italic; line-height: 1.714285714; margin-bottom: 1.714285714rem; margin-top: 1.714285714rem; overflow: auto; padding: 1.714285714rem; vertical-align: baseline;">A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'es090' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [Lucene40, Lucene41]</pre>
</blockquote>
<br />
<br />
The biggest issue with supporting a custom SPI is that you'd need to hack the luke jar binary and add the ES SPI. I bet that is not what you would want to spend your time on.<br />
<br />
With the excellent pull request by <a href="https://github.com/apakulov" target="_blank">apakulov</a> <a href="https://github.com/DmitryKey/luke/pull/23">https://github.com/DmitryKey/luke/pull/23</a>, luke uses the <a href="https://maven.apache.org/plugins/maven-shade-plugin/" target="_blank">shade maven plugin</a>, which does all the magic: it updates the in-binary META-INF/services file with the following entries:<br />
<br />
<pre class="bash" name="code">org.elasticsearch.index.codec.postingsformat.Elasticsearch090PostingsFormat
org.elasticsearch.search.suggest.completion.Completion090PostingsFormat
org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat
</pre>
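<br />
One way to double-check the result is to print the service file straight from the built jar. This is a minimal sketch: the function name and the jar path in the usage comment are made up for illustration, and it assumes unzip is installed. The file is named after the SPI type from the error message above, following the usual META-INF/services convention.<br />
<br />

```shell
#!/bin/sh
# list_postings_formats JAR
# Prints the PostingsFormat implementations registered in JAR via
# META-INF/services (one fully qualified class name per line).
list_postings_formats() {
  unzip -p "$1" META-INF/services/org.apache.lucene.codecs.PostingsFormat
}

# Usage (hypothetical jar path):
#   list_postings_formats target/luke-with-deps.jar
```

If the Elasticsearch postings formats listed above show up in the output, luke should be able to resolve names like 'es090' when opening the index.<br />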
<br /></div>
<br />
Currently this is available on luke master: <a href="https://github.com/DmitryKey/luke">https://github.com/DmitryKey/luke</a> and a pre-release: <a href="https://github.com/DmitryKey/luke/releases/tag/luke-4.10.4-field-reconstruction">https://github.com/DmitryKey/luke/releases/tag/luke-4.10.4-field-reconstruction</a></div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-31745809603863598602015-03-21T13:24:00.000+02:002015-03-21T13:24:24.399+02:00Flexible run-time logging configuration in Apache Solr 4.10.x<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="comment-content wiki-content">
<div style="text-align: justify;">
In a multi-<span style="background-color: transparent; font-size: 11.0pt; line-height: 16.0pt;">shard setup it is useful to be able to change the log level at runtime without going to each and every shard's admin page.</span></div>
<br />
<div style="text-align: justify;">
For example, we can set the logging to WARN level during massive posting sessions and back to INFO when serving user queries.</div>
<br />
<div style="text-align: justify;">
In solr 4.10.2 these one-liners do the trick:</div>
<br />
<pre class="bash" name="code"># set logging level to WARN,</pre>
<pre class="bash" name="code"># saves disk space and speeds up massive posting </pre>
<pre class="bash" name="code">curl -s http://localhost:8983/solr/admin/info/logging \</pre>
<pre class="bash" name="code"> --data-binary "set=root:WARN&wt=json" </pre>
<pre class="bash" name="code"> </pre>
<pre class="bash" name="code"># set logging level to INFO,</pre>
<pre class="bash" name="code"># suitable for serving the user queries </pre>
<pre class="bash" name="code">curl -s http://localhost:8983/solr/admin/info/logging \</pre>
<pre class="bash" name="code"> --data-binary "set=root:INFO&wt=json"
</pre>
<br />
<div style="text-align: justify;">
Solr responds with a JSON document containing the current status of each configured logger.</div>
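<br />
<div style="text-align: justify;">
To avoid repeating the call by hand for every shard, the one-liners can be wrapped in a small loop. The sketch below is only an illustration: the shard URLs in SHARDS, the set_log_level function and the DRY_RUN switch are assumptions made up for this example. Only the endpoint and the set=root:LEVEL payload come from the commands above. It prints the curl commands by default; run it with DRY_RUN=0 to actually send them.</div>
<br />

```shell
#!/bin/sh
# Hypothetical shard base URLs: replace with your own hosts and ports.
SHARDS="http://localhost:8983/solr http://localhost:8984/solr"

# Dry run by default: print the curl commands instead of sending them.
DRY_RUN=${DRY_RUN:-1}

# set_log_level LEVEL
# Sets the root logger to LEVEL (e.g. WARN or INFO) on every shard in SHARDS.
set_log_level() {
  level=$1
  for shard in $SHARDS; do
    url="$shard/admin/info/logging"
    payload="set=root:$level&wt=json"
    if [ "$DRY_RUN" = "1" ]; then
      printf 'curl -s %s --data-binary "%s"\n' "$url" "$payload"
    else
      curl -s "$url" --data-binary "$payload"
    fi
  done
}

# Before a massive posting session:
set_log_level WARN
```

<div style="text-align: justify;">
Calling set_log_level INFO afterwards switches every shard back for serving user queries.</div>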
</div>
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-51358599995121812882015-03-16T14:40:00.001+02:002015-03-16T18:35:56.390+02:00Luke keeps getting updates and now on Apache Pivot<div dir="ltr" style="text-align: left;" trbidi="on">
Originally developed for fun and profit by Andrzej Bialecki, the lucene toolbox <a href="https://github.com/dmitrykey/luke" target="_blank">luke</a> continues to be developed. Its releases are published at: <a href="https://github.com/DmitryKey/luke/releases">https://github.com/DmitryKey/luke/releases</a><br />
<br />
<br />
Most recently <a href="https://twitter.com/moco_beta" target="_blank">Tomoko Uchida</a> has contributed to the effort of porting Luke to Apache Pivot, an Apache License 2.0 friendly GUI framework. A new branch has been created to host this work:<br />
<br />
<a href="https://github.com/DmitryKey/luke/tree/pivot-luke">https://github.com/DmitryKey/luke/tree/pivot-luke</a><br />
<br />
Currently supported Lucene: 4.10.4.<br />
<br />
It is far from complete, but you can already:<br />
<br />
<ul style="text-align: left;">
<li>open your Lucene index and check its metadata</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-1TcGd3R-hvk/VQbNSXJpwmI/AAAAAAAAT0Q/sfQaxHgT4fg/s1600/Selection_178.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-1TcGd3R-hvk/VQbNSXJpwmI/AAAAAAAAT0Q/sfQaxHgT4fg/s1600/Selection_178.png" height="285" width="320" /></a></div>
<br />
<ul style="text-align: left;">
<li>page through the documents and analyze fields</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-53foXuFVje4/VQbN4V9S-HI/AAAAAAAAT0g/Y9kjT8Umyf0/s1600/Selection_180.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-53foXuFVje4/VQbN4V9S-HI/AAAAAAAAT0g/Y9kjT8Umyf0/s1600/Selection_180.png" height="285" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<br />
<ul style="text-align: left;">
<li>search the index</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-URTOIdgOE-4/VQbOYpuBEpI/AAAAAAAAT0o/LhkfYLaArLQ/s1600/Selection_181.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-URTOIdgOE-4/VQbOYpuBEpI/AAAAAAAAT0o/LhkfYLaArLQ/s1600/Selection_181.png" height="283" width="320" /></a></div>
<br />
We would appreciate it if you could test the Pivot-based Luke and give your <a href="https://github.com/DmitryKey/luke/issues" target="_blank">feedback</a>. <br />
<br /></div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-38320838802477427392014-11-17T16:39:00.001+02:002015-03-23T11:24:11.143+02:00Lightweight Java Profiler and Interactive svg Flame Graphs<div dir="ltr" style="text-align: left;" trbidi="on">
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
A colleague of mine has just returned from AWS re:Invent, full of excitement about the new AWS technologies, so I started watching the published videos of the talks. One of the first technical ones I picked was <a href="https://www.youtube.com/watch?v=7Cyd22kOqWc&list=UUd6MoB9NC6uYN2grvUNT-Zg" target="_blank">Performance Tuning Amazon EC2 Instances</a> by Brendan Gregg of Netflix. From Brendan's talk I learnt about the Lightweight Java Profiler (LJP) and about visualizing stack traces with Flame Graphs.<br />
<br />
I'm quite 'obsessed' with monitoring and performance tuning based on it.</div>
<div style="text-align: justify;">
Monitoring your applications is definitely the way to:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
1. Get performance numbers inside your company, share them and let people tell stories about them.</div>
<div style="text-align: justify;">
2. Tune the system where you see the bottleneck, then measure again.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In this post I would like to share a shell script that produces a colourful and interactive flame graph out of a stack trace of your Java application. This may be useful in a variety of ways, from an impressive graph for your slides to informed tuning of your code / system.</div>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
Components to build / install</h3>
<div style="text-align: justify;">
This was run on Ubuntu 12.04 LTS.</div>
<div style="text-align: justify;">
Check out the <a href="http://lightweight-java-profiler.googlecode.com/" target="_blank">Lightweight Java Profiler</a> project source code and build it:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre class="bash" name="code">svn checkout \</pre>
<pre class="bash" name="code"> http://lightweight-java-profiler.googlecode.com/svn/trunk/ \</pre>
<pre class="bash" name="code"> lightweight-java-profiler-read-only </pre>
<pre class="bash" name="code"> </pre>
<pre class="bash" name="code">cd lightweight-java-profiler-read-only/
make BITS=64 all
</pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
(omit the BITS parameter if you want to build for a 32-bit platform).</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
A successful build produces the liblagent.so binary, which is then attached to your Java process as an agent.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br />
Next, clone the <a href="https://github.com/brendangregg/FlameGraph" target="_blank">FlameGraph</a> github repository:<br />
<br /></div>
<div style="text-align: justify;">
<pre class="bash" name="code">git clone https://github.com/brendangregg/FlameGraph.git</pre>
</div>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
You don't need to build anything: it is a collection of shell / Perl scripts that will do the magic.</div>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
Configuring the LJP agent on your java process</h3>
<div style="text-align: justify;">
The next step is to configure the LJP agent to report stats from your Java process. I have picked a Solr instance running under Jetty. Here is how I configured it in my Solr startup script:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre class="bash" name="code">java \
  -agentpath:/.../lightweight-java-profiler-read-only/build-64/liblagent.so \
  -Dsolr.solr.home=cores \
  -jar start.jar</pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Running the script should start the Solr instance as usual, and the agent will log stack traces to traces.txt.</div>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: justify;">
Generating a Flame graph</h3>
<div style="text-align: justify;">
In order to produce a flame graph out of the LJP stack trace you will need to perform the following:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
1. Convert the LJP stack trace into a collapsed form that FlameGraph understands.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
2. Run the flamegraph.pl tool on the collapsed stack trace to produce the SVG file.</div>
<div style="text-align: justify;">
<br /></div>
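<div style="text-align: justify;">
To give an idea of what the collapsed form in step 1 looks like: flamegraph.pl reads one line per unique stack, with the frames joined by semicolons (root frame first), followed by a space and the number of samples. The Ruby sketch below folds a few made-up stacks into that format. The sample data and the collapse helper are assumptions for illustration only; parsing the real LJP traces file is the job of stackcollapse-ljp.awk.</div>

```ruby
# Made-up sampled stacks, each listed root frame first.
samples = [
  ["java.lang.Thread.run", "Indexer.addDocument"],
  ["java.lang.Thread.run", "Indexer.addDocument"],
  ["java.lang.Thread.run", "Searcher.search"],
  ["org.eclipse.jetty.start.Main.main"]
]

# Hypothetical helper: counts identical stacks and renders
# one "frame;frame;frame count" line per unique stack.
def collapse(samples)
  counts = Hash.new(0)
  samples.each { |frames| counts[frames.join(";")] += 1 }
  counts.sort.map { |stack, count| "#{stack} #{count}" }
end

collapsed_lines = collapse(samples)
puts collapsed_lines
```

<div style="text-align: justify;">
The wider a rectangle in the resulting graph, the larger the sample count of the stacks that pass through that frame.</div>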
<div style="text-align: justify;">
<br />
I have written a shell script that will do this for you.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre class="bash" name="code">#!/bin/sh
# Usage: pass the LJP traces file as the first argument.
# Change this variable to point to your FlameGraph directory.
FLAME_GRAPH_HOME=/home/dmitry/tools/FlameGraph
LJP_TRACES_FILE="${1}"
FILENAME=$(basename "$LJP_TRACES_FILE")
DIRNAME=$(dirname "$LJP_TRACES_FILE")
# traces.txt -> traces_collapsed.txt and traces.svg, next to the input file
LJP_TRACES_FILE_COLLAPSED="$DIRNAME/${FILENAME%.*}_collapsed.${FILENAME##*.}"
FLAME_GRAPH="$DIRNAME/${FILENAME%.*}.svg"
# collapse the LJP stack trace
"$FLAME_GRAPH_HOME/stackcollapse-ljp.awk" "$LJP_TRACES_FILE" > "$LJP_TRACES_FILE_COLLAPSED"
# create a flame graph
"$FLAME_GRAPH_HOME/flamegraph.pl" "$LJP_TRACES_FILE_COLLAPSED" > "$FLAME_GRAPH"
</pre>
</div>
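<div style="text-align: justify;">
The parameter expansions in the script are compact, so here is a sketch of the same file name derivation in Ruby, for an input file that has an extension. The function name and the example path are assumptions made for illustration; nothing is read from or written to disk.</div>

```ruby
# Hypothetical helper mirroring the script's output file names.
def flame_graph_paths(traces_file)
  dir  = File.dirname(traces_file)         # like $(dirname ...)
  ext  = File.extname(traces_file)         # ".txt", the part matched by ${FILENAME##*.}
  stem = File.basename(traces_file, ext)   # "traces", like ${FILENAME%.*}
  {
    collapsed: File.join(dir, "#{stem}_collapsed#{ext}"),
    svg:       File.join(dir, "#{stem}.svg")
  }
end

paths = flame_graph_paths("/tmp/ljp/traces.txt")
puts paths[:collapsed]
puts paths[:svg]
```

<div style="text-align: justify;">
Both output files end up next to the input file, so a traces.txt in a scratch directory keeps everything in one place.</div>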
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br />
And here is the flame graph of my Solr instance under the indexing load.<br />
<br />
<br />
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-_mVFepztpfA/VGn7tvgIBHI/AAAAAAAARhc/9jGOXPpyoPs/s1600/traces.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-_mVFepztpfA/VGn7tvgIBHI/AAAAAAAARhc/9jGOXPpyoPs/s1600/traces.png" height="400" width="287" /></a></div>
<div style="text-align: justify;">
You can read this diagram bottom-up: the lowest level is the entry-point class that starts the application. Then we see that, CPU-wise, two methods take most of the time: org.eclipse.jetty.start.Main.main and java.lang.Thread.run.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This SVG diagram is in fact interactive: load it in a browser and click on the rectangles of the methods you would like to explore further. I clicked on the <br />
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd rectangle and drilled down into it:</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-IN_JJr-pw04/VGoGebxXxTI/AAAAAAAARho/DML0W-fqCl4/s1600/Selection_158.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-IN_JJr-pw04/VGoGebxXxTI/AAAAAAAARho/DML0W-fqCl4/s1600/Selection_158.png" height="182" width="320" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
It is this easy to set up a CPU performance check for your Java program. Remember to monitor before tuning your code, and wear a helmet.<br />
<br /></div>
</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-81793193643699832472014-11-14T09:32:00.000+02:002014-11-14T09:32:15.165+02:00Ruby pearls and gems for your daily routine coding tasks<div dir="ltr" style="text-align: left;" trbidi="on">This is a list of ruby pearls and gems, that help me in my daily routine coding tasks.<br />
<br />
<div class="separator" style="clear: both; text-align: center;"><a href="https://www.ruby-lang.org/en/"><br />
<a href="https://3.bp.blogspot.com/-0odeUYBDsWU/VGWu2tRsOXI/AAAAAAAARfY/TGMxC1rK_9c/s1600/Selection_154.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="82" src="https://3.bp.blogspot.com/-0odeUYBDsWU/VGWu2tRsOXI/AAAAAAAARfY/TGMxC1rK_9c/s320/Selection_154.png" width="320" /></a></a></div><br />
<br />
<br />
1. Retain only unique elements in an array:<br />
<br />
<pre class="ruby" name="code">a = [1, 1, 2, 3, 4, 4, 5]
a = a.uniq # => [1, 2, 3, 4, 5]
</pre><br />
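Two related variants are worth knowing. The sketch below uses made-up data: uniq with a block decides uniqueness by a computed key and keeps the first element seen for each key, while uniq! modifies the array in place.<br />

```ruby
# Made-up documents; only the first document per language is kept.
docs = [
  { id: 1, lang: "fi" },
  { id: 2, lang: "en" },
  { id: 3, lang: "fi" }
]
first_per_lang = docs.uniq { |doc| doc[:lang] }

# uniq! changes the receiver and returns nil when nothing was removed.
a = [1, 1, 2, 3, 4, 4, 5]
a.uniq!
nothing_removed = a.uniq!
```

Because uniq! returns nil when the array was already unique, avoid chaining further calls on its result.<br />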
2. <a href="http://ruby-doc.org/stdlib-2.1.4/libdoc/optparse/rdoc/OptionParser.html">Command line options</a> parsing:<br />
<br />
<pre class="ruby" name="code">require 'optparse'

class Optparser
  def self.parse(args)
    options = {}
    options[:source_dir] = []
    OptionParser.new do |opts|
      opts.banner = "Usage: example.rb [options]"
      opts.on("-v", "--[no-]verbose", "Run verbosely") do |v|
        options[:verbose] = v
      end
      opts.on("-o", "--output OUTPUTDIR", "Output directory") do |o|
        options[:output_dir] = o
      end
      # may be given several times; every value is collected
      opts.on("-s", "--source SOURCEDIR", "Source directory") do |s|
        options[:source_dir] << s
      end
    end.parse!(args)
    options
  end
end

options = Optparser.parse(ARGV)
# pp options
# When executed with -h, this script will automatically show the options and exit.
</pre><br />
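A parser like this is easier to try out when it works on an explicit array rather than on ARGV. The sketch below is a variation written for illustration: the parse_cli helper, the option names --output and --source and the sample arguments are chosen for this example. It also returns the positional arguments that are left over after parsing.<br />

```ruby
require 'optparse'

# Hypothetical helper: parses an explicit argument array and returns
# the collected options together with the remaining positional arguments.
def parse_cli(args)
  options = { verbose: false, source_dirs: [] }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: example.rb [options] FILE..."
    opts.on("-v", "--[no-]verbose", "Run verbosely") { |v| options[:verbose] = v }
    opts.on("-o", "--output OUTPUTDIR", "Output directory") { |o| options[:output_dir] = o }
    opts.on("-s", "--source SOURCEDIR", "Source directory (repeatable)") do |s|
      options[:source_dirs] << s
    end
  end
  # parse! removes recognized options from the array it is given,
  # so work on a copy and hand back what is left.
  rest = parser.parse!(args.dup)
  [options, rest]
end

options, rest = parse_cli(%w[-v -o out -s src1 --source src2 input.txt])
```

Passing the arguments in explicitly also makes the parser straightforward to cover with unit tests.<br />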
3. Delete the key-value pairs of a hash whose keys match a certain condition: <br />
<br />
<pre class="ruby" name="code">hashMap.delete_if {|key, value| key == "someString" }
</pre><br />
Of course, the condition can also be a regular expression match or a custom function applied to, say, the key.<br />
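<br />
Here is a short sketch of both variants. The settings hash and the temporary? helper are made up for illustration; reject is shown as well because it returns a filtered copy instead of modifying the hash.<br />

```ruby
settings = {
  "solr.host" => "localhost",
  "solr.port" => 8983,
  "tmp.cache" => "/tmp/cache",
  "tmp.lock"  => "/tmp/lock"
}

# Hypothetical custom condition on the key.
def temporary?(key)
  key.start_with?("tmp.")
end

# Regular expression on the key; reject leaves the original hash untouched.
cleaned_copy = settings.reject { |key, _value| key =~ /\Atmp\./ }

# Custom function on the key; delete_if modifies the hash in place.
settings.delete_if { |key, _value| temporary?(key) }
```

After both calls, cleaned_copy and settings hold the same two solr.* entries.<br />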
<br />
<br />
4. Interacting with MySQL: I use the <a href="http://www.rubydoc.info/gems/mysql2/0.3.16/frames" target="_blank">mysql2</a> gem. Check out the documentation; it is pretty self-explanatory.<br />
<br />
5. Working with Apache Solr: rsolr and rsolr-ext are invaluable here:<br />
<br />
<pre class="ruby" name="code">require 'rsolr'
require 'rsolr-ext'
solrServer = RSolr::Ext.connect :url => $solrServerUrl, :read_timeout => $read_timeout, :open_timeout => $open_timeout
doc = {"field1"=>"value1", "field2"=>"value2"}
solrServer.add doc
solrServer.commit(:commit_attributes => {:waitSearcher=>false, :softCommit=>false, :expungeDeletes=>true})
solrServer.optimize(:optimize_attributes => {:maxSegments=>1}) # single segment as output
</pre><br />
</div>Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-31015628412335355882014-09-23T14:50:00.000+03:002016-12-29T09:27:26.452+02:00Indexing documents in Apache Solr using custom update chain and solrj api<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
This post shows how to target a custom update chain with the solrj API when indexing documents in Apache Solr. It exists because I spent more than an hour figuring this out, which warrants a blog post (hopefully for others' benefit as well).</div>
<br />
<h3>
Setup</h3>
<br />
<div style="text-align: justify;">
Suppose that you have a default update chain that is executed in everyday situations, i.e. for the majority of input documents:</div>
<br />
<pre class="xml" name="code"><updateRequestProcessorChain name="everydaychain" default="true">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
</pre>
<br />
<div style="text-align: justify;">
In some specific cases you would like to execute a slightly modified update chain, in this case with a factory that drops duplicate values from document fields. For that purpose you have configured a custom update chain:</div>
<br />
<pre class="xml" name="code"><updateRequestProcessorChain name="customchain">
  <processor class="solr.UniqFieldsUpdateProcessorFactory">
    <lst name="fields">
      <str>field1</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
</pre>
<br />
<div style="text-align: justify;">
Your update request handler looks like this:</div>
<br />
<pre class="xml" name="code"><requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">everydaychain</str>
  </lst>
</requestHandler>
</pre>
<br />
<div style="text-align: justify;">
Every time you hit /update from your solrj backed code, you'll execute document indexing using the "everydaychain".</div>
<br />
<h3>
Task</h3>
<br />
<div style="text-align: justify;">
Using solrj, index documents against the custom update chain. </div>
<br />
<h3>
Solution</h3>
<br />
<div style="text-align: justify;">
Before diving into the solution, here is the code you would use for the normal indexing process from Java, i.e. with the everydaychain:</div>
<br />
<pre class="java" name="code">import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

HttpSolrServer httpSolrServer = null;
try {
    httpSolrServer = new HttpSolrServer("http://localhost:8983/solr/core0");
    SolrInputDocument sid = new SolrInputDocument();
    sid.addField("field1", "value1");
    httpSolrServer.add(sid);
    httpSolrServer.commit(); // hard commit; could be soft too
} catch (Exception e) {
    e.printStackTrace();
} finally {
    // release the connection whether or not indexing succeeded
    if (httpSolrServer != null) {
        httpSolrServer.shutdown();
    }
}
</pre>
<br />
<div style="text-align: justify;">
So far so good. Next, let's turn to indexing with a custom update chain. This part is non-obvious from the point of view of the solrj API design: given an instance of SolrInputDocument, how would one reach a custom update chain? Notice how the update chain is defined in the update request handler of your solrconfig.xml: it uses the update.chain parameter name. Luckily, this is an HTTP parameter that can be supplied to the /update endpoint. Trying to set it through the HTTP client of the httpSolrServer object led nowhere.</div>
<br />
<div style="text-align: justify;">
It turns out that you can use the UpdateRequest class instead. It has a handy setParam() method that lets you set a value for the update.chain parameter:</div>
<br />
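<pre class="java" name="code">import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

// updateURL and log are assumed to be defined by the surrounding class
HttpSolrServer httpSolrServer = null;
try {
    httpSolrServer = new HttpSolrServer(updateURL);
    SolrInputDocument sid = new SolrInputDocument();
    // dummy field
    sid.addField("field1", "value1");
    UpdateRequest updateRequest = new UpdateRequest();
    updateRequest.setCommitWithin(2000);
    updateRequest.setParam("update.chain", "customchain");
    updateRequest.add(sid);
    UpdateResponse updateResponse = updateRequest.process(httpSolrServer);
    // getStatus() reports the status from Solr's response header (0 on success), not the HTTP code
    if (updateResponse.getStatus() == 0) {
        log.info("Successfully added a document");
    } else {
        log.info("Adding document failed, status code=" + updateResponse.getStatus());
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    // release the connection whether or not indexing succeeded
    if (httpSolrServer != null) {
        httpSolrServer.shutdown();
        log.info("Released connection to the Solr server");
    }
}
</pre>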
<br />
Running the second snippet makes the LogUpdateProcessor write the following line to the Solr log:<br />
<br />
<pre class="bash" name="code">org.apache.solr.update.processor.LogUpdateProcessor –</pre>
<pre class="bash" name="code"> [core0] webapp=/solr path=/update params={wt=javabin&</pre>
<pre class="bash" name="code"> version=2&<b>update.chain=customchain</b>}
</pre>
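<div style="text-align: justify;">
The log line shows that update.chain travels as an ordinary request parameter, so in principle any HTTP client can select the chain, not only solrj. As a rough sketch, here is how such a request could be built in Ruby with the standard library only. It assumes that your /update handler accepts JSON documents sent with the application/json content type; the update_request helper and the core URL are illustrative, and the request is only built, not sent.</div>

```ruby
require 'net/http'
require 'uri'
require 'json'

# Hypothetical helper: builds (but does not send) an update request that
# selects the update chain through the update.chain request parameter.
def update_request(core_url, docs, chain, commit_within_ms)
  uri = URI("#{core_url}/update")
  uri.query = URI.encode_www_form("update.chain" => chain,
                                  "commitWithin" => commit_within_ms)
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(docs)
  request
end

request = update_request("http://localhost:8983/solr/core0",
                         [{ "field1" => "value1" }], "customchain", 2000)
puts request.path
```

<div style="text-align: justify;">
To actually send it, hand the request to an open connection, for example inside a Net::HTTP.start block via http.request(request).</div>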
<br />
That's it for today. Happy indexing!</div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com3tag:blogger.com,1999:blog-7093676993132865793.post-45208002591086870912014-09-17T17:49:00.000+03:002014-09-17T20:30:25.321+03:00Exporting Lucene index to xml with Luke<div dir="ltr" style="text-align: left;" trbidi="on"><div dir="ltr" style="text-align: left;" trbidi="on"><div dir="ltr" style="text-align: left;" trbidi="on"><div style="text-align: justify;"><a href="https://github.com/dmitrykey/luke" target="_blank">Luke</a> is the open source Lucene toolbox originally written by Andrzej Bialecki and currently maintained by yours truly. The tool allows you to introspect into your solr / lucene index, check it for health, fix problems, verify field tokens and even experiment with scoring or read the index from HDFS.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">In this post I would like to illustrate one particular feature of Luke that allows you to dump the index into XML for external processing.</div><div style="text-align: justify;"><br />
</div><h3 style="text-align: justify;">Task</h3><div style="text-align: justify;">Extract indexed tokens from a field to a file for further analysis outside Luke.</div><h3 style="text-align: justify;"> </h3><h3 style="text-align: justify;">Indexing data</h3><div style="text-align: justify;">In order to extract tokens you need to index your field with term vectors configured. Usually, this also means that you need to configure positions and offsets.</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;">If you are indexing using Apache Solr, you would configure the following on your field:</div><div style="text-align: justify;"><br />
</div><div style="text-align: justify;"><pre class="xml" name="code"><field name="Contents" type="text" indexed="true" stored="true" omitNorms="false" termVectors="true" termPositions="true" termOffsets="true" />
</pre></div></div></div><div style="text-align: justify;"><br />
With this line you make sure your field stores its contents in addition to indexing them; it will also store the term vectors, i.e. each term with its positions and offsets in the token stream.</div><h3 style="text-align: justify;"> </h3><h3 style="text-align: justify;">Extracting index terms</h3>One way to view the indexed tokens with Luke is to search / list documents, select the field with term vectors enabled and click the TV button (or right-click and choose "Field's Term Vector").<br />
<br />
[Screenshot omitted]<br />
<br />
<br />
If you would like to extract this data into an external file, you can currently do so via the menu Tools->Export index to XML:<br />
<br />
<img alt="" src="data:;base64,iVBORw0KGgoAAAANSUhEUgAAAakAAAEsCAIAAAD2M9tnAAAAA3NCSVQICAjb4U/gAAAAEHRFWHRTb2Z0d2FyZQBTaHV0dGVyY4LQCQAAF8VJREFUeNrtnV9sXNldx4+rEa0aE1ZgYRoRxyEKcRNl48TBxm7SpRWLaFktD5SF3bBSt1rRSoiKB3hgH+CBB3jsAwJVWipYMFtFBYSlUnnRajfrKJaH2OtkG69d4zjxrqJaNSVtExDVjM3DaU5Ozr977r+ZOzOfj0bWuef+zp977r3f+f3u9ZzTt33nbaHxZ3/xN/d/+P2+D+3pmZ/7g+eFEF/50t9/5GcHv/jcBdEtDB44bRw+AHQ0X/36f1x4ejxsMz1T/+1f+4Vas7mr5770Ry/IxJdfnhFC3PnO+1NPnRVC3Pv+vQ/v/sTv/dazhr3iwMExI+fOe4tlHNuBg2O+mgO7fPgOBwA6lGZzL+bGrzUe2F343S8ePX5Epk//4qknX/iYbnrxK//ykaHBRrDSrVtX9c1GRA+yEag5baPldRIA2qR9u8obU86Qsdlo7j30+1756y/97Stfv/Od9/v377u5vHVzeUvm/8YXnhJC/O/OD1/8/WfCXpK99/CR8c2NurGp/spMw0ClVb5hL4QYGj5rFDT6ENOEzFR9tpt2dp4LC6Di6A7NgYNjW7euSsXQdzWbu7VG46Hd7zz3aZn4h3/8t/e+d7N/f/+nP/OkEOIbX3vt50YP6pbuJi2D9bWFw0fG19cWhBBHj02sry1IGz1TT8siRr5ur+pxNqdnOovrTUixk/bOpn2dB4Dq+33K+1HCt7lRV75Oo/HA7/urr76iAt7+/fuefOFjQjyMeVeufuvP/+SPEx+NSQVRrK7MG/6gSqyuzMv06sr8yPHJ+Hyfg2nv8hVJ27SdAIBK+30PfJT1tQUlSobv0mzu1uRzwc//5vN/+eW/+1D/B+WO62+9e+vWrc+/9Nn+/f3ffn/74xOfiHl8eOOdK4/K0J7MHDk+KRPNhw7nnm2ZId+lfRmrkp2M6TwAVN/vM27qo8cmlDcmb/Ba44HdF158Xi//hy/9af/+fiHEW69d+fH9+xoRXo/T5tTouWvLl4UQJ05OyYRtqTbT5gf6kLYq1Tc939l5AKi29u3Je1b5ZDI9cnxS+WcN5fcZ/NMb/zr+ydMy/d/f/t5nf/25uNfGe5G+2KnRc0uLc0KIM2PnY/IL9PsSm5a7VNpXIQBUNObV/Jtry5cbzd1ry5dPjZ7TdzWbe7VGw+FD3X+/8auf+7hMD/zkgNPGRqqJor5waXziifrCJVm8vnDpzNj5+sIllVZmymB84gk7XwhhdEDayKqsUN9dxG5CplW+3n/ZtK/zAFB9v0+6L8plMTYbjd2++Stv2oUvzvyz/NcWIcQ3vvbad2/9QAjx0z/1M8PDw//5XzeemnoqT88mp37J2SgAQE7qN741+fixsM389bXxEz9fc76+3Pdj+9dXNuSb30995ldU/vrKxncXf9CcyPvGk3emAFAS89fXYiSo5vxhwy9/8hNvvPHWWzPz6oe9H2x8+P9q/yOEELUCfgvBrykAoAxOHT0SKUF9r7/+74wXAPQatUZjd3XrNgMBAL3DyNChmkwlzvoCANAdTM/UhRAfYCAAoAdB+wAA7QMAQPsAANA+AAC0DwAA7QMA6CxqPXjM8r97AEAh/8O3y26N8L8t13rzTL/47DNc7gCSl1+9GKkX3eTiEPMCQC+C9gEA2gcAgPZVk4HB0YHBUSOnp87Zvv7HuHABchL7rkPpy872coxxjFmG4vqunK3YOnL/3l3nXpVvi45zV2I9AJXFXpNr8MBpfXP7zts9pH2FK04hFCV8UpJinClDvFSRff2P6SKobwJ0mvY51pO4896iLoX6ZjfHvIbY7WwvKx/QGXuqv4aZzNGLRBa3DXRLZ0TsLB6Ws/v37hryF5Ywfa9u5nT6fPXYLSr1lB
99k9sSWkCjuWd8jEy1eeDgmPwrEypt5KiCzrRh76snw6cA7cvmju1sL+uKKQVU5oRVyVk8cZcujnZDGR4IFuW7hevRBdewvH/vrtqrpwHK9vuMj5GpNoUQQ8NnNzfqmxv1ZnN3aPisEEJuyl1GWTutim9u1KW9r54MnzZoXzg+LTteTtuQLj2FKFohohl2JAFK9Psae8ZHCHH4yLj6rK8tqHyVNjbX1xZkVetrC4ePjDcae7KgnpZFVHFntaqeDJ9invd1EE4Xz6eD0plSvpXPBcusiTH16M4dNx5U83nf6sq8Sh89NqE2DWPfpnL0jL+rK/NHj004m0jsUn66TfvS+pUxT+IiFc2ZH6jHyOQNCVRD+/YSM9WmL99pZv8VQtx454pMjByfVGmVCHQpPx+IURPjBYVTX8LP1BIfvRX1P3rOdym+yn3CJJ+vyU+kR+ZULl89PLmD6sa8zV3jY2SqTTv/xMkpmT5xckrtvbZ8+cTJqWvLl420NHNWq+frddqZRtroTwF+ny5/vne+tkSKR5++2TXEF0/l9MX/K6Ie50b6XAHnLvwvfom1yc7g+kEF/b5To+dUemlxzun3LS3OnRk7ryx1M8NSpZcW52z7pcU5vUVfc750PH2zs7OrW7dLnbyhOv8SKJmeqTOPC4Di5VcvqjmsJh8/1h0HNX99zSdr0zP1h+vzAgCI0h6uVRC0DwB07dvtkSNthfZVKuAFgAAN/D4AwO9D+7oNfZJuAFB88+Ym2te1dM2KBADcGplh3mYA6EXQPgBA+wAA0D4AALQPAKB7SH7Pm7i8uaQr3xBlm1IBALpB+2J0zaePqVZ3K1az1JwuaZvOU9ZXVYa9ANB+7cv28+Y2ru6Wp60C+4m0AVSWqOd9GZYF8a3u5ltW3Lm+WswCb87V3QJLviUu5Ja4XJyzksCac0YR+9B8NgDQZu3LvxxcoocoXOuriUcXeItxMw3BFY+u6xZoyOmyOZt2VmIs3WmUUn1TZs6+GTYA0Hl+X0BTnBrhW1/NnuRZltLTqQLMQlaMy1NJTBGCZYCyiXreF7PgWyGuXwsopCHfanC+98K+Sf/T2gBAS7WvBdPatOxuL6Qh32pNgRfEMa992vVqCICY16d9e+FP2Iuxg1P5RF+/vX2vMuw61ZOybAIR01DhlRRlAwCtjXkz+X3xQVza9dUy6JRSzPwNOSuxxT2woJ2vb8S8AC0jeZ226Zn6+TMfDdcyt/Quk+IBQEeQYp22RmOX8QKAnot556+vMVIA0FvaRzALAN0Hc1gBANoHAID2AQCgfQAAaB8AQCfDnPUVgp/xAlRI+0QRc9YL/++0uOEBoKLal3/O+oDGIXwAUFntS/2bNlvpnFMl6z/71yclDfuJtoHTwTQy7YK+FvV8u9vOduMzY4wBoBLal39W+vBtbyumPcmVXsSYOsXItOdQ8RUMtGiknfX76nFm+mrwfSsAQKm0Ys565xI84Qk+A06Q0yZxyntfnTEFYzqWGM7HTMrP5QhQMb8v35z1vhWIOmiYwpPUC9e6IolOLgBUXfuKnbO+E1/sppplPpwJAB0U8+adsz5S+HzL3SbaxBTM3GKqtsLL/hbYcwAoP+bNPWe98L9+TYwibQNjOTQ9M62HFbmCmnC9kI2PecM12Cv8AkBFYt6Mz/sCbzMC8pch/Ix/1RDTYrYm4gPkmE4CQAX8PuasB4Ae1D7mrAeAntO+lk1S0PqIjxgToGdhDisAQPsAANA+AAC0DwAA7QMA6GSYs/4h3TeDtPEDmB45aoBitE9kmrPe/nlWSTcYt253jwznF9qpfYHftA0eOL19523nLi5ZAOh07XP/pu3AwTGRZoYr5woeganqfVMhGD/7D0R2MTUYBROXGYmc0T7yuETELA/x8/gbXfKNTODYIzvgG6XEdQJSDZfzKJjlH1qnfc4564eGzwohtm5djZ/R3l6dw75zYmai96Wdd3ieGmJEPKCVMfPXJ/Y2savOgvpQh+1jOiBc0/
2HT7RIv8CAnbaPInGdAIBIYuesHxo+OzR8Vk1SL4Vvc6MemLNeTVVvT1jvi44j53OPv9Az1KB7HzFry8VMr5/zeDNXWPiAxwtfwD7/acXvg1b5fQ/mrJeJo8cmhBDrawvhuewDE/B12YVb1Lx78TPj9+xwBaZHBCjxeV+zuTtyfFIIsboyn20i+46IUwIBYzj+zX9jh/O7I7jLeQhdNhpQ8Zj3R/6dFL4b71xJnLM+myNgr+WWU1Dy1JDtOWAhx5u252kPM+2AZ5umP2aU8pQt3O8G/D4r5tVcvGvLlyOnsA8sQ25sBiZ/jwn3fEF0UXPZp4rCEksZvTLcmfiF3zIfpm+ifF89gXUC0o5V4DATL6TApQKQgb7Z2dnVrduB/16enqmfP/PRM2PnhRBLi3NOm7mldzP/rqOykUtrOkbgBtBipmfqI0OHYuesry9cEr00eX2pkuT8bzgAqFzM21Nz1rfgNTSBG0AHaF/ZkxRU7eZvTX+QPID2whxWAID2AQCgfQAA3UotW7HICU0BANpF+F1FLXO9Lz77DIMLANXk5VcvEvMCAKB9AABoHwD0JrX8Vezrf0yl79+769yr8nVjo0iqegAA2ql9uiTZumZjiJcqsq//MV0E9U0AgCrGvEqk7t+7a8hfWML0vbqZ0+lDCgGgWtoXI21VqAcAoEjt0+PWQhQNsQOAqmufetJnCFZm/SqqHgCAAAW85w24b2kVzZkfqAcAoD3aZ7yfdb67iBQspxk+IABUUfv0d7uRwhRw7sL/4gcA0AExr8/AaR+jdKghABQFv2kDALQPAADtAwDoVrI/70ucGhAAoNu0r+yFKwEAiHkBANA+AAC0DwCgO7VvYHCU8wQAxZL9Pa8hSTvby5USSqM/znyfsW4gd+kH6zS2q4rvSWAkI7sdOJBUxhnsAXpO+yp4MyipMtKqq7rKBIzTirtdVaqeOHUtvtuBtlIZZ7AH6FHti1Qfw3Ewbvvw3oBLkvNWDJdVd368k1js6MV3O1AkbVdTVQ7Qu9pn65TyFPTb2FZDpysRuVeli7otw+6P0zWz97b+oaQxIPExb+aHBuggoH0hp8AQPlHcy4oy9MV3S6u2Eg1svS7P6fMZB8LSxIg1cCBEu4D2pb6H9YA3vy6UFH/5bumYez6sIy3z+MKdTBtE+557InyA9iXEvHrc6nxJmjnsMuJK27sUnqeHRh9kQlnq9mERCaieL/aP7ElAzmK6HWjL3mXnJA6g4FUvdCN9s7Ozq1u3q/P73GK9jDyBZBuPK2dP8NQAAkzP1EeGDnX57zpSSUCpetHKniB8AIlUTvu4bwGgF7UPAADtAwBA+wAA0D4AALQPAADtAwBA+wAA0D4AALQPANA+AAC0DwAA7QMAQPsAANA+AAC0DwAA7QMAQPsAANA+AAC0DwAA7QMAQPsAANA+AAC0DwAA7QMAQPsAAHpd+wYGRytbW2SFbWm04j3hKu2msR0YHFWfyGPJfHS1mN7sbC87N/VWZabRj53tZTsn5pB8ZhmGUlVVVJ0lUVL3jNNXtQu9jLNjX5aFjExRIxmup10nSw2a70b2DWxiZrjONo5ALfPZkkeijkraqE3dPvJgKq5N0EF+k3FZMizxN7jvRg4MrFE2cDvrlm0/NVHaJ903u6P6cabyPCMP2z4l4S8Z59eLOmcxNdjnzKg8/gvT91XpqzCye75GndefNDOu2kDrMf2P9xdiXIPw2bF7m3g6fJelM17RW7dPU8A+8wWQeEZ8l0H8Jeqz9F0k8S5I5vvdJynOa0mNQOIVm8HZT619hccg4btINzMGwv6ScZ7R8DkO1xA2Dn+VBVrXz2uqCmMqcQ6U80vL17eAhkaGjb7DDzwkiTk7iec6z72XODKJvQ0/EUockEC7kSciw/UfVqIMo+osm0GM7MEJfKmn+jrMpX2+c19gMOsTgvb6xonfb2m/AAt58OzU6JiBcn7Z+i7Z/HdF609HhwbpZcuTz5OI/45PvHPLe2Rf0v
jU0o5dOFAq6UlEu+Qv7BLGGJT0WNPngIQHyvdtGX5MU/YpLvx7tAWXZcsusNaciLTPoJx9SHWfxh9F/P2Sdnxqaa88456J0drww5QYs0D8n6j3zmc0iY8eCvH7nK+508YCMZU4B8o5AuGrNvDELfOXbczwpoqJwjeb77IMPGBKvK4CvTVGINL7SDwj8ZdZKn8n4PRFnsewmfM+jXywHnnGA35YWjXvm52dXd26feHp8U4JEEp1AMurnxeOVR6E3jk7XIdCiOmZ+sjQoZqAHI9L21gzwwsZzgKnIGPM2xGPe6pTLZdapwxCL5wmLkUDfs8LAGgfAADaBwCA9gEAoH0AAGgfAADaBwCA9gEAoH0AAGgfAADaBwCA9gEAoH0AAGgfAADaBwCA9gEAoH0AAGgfAADaBwBoHwAA2gcAgPYBAKB9AABoHwAA2gcAgPYBAFSVWqLFwOCovrmzvcyoAUD3a5+hdwODo8gfAPSE9vn8wZ3tZekVSjXUPURDLm0lzVA8MRMAoHTtM3xAQ8XULp+6xRfXbSIzAQDCRL3rkGKkS5LT51KbafPDZj57/D4AKNfva42+GF6hatoOb32ZnE4AaEXMWyyJ7p7udTozAQCKjHnTOm72c72AZ+f0+8JF4jMBAMr1+2Qc6otPE4XJeBmSGN7ynhcAytW++DcPMZZhzcqTiQgCQHti3rSxMABAx8e8ieEw3hkA9Jb2IXkA0IsxLwAA2gcAgPYBALSDhOd90zN1xggAOpELT49n177E8gAAFSTRbyPmBYBeBO0DALQPAKA3SH7ex+sOAOg57eNFBwAQ8wIAoH0AAGgfAADaBwCA9gEAoH0AAB2gfSyKBgA96vf51pwEAOhQCpizXpfCxCXZwpmBfPHoeuTxLQIAZPT79MWGBgZHbbnZ2V42FtiVZvITn2nk6zVnbpFzDAAZtS9RGcPeVniB3YCP5rPJ1iIAQOqYVzlTtqw4fSvnupS+xSrTemepWuQcA0B27Yvx+3z5ephsZzqD6AJbBADIFfPaj95stQo/aEvUNd8L5cCL5pwtAgB+X3anLzK8jck00rZ4pWoRAMBJ3+zs7OrW7erP00cYCwCFMD1THxk6VOnftPGvKgBQ0Zi3VAhjAaAXtQ/JA4CSYB4XAED7AADQPgAAtA8AAO0DAED7AADQPgAAtA8AAO0DAED7AADQPgAAtA8AAO0DAED7AADQPgAAtA8AAO0DAED7AADQPgBA+wAA0D4AALQPAKDbSFijcnqmzhgBQCdy4enx7NoXLgwAQMwLAID2AQCgfQAAlaLGEACE4Y2fcD36r86wZHstgfYBlHV3db36V2FYMkswMS8A9CJoHwCgfQAAaB8AQCEMDI5WrUu86wBIptncszMHD5xW6e07b6tMlY4nW6liydAH57DoI6NX6DNG+wCqrH27Rs6Bg2N33lvU73a1aRtna6IKh5nBXh+Z/MOC9gG0k4bLZ9Ezt25dbTT3hobPyptf5shdMlOZqcytW1f1XUYpHbsGWVw3kLXpdRoGOfsQOSyyTpWvp1XC2Rk9P7HnaB9AOx2ioeGzmxuP/HPZ5kb98JFxmSmLqE27iJ42SukcPjIua5ZpWWpzo66Ky7KyoJ6pp+0aUvUh1bD4iqsBcXZGH6iYnueHdx0AEQ5OY8/4rK8tyBtSflS+Yby+tqBvqr1CCH2XswnbUjZqV5VomZif2IfIYfFVEtMZe9B8xjF9w+8DKMvvW12ZV+mjxybUpmE8cnzSWZVhFvC2jh6bsC1XV+Zl/urKfGKdafPL8/t8jarDsQc2Tw/RPoCc2rcXb6Mbnzg5deOdK/qm0yzchF6DslQ1jxyfVAa+OtPm5xkWXyWJjaqj0I/IeezEvACtinmbu8bnxMkpfVPZGMb65omTUzFmxkfKnN2uXZVuabTly4/vQ+SwXFu+nNhbZ2f0Uk5je8zDfcPvAyjF71tanDs1ek7fVDYyf2lxzjBbWpw7M3be5/7opYyG1F7V0Jmx86
pF2YSzOWVwZuy8nR/fh1R+n7MPytjXGV/P7WMv6pz2zc7Orm7dZm56AB/TM/XJx49Vv5/jE0/UFy6VUfP89TXnHFZVGBZn3xJP6MjQIfw+gIwODv3slGEh5gXIfJPv0s/OHRa0DyAjjU5wcN588/UW97OB3weA38ewoH0A3cY3b24yCF02LGgfQAL8F0RXDgv/2wwAvQjaBwBoHwAA2gcAgPYBAKB9AABoHwBAZ/Gj/++bnqkzFgDQO/w/uRYxG8i0I/YAAAAASUVORK5CYII=" /><br />
<br />
In this case I have selected the doc id 94724 (note that this is Lucene's internal doc id, not the Solr application-level document id!), which is visible when viewing a particular document in Luke. This dumps the document into an XML file, including the fields in the schema and each field's contents. In particular, it will dump the term vectors (if present) of a field, in my case:<br />
<br />
<pre class="xml" name="code"><field flags="Idfp--SV-Nnum--------" name="Contents">
<val>CENTURY TEXT.</val>
<tv>
<t freq="1" offsets="0-7" positions="0" text="centuri" />
<t freq="1" offsets="0-7" positions="0" text="centuryä" />
<t freq="1" offsets="8-12" positions="1" text="text" />
<t freq="1" offsets="8-12" positions="1" text="textä" />
</tv>
</field>
</pre></div>Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0tag:blogger.com,1999:blog-7093676993132865793.post-10224220365041745312014-06-09T19:19:00.001+03:002014-06-09T19:19:03.384+03:00Low-level testing your Lucene TokenFilters<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
In the recent Berlin Buzzwords conference <a href="https://www.youtube.com/watch?v=KyA44hBB5t4&index=3&list=PLq-odUc2x7i-Q5gQtkmba4ov37XRPjp6n" target="_blank">talk on Apache Lucene 4</a>, Robert Muir mentioned Lucene's internal testing library. This library is essentially the collection of classes and methods that form the test bed for Lucene committers. As a matter of fact, the same library works perfectly well in your own code. Dawid Weiss has <a href="https://www.youtube.com/watch?v=-uVE_w8flIU">talked</a> about randomized testing with Lucene, which is not the focus of this post, but is a great way of running your usual static tests with randomization. </div>
<br />
<div style="text-align: justify;">
This post shows a few code snippets that illustrate how to use the Lucene test library to verify the consistency of your custom TokenFilters at a lower level than you might be used to.</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://p.blog.csdn.net/images/p_blog_csdn_net/caoxu1987728/EntryImages/20081113/5080_200706021800441.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://p.blog.csdn.net/images/p_blog_csdn_net/caoxu1987728/EntryImages/20081113/5080_200706021800441.jpg" height="213" width="320" /></a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-family: Arial, sans-serif; font-size: 13.63636302947998px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 18.18181800842285px; orphans: auto; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: auto; word-spacing: 0px;">(Credits: http://blog.csdn.net/caoxu1987728/article/details/3294145 </span><br />
<span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-family: Arial, sans-serif; font-size: 13.63636302947998px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 18.18181800842285px; orphans: auto; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: auto; word-spacing: 0px;">I'm including this fancy term graph to prove that posts with images are opened more often than those without. OK, it has relevant parts too: in particular, we are looking into creating our own TokenFilter alongside StopFilter, LowerCaseFilter, StandardFilter and PorterStemFilter.)</span><br />
<span style="-webkit-text-stroke-width: 0px; background-color: white; color: #333333; display: inline !important; float: none; font-family: Arial, sans-serif; font-size: 13.63636302947998px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: 18.18181800842285px; orphans: auto; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: auto; word-spacing: 0px;"><br />
</span> <br />
<div style="text-align: justify;">
In the naming-convention spirit of the <a href="http://dmitrykan.blogspot.fi/2014/03/implementing-own-luceneqparserplugin.html" target="_blank">previous post</a>, where custom classes started with the GroundShaking prefix, let's create our own MindBlowingTokenFilter class. For the sake of illustration, our token filter will take each term from the token stream, append the "mindblowing" suffix to it and emit the result as a new term at the same position as the original. This class will be the basis for writing unit tests.</div>
<br />
<pre class="java" name="code">package com.dmitrykan.blogspot;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

import java.io.IOException;

/**
 * Created by dmitry on 6/9/14.
 */
public final class MindBlowingTokenFilter extends TokenFilter {

    public static final String MIND_BLOWING_SUFFIX = "mindblowing";

    private final CharTermAttribute termAtt;
    private final PositionIncrementAttribute posAtt;
    // not used directly, but needed for complying with BaseTokenStreamTestCase assertions:
    // don't remove this, otherwise the low-level test will fail
    private final PositionLengthAttribute posLenAtt;
    // captured state of the original term, emitted on the next incrementToken() call
    private State save;

    /**
     * Construct a token stream filtering the given input.
     *
     * @param input the token stream to filter
     */
    protected MindBlowingTokenFilter(TokenStream input) {
        super(input);
        this.termAtt = addAttribute(CharTermAttribute.class);
        this.posAtt = addAttribute(PositionIncrementAttribute.class);
        this.posLenAtt = addAttribute(PositionLengthAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // emit the original term saved during the previous call
        if (save != null) {
            restoreState(save);
            save = null;
            return true;
        }
        if (input.incrementToken()) {
            // pass through zero-length terms
            int oldLen = termAtt.length();
            if (oldLen == 0) return true;
            int origPosInc = posAtt.getPositionIncrement();
            // save the original term with position increment 0,
            // so that it is stacked on the same position as the suffixed term
            posAtt.setPositionIncrement(0);
            save = captureState();
            // emit the suffixed term first, with the original position increment
            posAtt.setPositionIncrement(origPosInc);
            termAtt.append(MIND_BLOWING_SUFFIX);
            return true;
        }
        return false;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // drop any pending term when the stream is reused
        save = null;
    }
}
</pre>
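<br />
To picture how the two terms end up stacked on one position, here is a minimal plain-Java sketch that does not use Lucene at all. It only models what the filter above is meant to emit for whitespace-separated input. The class PositionIncrementSketch, its Token holder and the expand method are names made up for this illustration, and counting absolute positions from 0 is an assumption of the sketch, not something Lucene exposes in this form.<br />

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the filter's output; no Lucene classes involved.
public class PositionIncrementSketch {

    public static final String SUFFIX = "mindblowing";

    /** A term together with its position increment and resulting absolute position. */
    public static final class Token {
        public final String term;
        public final int positionIncrement;
        public final int position;

        Token(String term, int positionIncrement, int position) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.position = position;
        }
    }

    /** Emits the suffixed term first (increment 1), then the original term (increment 0). */
    public static List<Token> expand(String text) {
        List<Token> out = new ArrayList<>();
        int position = -1; // assumption: the first token lands on position 0
        for (String term : text.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // empty input yields a single empty string
            }
            position += 1; // increment 1 moves to the next position
            out.add(new Token(term + SUFFIX, 1, position));
            out.add(new Token(term, 0, position)); // increment 0 stays on the same position
        }
        return out;
    }

    public static void main(String[] args) {
        for (Token t : expand("your queries")) {
            System.out.println(t.term + " posInc=" + t.positionIncrement + " pos=" + t.position);
        }
    }
}
```

The absolute position is just the running sum of the increments, which is why a term with increment 0 shares the position of the term emitted right before it.<br />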
<br />
<div style="text-align: justify;">
The next thing we would like to do is to write a Lucene-level test suite for this class. We will extend BaseTokenStreamTestCase rather than a standard TestCase or another class from a testing framework you might be used to. The reason is that we'd like to use Lucene's internal test functionality, which lets you access and cross-check lower-level items such as term position increments, position lengths, start and end offsets, etc.<br />
<br />
You can see about the same information on Apache Solr's analysis page if you enable verbose mode. While the analysis page is good for visually debugging your code, the unit test is meant to run every time you change and build your code. If you decide to first visually examine the term positions and the start and end offsets with Solr, you'll need to wrap the token filter into a factory and register it in the schema on your field type. The factory code:<br />
<br />
<pre class="java" name="code">package com.dmitrykan.blogspot;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

import java.util.Map;

/**
 * Created by dmitry on 6/9/14.
 */
public class MindBlowingTokenFilterFactory extends TokenFilterFactory {

    public MindBlowingTokenFilterFactory(Map<String, String> args) {
        super(args);
    }

    @Override
    public MindBlowingTokenFilter create(TokenStream input) {
        return new MindBlowingTokenFilter(input);
    }
}
</pre>
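<br />
Registering the factory on a field type could look roughly like the fragment below. This is only a sketch: the field type name text_mindblowing is made up, and the whitespace tokenizer plus lowercase filter are my assumption, chosen to resemble the MockTokenizer setup used in the test. Your own analyzer chain will likely differ.<br />

```xml
<!-- hypothetical field type in schema.xml; the name and the chain are assumptions -->
<fieldType name="text_mindblowing" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.dmitrykan.blogspot.MindBlowingTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

The jar with the filter and the factory has to be on Solr's classpath for the class name to resolve.<br />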
<br />
Here is the test class in all its glory. </div>
<div style="text-align: justify;">
<br /></div>
<pre class="java" name="code">package com.dmitrykan.blogspot;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.Tokenizer;

import java.io.IOException;
import java.io.Reader;

/**
 * Created by dmitry on 6/9/14.
 */
public class TestMindBlowingTokenFilter extends BaseTokenStreamTestCase {

    private final Analyzer analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            Tokenizer source = new MockTokenizer(reader, MockTokenizer.WHITESPACE, true);
            return new TokenStreamComponents(source, new MindBlowingTokenFilter(source));
        }
    };

    public void testPositionIncrementsSingleTerm() throws IOException {
        String[] output = {"queries" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "queries"};
        // the position increment of the first term must be 1 and of the second must be 0,
        // because the second term is stored in the same position in the token filter stream
        int[] posIncrements = {1, 0};
        // this is dummy stuff, but the test does not run without it
        int[] posLengths = {1, 1};
        assertAnalyzesToPositions(analyzer, "queries", output, posIncrements, posLengths);
    }

    public void testPositionIncrementsTwoTerms() throws IOException {
        String[] output = {
                "your" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "your",
                "queries" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "queries"};
        // within each pair the position increment of the first term must be 1 and of the second must be 0,
        // because the second term is stored in the same position in the token filter stream
        int[] posIncrements = {1, 0, 1, 0};
        // this is dummy stuff, but the test does not run without it
        int[] posLengths = {1, 1, 1, 1};
        assertAnalyzesToPositions(analyzer, "your queries", output, posIncrements, posLengths);
    }

    public void testPositionIncrementsFourTerms() throws IOException {
        String[] output = {
                "your" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "your",
                "queries" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "queries",
                "are" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "are",
                "fast" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "fast"};
        // position increments follow the 1-0 pattern, because for each next term we insert a new term into
        // the same position (i.e. position increment is 0)
        int[] posIncrements = {
                1, 0,
                1, 0,
                1, 0,
                1, 0};
        // this is dummy stuff, but the test does not run without it
        int[] posLengths = {
                1, 1,
                1, 1,
                1, 1,
                1, 1};
        assertAnalyzesToPositions(analyzer, "your queries are fast", output, posIncrements, posLengths);
    }

    public void testPositionOffsetsFourTerms() throws IOException {
        String[] output = {
                "your" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "your",
                "queries" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "queries",
                "are" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "are",
                "fast" + MindBlowingTokenFilter.MIND_BLOWING_SUFFIX, "fast"};
        // both terms of a pair share the start offset of the original term in the input text
        int[] startOffsets = {
                0, 0,
                5, 5,
                13, 13,
                17, 17};
        // ... and its end offset, i.e. the index right after the last character of the original term
        int[] endOffsets = {
                4, 4,
                12, 12,
                16, 16,
                21, 21};
        assertAnalyzesTo(analyzer, "your queries are fast", output, startOffsets, endOffsets);
    }
}
</pre>
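<br />
If you wonder where the expected start and end offsets in the last test come from, they can be derived by hand from the input string. Below is a small plain-Java sketch (again without Lucene) that computes them for whitespace-separated input. The class OffsetSketch and its offsets method are hypothetical names for this illustration, and splitting on Character.isWhitespace is an assumption that only approximates what a whitespace tokenizer does.<br />

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mimics how a whitespace tokenizer assigns offsets.
public class OffsetSketch {

    /** Returns {startOffset, endOffset} for each whitespace-separated token; the end is exclusive. */
    public static List<int[]> offsets(String text) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // skip whitespace between tokens
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // consume the token itself
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                out.add(new int[] {start, i});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] o : offsets("your queries are fast")) {
            System.out.println(o[0] + "-" + o[1]);
        }
    }
}
```

Since the filter emits two terms per input token and both point at the same span of the original text, every offset pair appears twice in the expected arrays of the test.<br />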
<div style="text-align: justify;">
<br />
All tests should pass and yes, the same numbers are present on Solr's analysis page:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-WNFHcbneUjY/U5XdiaKIovI/AAAAAAAAMmg/QhD1GhEf6dw/s1600/MindBlowingTokenFilterAnalysisPage.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="MindBlowingTokenFilter solr analysis page" border="0" src="http://2.bp.blogspot.com/-WNFHcbneUjY/U5XdiaKIovI/AAAAAAAAMmg/QhD1GhEf6dw/s1600/MindBlowingTokenFilterAnalysisPage.png" height="172" title="MindBlowingTokenFilter solr analysis page" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<br />
Happy unit testing with Lucene!<br />
<br />
your <a href="http://www.twitter.com/dmitrykan" target="_blank">@dmitrykan</a></div>
<br /></div>
Dmitry Kanhttp://www.blogger.com/profile/18154816739397439235noreply@blogger.com0