Real-Time Full-Text Search with Luwak and Samza

felipesabino · on April 13, 2015

Samza's author really opened my mind to how useful stream processing is for high-performance data processing [1], and I think Samza's only bummer (for me, personally, today) is its lack of support for non-JVM languages [2]

[1] https://www.youtube.com/watch?v=fU9hR3kiOK0

[2] http://samza.apache.org/learn/documentation/0.7.0/comparison...

crazed_climber · on April 13, 2015

I totally agree! I can't wait for Martin Kleppmann's "Designing Data-Intensive Applications" to be complete. I've read the chapters available through the early release and highly recommend the book based on what I've seen so far!

http://dataintensive.net/

d3fmacro · on April 13, 2015

Apache Storm supports non-JVM languages: https://storm.apache.org/

felipesabino · on April 14, 2015

Yep! Also, Spark has had a Python API for Spark Streaming as of v1.2 [1]

[1] https://spark.apache.org/docs/1.2.0/streaming-programming-gu...

vosper · on April 13, 2015

I think this is the first post I've seen about Samza from a non-LinkedIn team. I'd love to hear any details about the Samza experience - it seems like it should be the logical choice for Kafka users, but there's not much out there about it.

rb2k_ · on April 13, 2015

I don't think Confluent really counts as a "non-LinkedIn team" :)

"Jay is co-founder and CEO at Confluent. Prior to Confluent, Jay Kreps was the initial developer on several open source projects, including Apache Kafka, Apache Samza, Voldemort. He was the lead architect for data infrastructure at LinkedIn."

(Martin Kleppmann also has a LinkedIn background)

vosper · on April 13, 2015

Ahh, thanks - I checked that the speakers weren't currently working for LinkedIn, but I didn't look further into their backgrounds.

Oh, well. I still hope to one day read about someone else using Samza in production.

martinkl · on April 13, 2015

Here are a few production users: https://cwiki.apache.org/confluence/display/SAMZA/Powered+By

The Metamarkets team wrote a nice post on their use of Samza a few days ago: https://metamarkets.com/2015/simplicity-stability-and-transp...

felipesabino · on April 14, 2015

Interesting. I would be very interested in learning more about Metamarkets' transition to Samza, as a year ago they were using Storm instead [1] [2]

Or maybe they did not transition and are actually using both, I don't know.

[1] https://youtu.be/3Qb_2GGRz24?t=20m24s

[2] https://storm.apache.org/documentation/Powered-By.html

AznHisoka · on April 13, 2015

This is long long long overdue in SOLR. The percolator feature is what has made me stick with ElasticSearch for the past 3 years, and has contributed to its increasing popularity over SOLR.

huskyr · on April 13, 2015

Bit offtopic, but does anyone have suggestions for simple full-text search engines? I basically want something for names, just a couple of thousand, nothing fancy. Setting up something like ElasticSearch seems like overkill (and it is quite hungry for resources as well). I was thinking about simply hacking something together with Redis and Python, but I suppose someone might have a better solution.

ignoramous · on April 13, 2015

Sorry if I come off as naive (as I don't really understand what a 'full text search' is), but for a couple thousand names wouldn't grep with regex wildcards suffice?
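
To make the idea concrete, here is a rough Python equivalent of a case-insensitive grep over a small list. It is only a sketch: the `search_names` helper and the sample names are made up for illustration, and a real list would presumably be read from a file.

```python
import re

def search_names(names, pattern):
    """Return the names matching a case-insensitive regex, roughly like `grep -i -E`."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [name for name in names if rx.search(name)]

# Hypothetical sample data standing in for the "couple thousand names".
NAMES = ["Martin Kleppmann", "Jay Kreps", "Marta Klein", "Kay Martin"]

# Any word starting with "mart", in any position of the name.
matches = search_names(NAMES, r"\bmart\w*")
print(matches)
```

A linear scan like this touches every name on every query, which is fine at a few thousand entries but gives no ranking, stemming or alias handling.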

huskyr · on April 13, 2015

That doesn't sound like a bad suggestion at all. I think the list might get bigger over time, or I might have aliases for the names, and in the end, using just regex, I'll probably hit a ceiling sooner rather than later.

YorickPeterse · on April 13, 2015

If you're using PostgreSQL you can take advantage of its full text search support. When doing so, make sure to save the text search vectors in a physical column (and index it), as otherwise queries will be quite slow.

huskyr · on April 13, 2015

I would definitely take that option if we used PostgreSQL, but we don't. PostgreSQL sounds like a wonderful piece of software, but they need to work on making it more accessible and user friendly. I tried getting something up and running, but given that there isn't a proper native client for Mac I gave up :(

cglace · on April 13, 2015

It is really easy to install using http://brew.sh/

huskyr · on April 14, 2015

The command-line tools and executables are indeed easy to install, but it's lacking in visual tools. Say, the equivalents of Sequel Pro and phpMyAdmin.

ewams · on April 14, 2015

http://www.pgadmin.org/

http://phppgadmin.sourceforge.net/doku.php

richardbrevig · on April 14, 2015

What exactly are you trying to do?

ElasticSearch isn't that difficult, and the new guide they published a few months ago walks you through getting it set up without going too in-depth [1]. As a beginner I had it running on a $5 DigitalOcean droplet within a couple of hours and had indexed way over "a couple thousand" documents before the end of the day.

But if your needs are really that simple, MySQL does support full-text search [2].

[1] http://www.elastic.co/guide/en/elasticsearch/guide/current/i...

[2] https://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

huskyr · on April 14, 2015

Yes, I've looked into ES before, and if my use case were anything more complicated I would definitely go that route. However, it's still quite some work to set up, and they recommend at least 8GB of RAM for a working instance [1].

I didn't know MySQL had fulltext search as well. I'll look into that. Thanks.

[1] http://www.elastic.co/guide/en/elasticsearch/guide/master/ha...

mtrn · on April 13, 2015

I've worked a bit with a nice search library in Go called bleve[1]. That said, it is a library and you would have to implement a server component yourself.

Bleve comes with a few command line utils: https://github.com/blevesearch/bleve/tree/master/utils

[1] http://www.blevesearch.com/

huskyr · on April 14, 2015

Thanks!

ddorian43 · on April 13, 2015

Python has a search engine library: Whoosh

huskyr · on April 13, 2015

Whoosh sounds like pretty much what I'm looking for. Thanks!

mhuffman · on April 14, 2015

Also, if you are using Python, sqlite3 has some pretty advanced text searching capabilities that will scale past your "few thousand" with no problem.
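
A minimal sketch of what that could look like, assuming the SQLite build bundled with your Python includes the FTS5 extension (the code falls back to FTS4 if it does not). The `people` table, the helper names and the sample data are all made up for illustration.

```python
import sqlite3

def build_index(names):
    """Create an in-memory full-text index over a list of names."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE people USING fts5(name)")
    except sqlite3.OperationalError:
        # Assumption: builds without FTS5 usually still ship FTS4.
        conn.execute("CREATE VIRTUAL TABLE people USING fts4(name)")
    conn.executemany("INSERT INTO people (name) VALUES (?)",
                     [(n,) for n in names])
    conn.commit()
    return conn

def search(conn, query):
    """Run a full-text MATCH query and return names in insertion order."""
    rows = conn.execute(
        "SELECT name FROM people WHERE people MATCH ? ORDER BY rowid",
        (query,),
    )
    return [row[0] for row in rows]

# Hypothetical sample data.
NAMES = ["Martin Kleppmann", "Jay Kreps", "Marta Klein", "Kay Martin"]
conn = build_index(NAMES)

print(search(conn, "klep*"))   # prefix query on any token
print(search(conn, "martin"))  # whole-token query, case-insensitive
```

Swapping `":memory:"` for a file path would keep the index on disk between runs, with no server to operate.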

gearhart · on April 13, 2015

Great write-up.

If you're interested in this and you live in (or would like to live in) London, we're hiring. Email's in my profile.

phpnode · on April 13, 2015

Offtopic: I've seen a number of blog posts using this hand-drawn diagram style recently. Does anyone happen to know how it's done? (Answers other than "by hand" appreciated.)

spdustin · on April 13, 2015

Looks like Paper, by FiftyThree:

https://appsto.re/us/KfqkE.i

martinkl · on April 13, 2015

Correct. If you're going to try it, I recommend getting a stylus for your iPad, since handwriting with your fingertip doesn't work very well.

spdustin · on April 13, 2015

Pencil (by FiftyThree) works great, of course, and unlocks the additional features in the Paper app. I'm actually quite impressed by how well it works.

http://www.fiftythree.com/pencil