Data IO 2013 conference – my notes
These are my notes from today's Data IO conference.
Next Generation Search with Lucene and Solr 4
Lucene 4
- near-real-time indexing (Twitter uses it to index 500 million new tweets/day)
- can plug in your own scoring model
- flexible index formats
- much improved memory use, faster regexes, etc.
- new autocomplete suggester
Solr (Lucene server – managed by the same team as Lucene)
- pivot faceting – anticipating the next question: if someone chooses the red shirt, do we have large in stock? (see the sketch after this list)
- improved geo-spatial search (all Mexican restaurants within 5 blocks, plus function queries to rank them)
- distributed/sharded indexing and search
- Solr as a NoSQL data store
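Here's a minimal sketch of what the pivot-facet question above could look like through SolrJ (Solr's Java client, as of the Solr 4 era); the core URL and the color/size field names are invented for illustration:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.PivotField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PivotFacetExample {
        public static void main(String[] args) throws SolrServerException {
            // Hypothetical Solr 4 core holding shirt inventory
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/products");

            SolrQuery query = new SolrQuery("shirt");
            query.setFacet(true);
            // For each color, break the counts down by size:
            // "if someone chooses the red shirt, do we have large in stock?"
            query.add("facet.pivot", "color,size");

            QueryResponse response = solr.query(query);
            for (PivotField color : response.getFacetPivot().get("color,size")) {
                System.out.println(color.getValue() + ": " + color.getCount());
                for (PivotField size : color.getPivot()) {
                    System.out.println("  " + size.getValue() + ": " + size.getCount());
                }
            }
        }
    }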
Uses
- recommendation engine (LinkedIn uses Lucene for people recommendations). Recommend content to people who exhibit certain behaviors
- avoid flight delays – facet on flights out of origin airports, pivot to destination airports (O'Hare to Newark); with origin, destination, carrier, flight, and delay times you can look at trends over time. Solr has a stats package that gives you averages, max, min, etc.
- for local search, how to show only shops that are open? (Yelp also uses Lucene).
You added Zookeeper to your stack, now what?
Old way of system management: active and backup servers, frantically switch to backup when active fails
Common challenges with big distributed system
- Outages
- Coordination
- Operational complexity
A common deficiency: sequential consistency (handling everything in the “right” order, when data is coming from multiple places)
- Zookeeper is a distributed, consistent data store – strictly ordered access
- Can keep running as long as only a minority of member nodes are lost (usually want to run 3 or 5 nodes)
- all data stored in memory (50,000 ops/sec)
- optimized for read performance (not write); also not optimized for giant pieces of data
- it’s a coordination service
- A leader node is elected by all the members. The leader manages writes (it proposes a write to the followers; once a majority acknowledge it, it is committed and written)
- nodes can have data, and can have child nodes
- has “ephemeral nodes” – created when a client connects, destroyed when client disconnects (these do not ever have child nodes)
- watches: clients can be kept informed about data state changes (a watch just tells you the data has changed, not what it changed to – you have to read it again if you want the current value; see the sketch after this list)
ZooKeeper is the open-source equivalent of Google's Chubby
- good for discovery services (like DNS)
- Use cases: Storm and HBase, Redis – http://www.slideshare.net/ryanlecompte/handling-redis-failover-with-zookeeper
- Distributed locking
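To make the ephemeral-node and watch ideas above concrete, here's a sketch using the stock ZooKeeper Java client – a service registering itself, and a watch on that registration. The paths and data are hypothetical, and it assumes the parent path already exists:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ServiceRegistration {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, null);

            // Ephemeral node: exists only while this client's session is alive,
            // so other clients can discover which instances are currently up
            zk.create("/services/search/instance-1",
                      "host1:8983".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);

            // Watch: fires once when the node changes, telling you *that* it
            // changed, not what it changed to - re-read (and re-watch) to see
            Watcher watcher = new Watcher() {
                public void process(WatchedEvent event) {
                    System.out.println("changed: " + event.getPath());
                }
            };
            byte[] data = zk.getData("/services/search/instance-1", watcher, null);
            System.out.println(new String(data));
        }
    }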
Beware – Zookeeper can be your single point of failure if you don’t have appropriate monitoring and fallbacks in place
Graph Database Use Cases
- nodes connected by relationships
- no tables or rows
- nodes are property containers
- Cypher is Neo4j's query language
- http://github.com/maxdemarzi/neo_graph_search
- http://maxdemarzi.com/2012/10/18/matches-are-the-new-hotness
- also used for content management, access control, insurance risk analysis, geo routing, asset management, bioinformatics
- "what drug will bind to protein X and not interact with drug Y?" (see the Cypher sketch at the end of this section)
- http://neovisualsearch.maxdemarzi.com
- performance factors: graph size (doesn't matter), query degree (how many hops – this is what matters), and graph density. An RDBMS doesn't scale well with data size; Neo4j does
- the more connected the data, the better it fits a graph db
- NoSQL – 4 categories – key value, column family, document db, graph db
- a popular combo is, e.g., MongoDB for the data and Neo4j for searching it (then hydrate the search results from MongoDB)
- optimized for graph traversal, not, e.g., aggregate analysis of all the nodes
- top reasons to use it: problems with RDBMS join performance, evolving data set, domain shape is naturally a graph, open-ended business requirements
- Gartner's 5 graphs: social, interest, intent, mobile, payment
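The drug/protein question above translates naturally into Cypher. Here's a sketch of it run through Neo4j's embedded Java API (this uses the execute() call from a later 2.x-era release; the labels, relationship types, and property names are all invented for illustration):

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Result;
    import org.neo4j.graphdb.Transaction;

    public class DrugBindingQuery {
        // Hypothetical schema: (:Drug)-[:BINDS_TO]->(:Protein) and
        // (:Drug)-[:INTERACTS_WITH]-(:Drug)
        static final String CYPHER =
            "MATCH (d:Drug)-[:BINDS_TO]->(:Protein {name: 'X'}) " +
            "WHERE NOT (d)-[:INTERACTS_WITH]-(:Drug {name: 'Y'}) " +
            "RETURN d.name AS drug";

        public static void printCandidates(GraphDatabaseService db) {
            try (Transaction tx = db.beginTx()) {
                Result result = db.execute(CYPHER);
                while (result.hasNext()) {
                    System.out.println(result.next().get("drug"));
                }
                tx.success();
            }
        }
    }

Note how the query's cost depends on how many hops it makes out from the starting protein, not on how big the rest of the graph is – the "query degree" point above.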
Parquet
I didn't take notes during this one (a drop of water from the bottom of my glass got under my Mac trackpad, and my mouse was going crazy for a while)
All the data and still not enough?
- No matter how much data you have, it’s never enough or never seems like the right type
- Predictive modeling – will someone default on a loan? Look at data for people who’ve had loans, and who defaulted and didn’t. Use data to make a predictive risk model
- IID = independent and identically distributed
Example: IBM sales force optimization
- Can we predict where the opportunities are – which companies have growing IT budgets?
- Couldn't know the most important thing – where these target companies were spending their IT budgets (not disclosed)
- Companies who are spending with us are probably representative of similar-sized companies in the same market – use the "nearest neighbor" technique (see the sketch after this list)
- Compared the model's predictions to expert salesmen's opinions; they matched, except for 15% of them, where the experts put the chances at zero. Why the difference? The model had misidentified some of the companies (there's no good way to cross-reference millions of internal customer records with independent sources)
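A toy sketch of the nearest-neighbor idea (the features and names here are invented): estimate a company's likely spend from the k most similar companies whose spend with us we already know.

    import java.util.Arrays;
    import java.util.Comparator;

    public class NearestNeighborSpend {
        // known: one feature vector per known customer, e.g. {employees,
        // revenue, ...}, normalized to comparable scales; knownSpend[i] is
        // what company i actually spends with us
        static double estimateSpend(double[][] known, double[] knownSpend,
                                    double[] target, int k) {
            Integer[] idx = new Integer[known.length];
            for (int i = 0; i < idx.length; i++) idx[i] = i;
            // Sort the known companies by distance to the target company
            Arrays.sort(idx, Comparator.comparingDouble(i -> distance(known[i], target)));
            // Average the spend of the k nearest neighbors
            double sum = 0;
            for (int i = 0; i < k; i++) sum += knownSpend[idx[i]];
            return sum / k;
        }

        static double distance(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }
    }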
Siemens – computer-aided detection of breast cancer
- patient IDs ended up predicting the odds of cancer. It turned out the ID was a proxy for location (whether the patient was at a treatment facility or a screening facility)
Display ad auctions – how do we decide who to target?
- multi-armed bandit – exploration vs. exploitation (see the first sketch after this list)
- what do we know? The URLs you've visited
- for something like advertising luxury cars, very few positive examples (people don’t buy them online)
- There is no correlation between ad clicks and purchases
- Better to look at – did the person eventually end up at the company’s home page at some point after seeing the ad?
- target people who the ad can actually influence (i.e. not people who already bought the product, or never will)
- but there’s no way to get data for that
- Counterfactuals – you can't both show and not show someone an ad and observe subsequent behavior. You have to either show it or not
- Ideally, build one predictive model for those who see the ad and another for those who don't (see the second sketch after this list)
- But the industry doesn’t do that – it’s all about conversion rate
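For the exploration/exploitation point, here's a minimal epsilon-greedy sketch (epsilon-greedy is just one bandit strategy – my choice here, not necessarily the speaker's); per the talk, the reward would be something like "reached the advertiser's home page later" rather than a click:

    import java.util.Random;

    public class EpsilonGreedyBandit {
        private final double epsilon;      // fraction of the time we explore
        private final double[] rewardSum;  // total reward earned per ad
        private final int[] shows;         // times each ad was shown
        private final Random rng = new Random();

        public EpsilonGreedyBandit(int numAds, double epsilon) {
            this.epsilon = epsilon;
            this.rewardSum = new double[numAds];
            this.shows = new int[numAds];
        }

        // Usually show the best-performing ad (exploit); occasionally
        // show a random one to keep learning (explore)
        public int chooseAd() {
            if (rng.nextDouble() < epsilon) {
                return rng.nextInt(shows.length);
            }
            int best = 0;
            for (int i = 1; i < shows.length; i++) {
                if (meanReward(i) > meanReward(best)) best = i;
            }
            return best;
        }

        // Record the outcome: e.g. 1.0 if the user later reached the
        // advertiser's home page, 0.0 otherwise
        public void recordReward(int ad, double reward) {
            shows[ad]++;
            rewardSum[ad] += reward;
        }

        private double meanReward(int ad) {
            return shows[ad] == 0 ? 0.0 : rewardSum[ad] / shows[ad];
        }
    }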
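And a sketch of the two-model idea: score a user by the difference between the two models' predictions, so you target people the ad can actually influence. How to train the two models is left open (that's exactly the counterfactual problem above); ToDoubleFunction just stands in for any trained model:

    import java.util.function.ToDoubleFunction;

    public class UpliftScore {
        // pIfShown / pIfNotShown: models trained on users who did / didn't
        // see the ad, each mapping a feature vector to P(reaches home page)
        static double uplift(ToDoubleFunction<double[]> pIfShown,
                             ToDoubleFunction<double[]> pIfNotShown,
                             double[] user) {
            // High uplift = showing the ad actually changes behavior;
            // near zero = they'd buy anyway, or never will
            return pIfShown.applyAsDouble(user) - pIfNotShown.applyAsDouble(user);
        }
    }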
Advertising fraud
- Malware on sites generating http requests
- Very difficult for ad auctions systems to detect
- Detect it by looking at traffic between sites. For example, the malware site womenshealthbase generates massive traffic to lots of other sites that have nothing to do with women's health
- they make money when that traffic hits a site with a real ad auction system: bid prices go up because of the extra traffic, which drives up ad revenue on womenshealthbase
- Auction systems now put visitors from these sites in a penalty box, until they start displaying normal behavior again
What’s new with Apache Mahout?
- Amazon's "customers who bought this item also bought…" – the best-known example of what Mahout does
- Mahout implemented most of the algorithms in the Netflix recommendation contest
- In general, finds similarities between different groupings of things (clustering)
- Key features: classification, clustering, collaborative filtering
- Automatically tags questions on Stack Overflow
Uses
- recommend friends, products, etc
- classify content into groups
- find similar content
- find patterns in behavior
- etc – general solution to machine learning problems
[I'm leaving out most of the details about performance improvements and the roadmap for upcoming refinements – below are other interesting points]
- Often used with Hadoop, but it’s not necessary. Typically included in Hadoop distributions
- Streaming k-means – given centroids (points at the center of a cluster), determine which cluster each remaining point belongs in (see the sketch after this list)
- References: Mahout in Action (a bit out of date), Taming Text, and http://mahout.apache.org
- Topic modeling – he wasn't sure what the full feature set is; he's pretty sure it doesn't generate topic labels for you
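To make the streaming k-means bullet concrete, here's the assignment step in plain Java (a sketch of the idea, not Mahout's actual API): given fixed centroids, each point joins the cluster of its nearest centroid.

    public class CentroidAssignment {
        // Returns, for each point, the index of its nearest centroid
        static int[] assign(double[][] points, double[][] centroids) {
            int[] cluster = new int[points.length];
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double d = squaredDistance(points[p], centroids[c]);
                    if (d < best) {
                        best = d;
                        cluster[p] = c;
                    }
                }
            }
            return cluster;
        }

        static double squaredDistance(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return s;
        }
    }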