Data IO 2013 conference – my notes
These are my notes from today's Data IO conference.
Next Generation Search with Lucene and Solr 4
Lucene 4
- near-real-time indexing (Twitter uses it to index 500 million new tweets/day)
- can plug in your own scoring model
- flexible index formats
- much improved memory use, faster regexes, etc.
- new autocomplete suggester
Solr (Lucene server – managed by the same team as Lucene)
- pivot faceting – anticipating the next question: if someone chooses the red shirt, do we have large in stock? (see the sketch after this list)
- improved geo-spatial search (all Mexican restaurants within 5 blocks, plus function queries to rank them)
- distributed/sharded indexing and search
- Solr as a NoSQL data store
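Here's a minimal sketch of what the pivot-facet question above could look like through SolrJ (Solr's Java client, as of the Solr 4 era); the core URL and the color/size field names are invented for illustration:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.PivotField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PivotFacetExample {
        public static void main(String[] args) throws SolrServerException {
            // Hypothetical Solr 4 core holding shirt inventory
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/products");

            SolrQuery query = new SolrQuery("shirt");
            query.setFacet(true);
            // For each color, break the counts down by size:
            // "if someone chooses the red shirt, do we have large in stock?"
            query.add("facet.pivot", "color,size");

            QueryResponse response = solr.query(query);
            for (PivotField color : response.getFacetPivot().get("color,size")) {
                System.out.println(color.getValue() + ": " + color.getCount());
                for (PivotField size : color.getPivot()) {
                    System.out.println("  " + size.getValue() + ": " + size.getCount());
                }
            }
        }
    }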
Uses
- recommendation engine (LinkedIn uses Lucene for people recommendations). Recommend content to people who exhibit certain behaviors
- avoid flight delays – facet on flights out of origin airports, pivot to destination airports (O'Hare to Newark); with origin, destination, carrier, flight, and delay times you can look at trends over time. Solr has a stats package that gives you averages, max, min, etc.
- for local search, how to show only shops that are open? (Yelp also uses Lucene).
You added Zookeeper to your stack, now what?
Old way of system management: active and backup servers, frantically switch to backup when active fails
Common challenges with big distributed system
- Outages
- Coordination
- Operational complexity
A common deficiency: sequential consistency (handling everything in the “right” order, when data is coming from multiple places)
- Zookeeper is a distributed, consistent data store – strictly ordered access
- Can keep running as long as only a minority of member nodes are lost (usually want to run 3 or 5 nodes)
- all data stored in memory (50,000 ops/sec)
- optimized for read performance (not write); also not optimized for giant pieces of data
- it’s a coordination service
- A leader node is elected by all the members. The leader manages writes (it proposes a write to the followers; once a majority acknowledge it, it is committed and written)
- nodes can have data, and can have child nodes
- has “ephemeral nodes” – created when a client connects, destroyed when client disconnects (these do not ever have child nodes)
- watches: clients can be kept informed about data state changes (a watch just tells you the data has changed, not what it changed to – you have to read it again if you want the current value; see the sketch after this list)
ZooKeeper is the open-source equivalent of Google's Chubby
- good for discovery services (like DNS)
- Use cases: Storm and HBase, Redis – http://www.slideshare.net/ryanlecompte/handling-redis-failover-with-zookeeper
- Distributed locking
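To make the ephemeral-node and watch ideas above concrete, here's a sketch using the stock ZooKeeper Java client – a service registering itself, and a watch on that registration. The paths and data are hypothetical, and it assumes the parent path already exists:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ServiceRegistration {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, null);

            // Ephemeral node: exists only while this client's session is alive,
            // so other clients can discover which instances are currently up
            zk.create("/services/search/instance-1",
                      "host1:8983".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);

            // Watch: fires once when the node changes, telling you *that* it
            // changed, not what it changed to - re-read (and re-watch) to see
            Watcher watcher = new Watcher() {
                public void process(WatchedEvent event) {
                    System.out.println("changed: " + event.getPath());
                }
            };
            byte[] data = zk.getData("/services/search/instance-1", watcher, null);
            System.out.println(new String(data));
        }
    }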
Beware – Zookeeper can be your single point of failure if you don’t have appropriate monitoring and fallbacks in place
Graph Database Use Cases
- nodes connected by relationships
- no tables or rows
- nodes are property containers
- Cypher is Neo4j's query language
- http://github.com/maxdemarzi/neo_graph_search
- http://maxdemarzi.com/2012/10/18/matches-are-the-new-hotness
- also used for content management, access control, insurance risk analysis, geo routing, asset management, bioinformatics
- "what drug will bind to protein X and not interact with drug Y?" (see the Cypher sketch at the end of this section)
- http://neovisualsearch.maxdemarzi.com
- performance factors: graph size (doesn't matter), query degree (how many hops – this is what matters), and graph density. An RDBMS doesn't scale well with data size; Neo4j does
- the more connected the data, the better it fits a graph db
- NoSQL – 4 categories – key value, column family, document db, graph db
- a popular combo is, e.g., MongoDB for the data and Neo4j for searching it (then hydrate the search results from MongoDB)
- optimized for graph traversal, not, e.g., aggregate analysis of all the nodes
- top reasons to use it: problems with RDBMS join performance, evolving data set, domain shape is naturally a graph, open-ended business requirements
- Gartner's 5 graphs: social, interest, intent, mobile, payment
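The drug/protein question above translates naturally into Cypher. Here's a sketch of it run through Neo4j's embedded Java API (this uses the execute() call from a later 2.x-era release; the labels, relationship types, and property names are all invented for illustration):

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Result;
    import org.neo4j.graphdb.Transaction;

    public class DrugBindingQuery {
        // Hypothetical schema: (:Drug)-[:BINDS_TO]->(:Protein) and
        // (:Drug)-[:INTERACTS_WITH]-(:Drug)
        static final String CYPHER =
            "MATCH (d:Drug)-[:BINDS_TO]->(:Protein {name: 'X'}) " +
            "WHERE NOT (d)-[:INTERACTS_WITH]-(:Drug {name: 'Y'}) " +
            "RETURN d.name AS drug";

        public static void printCandidates(GraphDatabaseService db) {
            try (Transaction tx = db.beginTx()) {
                Result result = db.execute(CYPHER);
                while (result.hasNext()) {
                    System.out.println(result.next().get("drug"));
                }
                tx.success();
            }
        }
    }

Note how the query's cost depends on how many hops it makes out from the starting protein, not on how big the rest of the graph is – the "query degree" point above.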
Parquet
I didn't take notes during this one (a drop of water from the bottom of my glass got under my Mac trackpad, and my mouse was going crazy for a while)
All the data and still not enough?
- No matter how much data you have, it’s never enough or never seems like the right type
- Predictive modeling – will someone default on a loan? Look at data for people who’ve had loans, and who defaulted and didn’t. Use data to make a predictive risk model
- IID = independent and identically distributed
Example: IBM sales force optimization
- Can we predict where the opportunities are – which companies have growing IT budgets?
- Couldn't know the most important thing – where these target companies were spending their IT budgets (not disclosed)
- Companies who are spending with us are probably representative of similar-sized companies in the same market – use the "nearest neighbor" technique (see the sketch after this list)
- Compared the model's predictions to expert salesmen's opinions; they matched, except for 15% of them, where the experts put the chances at zero. Why the difference? The model had misidentified some of the companies (there's no good way to cross-reference millions of internal customer records with independent sources)
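A toy sketch of the nearest-neighbor idea (the features and names here are invented): estimate a company's likely spend from the k most similar companies whose spend with us we already know.

    import java.util.Arrays;
    import java.util.Comparator;

    public class NearestNeighborSpend {
        // known: one feature vector per known customer, e.g. {employees,
        // revenue, ...}, normalized to comparable scales; knownSpend[i] is
        // what company i actually spends with us
        static double estimateSpend(double[][] known, double[] knownSpend,
                                    double[] target, int k) {
            Integer[] idx = new Integer[known.length];
            for (int i = 0; i < idx.length; i++) idx[i] = i;
            // Sort the known companies by distance to the target company
            Arrays.sort(idx, Comparator.comparingDouble(i -> distance(known[i], target)));
            // Average the spend of the k nearest neighbors
            double sum = 0;
            for (int i = 0; i < k; i++) sum += knownSpend[idx[i]];
            return sum / k;
        }

        static double distance(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }
    }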
Siemens – computer-aided detection of breast cancer
- patient IDs ended up predicting the odds of cancer. It turned out the ID was a proxy for location (whether the patient was at a treatment facility or a screening facility)
Display ad auctions – how do we decide who to target?
- multi-armed bandit – exploration vs. exploitation (see the first sketch after this list)
- what do we know? The URLs you've visited
- for something like advertising luxury cars, very few positive examples (people don’t buy them online)
- There is no correlation between ad clicks and purchases
- Better to look at – did the person eventually end up at the company’s home page at some point after seeing the ad?
- target people who the ad can actually influence (i.e. not people who already bought the product, or never will)
- but there’s no way to get data for that
- Counterfactuals – you can't both show and not show someone an ad and observe subsequent behavior. You have to either show it or not
- Ideally, build one predictive model for those who see the ad and another for those who don't (see the second sketch after this list)
- But the industry doesn’t do that – it’s all about conversion rate
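For the exploration/exploitation point, here's a minimal epsilon-greedy sketch (epsilon-greedy is just one bandit strategy – my choice here, not necessarily the speaker's); per the talk, the reward would be something like "reached the advertiser's home page later" rather than a click:

    import java.util.Random;

    public class EpsilonGreedyBandit {
        private final double epsilon;      // fraction of the time we explore
        private final double[] rewardSum;  // total reward earned per ad
        private final int[] shows;         // times each ad was shown
        private final Random rng = new Random();

        public EpsilonGreedyBandit(int numAds, double epsilon) {
            this.epsilon = epsilon;
            this.rewardSum = new double[numAds];
            this.shows = new int[numAds];
        }

        // Usually show the best-performing ad (exploit); occasionally
        // show a random one to keep learning (explore)
        public int chooseAd() {
            if (rng.nextDouble() < epsilon) {
                return rng.nextInt(shows.length);
            }
            int best = 0;
            for (int i = 1; i < shows.length; i++) {
                if (meanReward(i) > meanReward(best)) best = i;
            }
            return best;
        }

        // Record the outcome: e.g. 1.0 if the user later reached the
        // advertiser's home page, 0.0 otherwise
        public void recordReward(int ad, double reward) {
            shows[ad]++;
            rewardSum[ad] += reward;
        }

        private double meanReward(int ad) {
            return shows[ad] == 0 ? 0.0 : rewardSum[ad] / shows[ad];
        }
    }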
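And a sketch of the two-model idea: score a user by the difference between the two models' predictions, so you target people the ad can actually influence. How to train the two models is left open (that's exactly the counterfactual problem above); ToDoubleFunction just stands in for any trained model:

    import java.util.function.ToDoubleFunction;

    public class UpliftScore {
        // pIfShown / pIfNotShown: models trained on users who did / didn't
        // see the ad, each mapping a feature vector to P(reaches home page)
        static double uplift(ToDoubleFunction<double[]> pIfShown,
                             ToDoubleFunction<double[]> pIfNotShown,
                             double[] user) {
            // High uplift = showing the ad actually changes behavior;
            // near zero = they'd buy anyway, or never will
            return pIfShown.applyAsDouble(user) - pIfNotShown.applyAsDouble(user);
        }
    }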
Advertising fraud
- Malware on sites generating http requests
- Very difficult for ad auctions systems to detect
- Detect it by looking at traffic between sites. For example, the malware site womenshealthbase generates massive traffic to lots of other sites that have nothing to do with women's health
- they make money when that traffic hits a site with a real ad auction system: bid prices go up because of the extra traffic, which drives up ad revenue on womenshealthbase
- Auction systems now put visitors from these sites in a penalty box, until they start displaying normal behavior again
What’s new with Apache Mahout?
- Amazon's "customers who bought this item also bought…" – the best-known example of what Mahout does
- Mahout implemented most of the algorithms in the Netflix recommendation contest
- In general, finds similarities between different groupings of things (clustering)
- Key features: classification, clustering, collaborative filtering
- Automatically tags questions on Stack Overflow
Uses
- recommend friends, products, etc
- classify content into groups
- find similar content
- find patterns in behavior
- etc – general solution to machine learning problems
[I'm leaving out most of the details about performance improvements and the roadmap for upcoming refinements – below are other interesting points]
- Often used with Hadoop, but it’s not necessary. Typically included in Hadoop distributions
- Streaming k-means – given centroids (points at the center of a cluster), determine which cluster each remaining point belongs in (see the sketch after this list)
- References: Mahout in Action (a bit out of date), Taming Text, and http://mahout.apache.org
- Topic modeling – he wasn't sure what the full feature set is; he's pretty sure it doesn't generate topic labels for you
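To make the streaming k-means bullet concrete, here's the assignment step in plain Java (a sketch of the idea, not Mahout's actual API): given fixed centroids, each point joins the cluster of its nearest centroid.

    public class CentroidAssignment {
        // Returns, for each point, the index of its nearest centroid
        static int[] assign(double[][] points, double[][] centroids) {
            int[] cluster = new int[points.length];
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double d = squaredDistance(points[p], centroids[c]);
                    if (d < best) {
                        best = d;
                        cluster[p] = c;
                    }
                }
            }
            return cluster;
        }

        static double squaredDistance(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return s;
        }
    }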