7.29 Collective Intelligence in Realtime

Collective Intelligence is more widely employed than most Internet users realize.

Netflix lets people choose movies to be sent to their homes, and makes recommendations based on what customers have previously rented. Google analyzes the link structure of the web, employing complex algorithms to produce its PageRank. Amazon analyzes previous purchases, and emails customers with new titles of interest. The Hollywood Stock Exchange lets you buy and sell stocks at a price accurately set by trading behavior. Digg facilitates the sharing of Internet content by the collective votes of its users. Del.icio.us is a successful social bookmarking service that allows you to tag, save, manage and share web pages from a centralized source. But these are only the most obvious examples. Behind the scenes, complex algorithms are constantly being employed to:

1. Detect patterns of fraud in credit card transactions by neural networks and inductive logic.
2. Identify intruders in military installations by automatically analyzing video footage.
3. Group customer demands in product design and advertising.
4. Predict demand in supply chains and so minimize inventories.
5. Pinpoint opportunities in stock markets worldwide.
6. Minimize threats by analyzing the increasing data that government agencies hold on individuals.

How does this help the eretailer? In today's increasingly competitive environment it's imperative that companies:

1. Identify their better customers: those worth courting with special offers and products.
2. Ensure their goods and services are priced appropriately.
3. Discover and focus on their more profitable lines.
4. Anticipate customer reactions and have support staff properly prepared.
5. Make shopping a pleasant experience, treating each customer as a valued individual.
6. Keep abreast of their competitors — automatically, without tedious manual searches.

Implementation is highly technical, and this page only provides an introduction to the theory, approaches and programming of specific instances.

Theory

What is collective intelligence, and does it really work?

A broad, straightforward definition: 'Collective intelligence is any intelligence that arises from, or is a capacity or characteristic of, groups and other collective living systems'. Tom Atlee and George Pór (2007) defined intelligence as the ability to interact successfully with one's world, especially in the face of challenge or change. Human intelligence involves gathering, formulating, modifying, and applying effective knowledge, often in the form of ideas, images, sensations, patterns of response and sense-making: a process we refer to with words like learning, problem solving, planning, envisioning, intuition, understanding and creativity. Anyone trying to create effective groups, organizations, institutions, healthy communities and sustainable societies soon discovers that individual intelligence alone is insufficient for their success.

That groups are better at prediction is shown by focus groups and the stock market. Neither is free of bias or sudden changes of heart, but both give a better snapshot of political and business sentiment than the single pundit — and explain why it's so difficult to 'beat the market'.

MIT's Center for Collective Intelligence (CCI) is building systems to solve complex problems like climate change, cancer treatment, and IQ assessment, where no one person or group can be conversant with all of the issues. CCI researchers are exploring collective prediction, building on popular Internet sites where people can buy and sell predictions about the outcome of elections, sporting events, etc. Such web sites, based on the collective wisdom of their users, have proven remarkably accurate.

Though experience has been mixed, many institutions and businesses are pouring considerable funds into this field, which speaks for its potential value. Innocentive, which consults with 160,000 scientists and engineers, offering large cash prizes for innovative solutions, claims that as of 2006, 30% of the problems posted have been solved. Sermo is an association of 70,000 US physicians enabling members to post questions to fellow experts, and Collective Intellect summarizes viewpoints in blogs and other web pages for applications in finance and marketing.

Example One: Video Hire: Making Recommendations

If someone's been shopping at your estore for a while, it's not difficult to build a customer profile. The classic case is books: Amazon notes previous interests and emails the customer with new titles of possible interest. The methodology is proprietary, but Amazon clearly records customer details, email address, credit-worthiness, patterns of spending and fields of interest, sending an email when the field of interest under which it classifies a product matches that of a valued customer. A snippet of code (an SQL request) searches the customer and product databases for a match, extracts the name, email and recommendation, calls a script to wrap them into an email, and off it goes. Any programmer can tell you more.
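As an illustration, here is a minimal sketch of that matching-and-mailing step in Python. The database schema (customers and products tables sharing an interest field), the local mail relay and all names are hypothetical: any real store's setup will differ.

    # Match new products to customer interests and email recommendations.
    # Schema and addresses are hypothetical placeholders.
    import sqlite3
    import smtplib
    from email.message import EmailMessage

    conn = sqlite3.connect("store.db")
    matches = conn.execute("""
        SELECT c.name, c.email, p.title
        FROM customers AS c
        JOIN products  AS p ON p.interest = c.interest
        WHERE p.is_new = 1
    """).fetchall()

    for name, address, title in matches:
        msg = EmailMessage()
        msg["From"] = "recommendations@example.com"
        msg["To"] = address
        msg["Subject"] = f"A new title you may enjoy: {title}"
        msg.set_content(f"Dear {name},\n\nWe thought {title} might interest you.")
        with smtplib.SMTP("localhost") as smtp:   # assumes a local mail relay
            smtp.send_message(msg)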

Less simple are partial matches. Your video store customer may like action movies, but he doesn't like them all. He wants the best, especially those that have been well reviewed by critics or by sites whose remarks he generally agrees with. To make recommendations in this case you'll need (1) data for comparisons and (2) some statistics.

Data Sources

There's a surprising amount of data readily available if you know where to look. There are your own server logs, whose data you can download as comma-delimited text files. AdSense and most traffic analytics programs also let you download data from their sites, often in Excel-compatible or text form. Much more extensive is the data that can be accessed from social bookmarking services like Del.icio.us and from specialist sites. To do so, however, you'll need an API (application programming interface), listed in directories such as ProgrammableWeb, and you'll have to write scripts to use it: access the data, extract it, organize and analyze it. In other words, computer programming.
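To give the flavor, here is a generic Python sketch of pulling records from such a service. The endpoint, parameters and JSON layout are hypothetical stand-ins; the provider's API documentation gives the real ones.

    # Fetch tagged bookmarks from a (hypothetical) bookmarking API.
    import json
    import urllib.parse
    import urllib.request

    API_KEY = "your-developer-key"        # issued when you sign up
    params = urllib.parse.urlencode({"tag": "action-movies", "key": API_KEY})
    url = f"https://api.example.com/v1/posts?{params}"

    with urllib.request.urlopen(url) as response:
        data = json.load(response)

    # Organize the raw records into something analyzable
    for post in data.get("items", []):
        print(post.get("url"), post.get("tags"))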

Statistics

Anyone who pursues research in the life and natural sciences (and increasingly a good many other careers) will be familiar with the various ways of sorting and analyzing data: correlation, multivariate analysis, time series, etc. The formulas involved are daunting, but their derivation is the concern of statisticians. What the researcher does require is:

1. A broad understanding of statistical approaches,
2. The background to know which statistical methods apply to which sorts of problem and which sorts of data,
3. How to acquire and use the various statistical packages now available for computer use,
4. How to interpret the results sensibly.

If the data arrives in real time (i.e. is obtained from server logs as customer details come in, or from third-party social sites), then the researcher will also need programming skills: to extract that information from web servers, obtain or write computer code for statistical analysis (most methods are available in computer language libraries), feed the data into the code, and use the results properly.

If, like most eretailers, your math ended in high school, you'll need to do the following:

1. Clearly formulate what you want to do with your site,
2a. Acquire some basic statistical knowledge: college courses, Internet sites or the local library, or
2b. Read something like Toby Segaran's Programming Collective Intelligence (which provides a simple approach to the statistical concepts, data sources and much of the computer code in Python), or
2c. Consult the better marketing companies, which will have in-house statistical skills, and
3. Do a cost-benefit study of the work involved. Programming won't be cheap, and you do need to quantify the competitive advantages. 'Let's just try it and see' is feasible only for companies with a large R&D budget.

Assessing Recommendations

You'll start by asking your customers to rate the movies they rent: say awful (-2), poor (-1), OK (0), good (1) or fantastic (2). In time you'll want to compare ratings between your individual customers, but at first you'll be reliant on the recommendations on third-party sites, i.e. on Yahoo My Movies, Criticker, WhattoRent, Clerkdogs and the like.

A. Linking Customer to Third-Party Recommendations

Your next step is to assess the recommendations on third-party sites. You will:

1. Select the titles of your most popular video hires,
2. Collect the scores your regular customers have given these titles on your site,
3. Find these titles on third-party sites and convert their recommendations into your sort of score [i.e. into awful (-2), poor (-1), OK (0), good (1) or fantastic (2)],
4. Store the details in a database,
5. By comparing scores, derive similarity weightings, customer to customer, film by film, for customers on those sites and yours (one such weighting is sketched below).
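One common similarity weighting (the approach taken in Segaran's book, among others) is the Pearson correlation over the films two raters have both scored. A minimal sketch, using the -2 to +2 scale above and invented film titles:

    # Pearson correlation between two raters over their shared films.
    # 1.0 means identical taste, -1.0 opposite taste, 0.0 no relation.
    from math import sqrt

    def similarity(ratings_a, ratings_b):
        """ratings_a, ratings_b: dicts mapping film title to score."""
        shared = [t for t in ratings_a if t in ratings_b]
        n = len(shared)
        if n == 0:
            return 0.0
        sum_a = sum(ratings_a[t] for t in shared)
        sum_b = sum(ratings_b[t] for t in shared)
        sum_a2 = sum(ratings_a[t] ** 2 for t in shared)
        sum_b2 = sum(ratings_b[t] ** 2 for t in shared)
        sum_ab = sum(ratings_a[t] * ratings_b[t] for t in shared)
        num = sum_ab - sum_a * sum_b / n
        den = sqrt((sum_a2 - sum_a ** 2 / n) * (sum_b2 - sum_b ** 2 / n))
        return num / den if den else 0.0

    betty     = {"Heat": 2, "Speed": 1, "Taken": -1}
    criticker = {"Heat": 2, "Speed": 0, "Taken": -2}   # converted scores
    print(similarity(betty, criticker))                # roughly 0.98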

B. Rating the Recommendations

1. Now you could simply look at the similarity scores and match customers across sites. If Betty Lewis, your best customer, rates her video titles pretty much as the Criticker site does, Criticker's ratings for new films would apply to Betty too. You could use their hot ratings to select other films for her, and email her with titles as they become available.

2. But that's a little unreliable, as everyone has odd quirks. You need to 'average' the similarity weightings by considering several films in the same category and the ratings on similar sites. That 'averaging' will employ some clustering or nearest neighbor approach, and you'll need to use some filtering device to reduce the computational effort involved in comparing everything with everything else.
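A minimal sketch of that weighted 'averaging', reusing the similarity function from the previous example: a customer's predicted score for an unseen film is the similarity-weighted average of the comparison sites' scores. The data layout is again hypothetical.

    # Predict a customer's score for a film from similar raters' scores.
    # `similarity` is the Pearson function sketched earlier.
    def predict_score(customer_ratings, film, site_ratings):
        """site_ratings: dict mapping site name to its film-score dict."""
        weighted_sum = total_weight = 0.0
        for site, ratings in site_ratings.items():
            if film not in ratings:
                continue
            w = similarity(customer_ratings, ratings)
            if w <= 0:            # ignore dissimilar or uncorrelated raters
                continue
            weighted_sum += w * ratings[film]
            total_weight += w
        return weighted_sum / total_weight if total_weight else None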

Details

No code or statistical reasoning is provided here, but readers can be assured that:

1. The statistical methods are well known: they are explained in books and on Internet sites, with the necessary formulas given.
2. Most of the code, for extracting the information and deploying statistical assessments, is provided in Python in Toby Segaran's book (see below).
3. Code libraries exist for handling data and statistical analysis in most computing languages: you don't have to reinvent the wheel by writing your own.
4. A halfway house exists. Before plunging into programming you can extract data and experiment using various statistical packages, some free.
5. Data can be extracted automatically from third-party sites (and then processed entirely by computer) by using a free API (application programming interface), e.g. that supplied by Netflix.

Example Two: Travel Planning

In this fictitious example, you're a large travel company with dozens of tours starting every day in different parts of the world. Every day you've got to get the participants to the rendezvous points, arranging their flights in the cheapest and most convenient way for them. Yes, you can spend hours on the phone to carriers and booking clerks — and will probably have to anyway, since there's always someone who messes up the best-laid plans — but ideally, you'll want to automate the process as much as possible.

You need an optimizing algorithm, and will probably start by expressing all possibilities as some cost function, this being the air fare plus some monetary weighting for travel time, time spent waiting for connections, and the inconvenience of early morning flights (a minimal sketch of such a function follows the list below). Then the methods open to you are:

1. Exhaustive searching. You'll feed all the possible itineraries into a database and devise a program that calculates the costs in every case. Each possibility for every member of the tour party will have to be compared with every possibility for every other member, and the set of bookings chosen that minimizes the total party cost. With a large party, the approach will involve hundreds or thousands of iterations, and probably leave some hard cases (two early starts, a 6-hour wait for someone else, etc.), but the principle is easily grasped. Practicable, but not efficient with computer time.
2. Hill climbing. Here the program takes a random schedule and looks at neighboring itineraries to find one that is cheaper. That schedule is treated the same way, and so on, until no cheaper schedules are found. To avoid being caught in a local optimum unrepresentative of the whole, you'll need to repeat the exercise several times, starting at different, randomly selected points. Not so easily envisaged, but more efficient.
3. Simulated annealing. The process starts with a complete set of itineraries chosen at random. Then one member's itinerary is changed and the costs compared. An improvement is always adopted, but early in the run the algorithm will also sometimes accept a worse solution, with a probability that falls as the run proceeds (the 'cooling' of the annealing metaphor): this is what lets it escape local optima. Eventually only better solutions are accepted, and the best is retained. Similar to 2, more iterative, but avoids getting stuck in a local optimum.
4. Genetic algorithms. You start by creating a set of random itineraries called the current population, which is then costed. You then create another population, called the next generation, and add the top solutions of the current population to it. Next, you modify members of the next generation by one of two methods. The first is called mutation: you make a small random change to one of the members. The second is called breeding (or crossover): you take two of the best solutions and combine them in some way. You then repeat the process, creating population after population, for a fixed number of iterations or until no more improvement is obtained. The best solution (the cheapest set of itineraries) is the one chosen.
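Before any of these methods can run, the cost function itself must be written down. A minimal sketch, with illustrative field names and penalty rates that a real company would tune to its own priorities:

    # Fare plus monetary penalties for time in the air, time waiting,
    # and early departures. All rates are illustrative assumptions.
    def itinerary_cost(itin):
        cost = itin["fare"]
        cost += 0.5 * itin["flight_minutes"]   # e.g. $0.50 per minute flown
        cost += 1.0 * itin["wait_minutes"]     # waiting weighted more heavily
        if itin["departure_hour"] < 8:         # surcharge for early starts
            cost += 50
        return cost

    # The quantity to minimize is the total over the whole party:
    def party_cost(itineraries):
        return sum(itinerary_cost(i) for i in itineraries)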

Code for all these algorithms can be found in programming libraries and tutorials.
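As one illustration, here is a compact simulated-annealing sketch (method 3). It assumes a party_cost function like the one above and a random_neighbor function that alters one member's itinerary; both are placeholders for your own code.

    # Simulated annealing over schedules (lists of member itineraries).
    import math
    import random

    def anneal(schedule, party_cost, random_neighbor,
               temp=10000.0, cooling=0.95, floor=0.1):
        current_cost = party_cost(schedule)
        while temp > floor:
            candidate = random_neighbor(schedule)
            candidate_cost = party_cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept worse solutions with a
            # probability that shrinks as the temperature falls.
            if delta < 0 or random.random() < math.exp(-delta / temp):
                schedule, current_cost = candidate, candidate_cost
            temp *= cooling
        return schedule, current_cost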

How do you get the airfares and schedules in the first place?

If you have an online booking service, then you'll have access to this flight information.

If not, you can use a vertical search engine like Kayak, accessed through an API (application programming interface: you sign up for a developer key with Kayak). Then you'll write some code, probably in Python (other languages are possible, but Python has libraries of functions already written), that extracts the information you need to provide realtime solutions.
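A rough sketch of what such code might look like. The URL, parameters and response fields below are hypothetical stand-ins, not Kayak's actual interface; the developer documentation supplies the real ones.

    # Query a (hypothetical) flight-search API and reshape the results
    # into the itinerary records used by the cost function above.
    import json
    import urllib.parse
    import urllib.request

    def search_flights(origin, dest, date, api_key):
        params = urllib.parse.urlencode({"origin": origin, "dest": dest,
                                         "date": date, "key": api_key})
        url = f"https://api.example.com/flights?{params}"
        with urllib.request.urlopen(url) as response:
            data = json.load(response)
        return [{"fare": f["price"],
                 "flight_minutes": f["duration"],
                 "wait_minutes": f.get("layover", 0),
                 "departure_hour": f["depart_hour"]}
                for f in data.get("flights", [])]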

Which Statistical Approach?

Statistical approaches are not exclusive, and you'll often find yourself using several methods to unlock the significant details of the information you've collected. At least to start with, however, it may help to view problems in this way:

1. Have masses of data? Don't need to know the significant factors? Use neural networks.
2. Have less data? Important to know the relevant factors? Use regression analysis.
3. Need to distinguish groupings in a mass of data? Use cluster analysis.
4. Need to find the nearest representative? Use kNN analysis.
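As a toy illustration of the last approach, here is a minimal k-nearest-neighbor sketch: it finds the k customers whose rating profiles lie closest (in Euclidean distance) to a target profile. All names and numbers are invented.

    # Find the k profiles nearest the target, by Euclidean distance
    # over scores for a fixed list of films.
    from math import sqrt

    def knn(target, profiles, k=3):
        """target: list of scores; profiles: dict of name to score list."""
        def distance(a, b):
            return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        ranked = sorted(profiles.items(), key=lambda kv: distance(target, kv[1]))
        return ranked[:k]

    profiles = {"Betty": [2, 1, -1], "Carl": [-2, 0, 1], "Dana": [2, 2, 0]}
    print(knn([2, 1, 0], profiles, k=2))   # the two most similar customers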

Resources

1. Statistics on the Web by Clay Helberg. Clavius Web. A useful listing of sites.
2. Pitfalls of Data Analysis by Clay Helberg. Clavius Web. June 1995.
3. Information Theory, Inference, and Learning Algorithms by David MacKay. September 2003.
4. Free Statistics. A good listing of open source and freeware statistics packages.
5. Statistical Analysis Software Survey. LionHRTPub. Useful tables if you're familiar with statistics packages.
6. Python Resources in One Place. Codes for many applications.
7. Java Programming Resources. Tutorials, compiler and resources.
8. CPAN. Comprehensive Perl Archive Network.
9. Innocentive. Offers a marketplace where 160,000 engineers and scientists cooperate to solve problems.
10. YourEncore. Offers a network of retired and veteran scientists and engineers with proven experience.
11. Statistics. Wikibooks. Extensive sets of articles, not all complete.

Questions

1. What is meant by realtime systems? How are programming expenses justified?
2. Give three examples of realtime systems, and their commercial advantages.
3. You've been asked to design the logical system of a realtime video hire company. Describe the steps you would take.
4. You're presenting a consultant's plan for a realtime travel company startup. What approaches are possible, and where would the company get its realtime data from?

Sources and Further Reading

1. Algorithms of the Intelligent Web by Haralambos Marmanis and Dmitry Babenko. Manning Publications. June 2009. Specimen code in Java.
2. Programming Collective Intelligence: Building Smart Web 2.0 Applications by Toby Segaran. O'Reilly. August 2007. Includes specimen code in Python.
3. You're Leaving a Digital Trail. What About Privacy? by John Markoff. NYT. November 2008. Article suggesting the numerous applications of CI.
4. Putting heads (and computers) together to solve global problems by Anne Trafton. MIT. January 2009.
5. Collective intelligence. Wikipedia. With examples and a short listing of sites.
6. Handbook of Collective Intelligence. MIT. Detailed, Wikipedia-like entry on MIT site, with good theory and examples.
7. Blog of Collective Intelligence. George Pór's blog: many useful posts by an expert.