Divolte Collector is a solution for collecting clickstream data from website users, storing that data in Hadoop as Avro files, and pushing individual click events as messages onto a Kafka topic. This is useful in scenarios where you need to perform offline and/or near real-time processing and analysis of user behavior in the form of click event data, such as when building recommender systems, personalizing websites, or doing plain analytics on user behavior on a website.
Several solutions to this problem already exist: hosted solutions like Google Analytics and Omniture, or open source systems such as Piwik. Here's what makes Divolte Collector different:
The remainder of this chapter introduces the concept of clickstream collection and the way Divolte Collector solves this problem. If you'd rather get hands-on immediately, jump straight to the Getting Started guide.
Click events are the primary data source in web optimization solutions. These events are often captured through log (file) processing and pushed to systems for both batch and near real-time processing. Capturing click event data from web visitors in order to analyze and act upon user behavior is thus not a new problem; over time it has been solved in different ways, each with its own advantages and drawbacks. First, let's look at a typical architecture for building web optimization solutions:
In the above diagram, the part marked with the red outline is the problem that Divolte Collector focuses on: collecting, parsing, storing, and streaming click events.
The simplest solution to this problem (and also the one that most of the early Hadoop use cases were based on) is simply to collect the web server log files and push them onto HDFS for further processing:
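To make this file-shipping approach concrete, here is a minimal Java sketch (not part of Divolte Collector) that copies a rotated web server log onto HDFS using the Hadoop FileSystem API; the file paths and name node address are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LogShipper {
        public static void main(String[] args) throws Exception {
            // Hypothetical locations: a rotated access log on local disk
            // and a landing directory on the cluster.
            Path localLog = new Path("/var/log/nginx/access.log.1");
            Path hdfsDir = new Path("hdfs://namenode:8020/data/raw-logs/");

            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(hdfsDir.toUri(), conf)) {
                // Ship the complete file; parsing is left to downstream batch jobs.
                fs.copyFromLocalFile(localLog, hdfsDir);
            }
        }
    }

Note that the copy can only happen after the log file rotates, which already hints at the latency problem with this approach.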
This method has some drawbacks, however:
To overcome some of these problems, the next generation of log event collection setups streams individual log events instead of moving entire log files around. This can be achieved using a combination of tools like syslog, syslog-ng, and Apache Flume.
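As a rough illustration of the difference (a toy sketch only, not how syslog or Flume actually work), the following Java snippet tails a log file and forwards each new line as an individual UDP datagram to a collector; the host name, port, and file path are assumptions:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    public class LogTailer {
        public static void main(String[] args) throws Exception {
            InetAddress collector = InetAddress.getByName("collector.example.com");
            try (DatagramSocket socket = new DatagramSocket();
                 BufferedReader log = new BufferedReader(
                         new FileReader("/var/log/nginx/access.log"))) {
                while (true) {
                    String line = log.readLine();
                    if (line == null) {
                        Thread.sleep(100); // wait for the web server to append more
                        continue;
                    }
                    byte[] payload = line.getBytes(StandardCharsets.UTF_8);
                    // One datagram per event, instead of one file per hour or day.
                    socket.send(new DatagramPacket(payload, payload.length, collector, 5140));
                }
            }
        }
    }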
While this adds stream processing to the mix, some drawbacks remain: there is still no schema, you still need a parser, and the data still suffers from bot and crawler traffic. That's why many of the more modern solutions take a different approach. Instead of using the server-side log event as the source of event data, an event is generated on the client side, which actively calls a separate back-end service to handle the event logging. This method is often called tagging (or Web Bug, if you're Wikipedia). In this scenario, each web page contains a special piece of JavaScript code that calls a back-end service to generate the actual event:
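To give a feel for what the back-end side of tagging amounts to, here is a deliberately minimal Java sketch of such a service (Divolte Collector's actual server is far more complete); it serves a transparent pixel and logs whatever the page tag put in the query string, and the port, path, and behavior shown here are illustrative only:

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.util.Base64;

    public class TagEndpoint {
        // Smallest transparent GIF, base64-encoded.
        private static final byte[] PIXEL = Base64.getDecoder()
                .decode("R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7");

        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8290), 0);
            server.createContext("/event", exchange -> {
                // The page tag encodes event data in the query string.
                String query = exchange.getRequestURI().getQuery();
                System.out.println("event: " + query);
                exchange.getResponseHeaders().set("Content-Type", "image/gif");
                exchange.sendResponseHeaders(200, PIXEL.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(PIXEL);
                }
            });
            server.start();
        }
    }

Because the event is generated by the browser rather than reconstructed from server logs, the service can attach structured, client-side information to it at the moment it happens.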
This approach has several benefits:
This is the way Divolte Collector works. In the above diagram, the components in the red outlines are the core parts of Divolte Collector: a JavaScript tag that goes in each web page and a collection server that enriches events and writes them as Avro records to HDFS files and Kafka messages.
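To give an idea of what a downstream consumer of these events can look like, here is a hedged Java sketch that reads messages from Kafka and decodes each one as a single binary-encoded Avro record; the broker address, topic name, schema file, and field name are assumptions, so check your own schema and configuration for the actual values:

    import java.io.File;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ClickEventConsumer {
        public static void main(String[] args) throws Exception {
            // Assumed: the Avro schema the collector was configured to write with.
            Schema schema = new Schema.Parser().parse(new File("EventRecord.avsc"));
            GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder
            props.put("group.id", "click-event-printer");       // placeholder
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("divolte")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, byte[]> record : records) {
                        GenericRecord event = reader.read(null,
                                DecoderFactory.get().binaryDecoder(record.value(), null));
                        System.out.println(event.get("location")); // field depends on your schema
                    }
                }
            }
        }
    }

Because every message is a schema-conforming Avro record, consumers like this one get typed fields directly instead of regex-parsing log lines.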
In addition to collecting click events, Divolte Collector provides a number of welcome features:
Divolte Collector is written in pure Java and runs on any OS that supports the latest JVM. For best results, we recommend running on Linux.
Note: Divolte Collector can be configured to send data to HDFS, to Kafka, or to both; using both is not required. When running locally for testing, it can also write to the local file system.