Mapping in Divolte Collector is the definition that determines how incoming events are translated into Avro records conforming to a schema. This definition is constructed using a Groovy-based DSL (Domain-Specific Language).
Most clickstream data collection services or solutions use a canonical data model that is specific to click events and related properties. Things such as location, referrer, remote IP address, path, etc. are all properties of a click event that come to mind. While Divolte Collector exposes all of these fields just as well, it is our vision that this is not enough to make it easy to build online and near real-time data driven products within specific domains and environments. For example, when working on a system for product recommendation, the notion of a URL or path for a specific page is completely in the wrong domain; what you care about in this case is likely a product ID and probably a type of interaction (e.g. product page view, large product photo view, add to basket, etc.). It is usually possible to extract these pieces of information from the clickstream representation, which means custom parsers have to be created to parse this information out of URLs, custom events from JavaScript and other sources. This means that whenever you work with the clickstream data, you have to run these custom parsers initially in order to get meaninful, domain specific information from the data. When building real-time systems, it normally means that this parser has to run in multiple locations: as part of the off line processing jobs and as part of the real-time processing.
With Divolte Collector, instead of writing parsers and working with the raw clickstream event data in your processing, you define mappings that allows Divolte Collector to do all the required parsing on the fly as events come in and subsequently produce structured records with a schema to use in further processing. This means that all data that comes in can already have the relevant domain specific fields populated. Whenever the need for a new extracted piece of information arises, you can update the mapping to include the new field in the newly produced data. The older data that lacks newly additional fields can co-exist with newer data that does have the additional fields through a process called schema evolution. This is supported by Avro’s ability to read data with a different schema from the one that the data was written with. (This is implemented at read-time using a process called schema resolution.)
The goal of the mapping is to get rid of log file or URL parsing on collected data after it is published. The event stream from Divolte Collector should have all the domain specific fields to support you use cases directly.
Before you dive in to creating your own mappings, it is important to understand a little bit about how a mapping is actually performed. The most notable thing to keep in mind is that a mapping script that you provide is not evaluated at request time for each event. Instead a mapping is evaluated only once during startup and declares how the actual mapping should take place.
Schema evolution is the process of changing the schema over time as requirements change. For example when a new feature is added to your website, you add additional fields to the schema that contain specific information about user interactions with this new feature. In this scenario, you would update the schema to have these additional fields, update the mapping and then run Divolte Collector with the new schema and mapping. This means that there will be a difference between data that was written prior to the update and data that is written after the update. Also, it means that after the update, there can still be consumers of the data (from HDFS or Kafka) that still use the old schema. In order to make sure that this isn’t a problem, the readers with the old schema need to be able to read data written with the new schema and readers with the new schema should also still work on data written with the old schema.
Luckily, Avro supports both of these cases. When reading newer data with an older schema, the fields that are not present in the old schema are simply ignored by the reader. The other way around is slightly trickier. When reading older data with a new schema, Avro will fill in the default values for fields that are present in the schema but not in the data. This is provided that there is a default value. Basically, this means that it is recommended to always provide a default value for all your fields in the schema. In case of nullable fields, the default value could just be null.
One other reason to always provide a default value is that Avro does not allow to create records with missing values if there are no default values. As a result of this, fields that have no default value always must be populated in the mapping, otherwise an error will occur. This is problematic if the mapping for some reason fails to set a field (e.g. because of a user typing in a non-conforming location in the browser).
In addition to introducing new fields with defaults, other forms of changes such as renaming and type changes can be permitted under some circumstances. For full details on the changes that are permitted and how the writing and reading schemas are reconciled refer to the Avro documentation on schema resolution.
Mappings are specified by Groovy scripts that are compiled and run by Divolte Collector on startup. Each mapping script is written in the mapping DSL. The result of running this script is a mapping that Divolte Collector can use to map incoming events from its configured sources onto an Avro schema.
The values that are available on an event depend on the type of source that produced it. Currently browser events make more values available than events produced by JSON sources.
Divolte Collector comes with a built-in default schema and mapping. A mapping will use these if the mapping schema or script file are not specified. The default mapping is intended to map events from a browser source and will map most things that you would expect from a clickstream data collector. The Avro schema that is used can be found in the divolte-schema Github repository. The following mappings are present in the default mapping:
Mapped value | Avro schema field |
---|---|
duplicate | detectedDuplicate |
corrupt | detectedCorruption |
firstInSession | firstInSession |
timestamp | timestamp |
clientTimestamp | clientTimestamp |
remoteHost | remoteHost |
referer | referer |
location | location |
viewportPixelWidth | viewportPixelWidth |
viewportPixelHeight | viewportPixelHeight |
screenPixelWidth | screenPixelWidth |
screenPixelHeight | screenPixelHeight |
partyId | partyId |
sessionId | sessionId |
pageViewId | pageViewId |
eventType | eventType |
userAgentString | userAgentString |
User agent name | userAgentName |
User agent family | userAgentFamily |
User agent vendor | userAgentVendor |
User agent type | userAgentType |
User agent version | userAgentVersion |
User agent device category | userAgentDeviceCategory |
User agent OS family | userAgentOsFamily |
User agent OS version | userAgentOsVersion |
User agent OS vendor | userAgentOsVendor |
The default schema is not available as a mapping script. Instead, it is hard coded into Divolte Collector. This allows Divolte Collector to do something useful out-of-the-box without any complex configuration.
Mapping involves three main concepts: values, fields and mappings.
A value is something that is extracted from the incoming event (e.g. the location or a HTTP header value) or is derived from another value (e.g. a query parameter from the location URI). Values in the mapping are produced using calls to functions that are built into the mapping DSL. Below is the complete documentation for all values that can be produced. One example of such a function call would be calling location() for the location value or referer() for the referrer value of the event.
A field is a field in the Avro record that will be produced as a result of the mapping process. The type of a field is defined by the Avro schema that is used. Mapping is the process of mapping values extracted from the event onto fields in the Avro record.
A mapping is the piece that tells Divolte Collector which values need to be mapped onto which fields. The mapping DSL has a built in construct for this, explained below.
The simplest possible mapping is mapping a simple value onto a schema field. The syntax is as follows:
map location() onto 'locationField'
Alternatively, the map function takes a closure as first argument, which can come in handy when the value is the result of several operations or a more complex construct, such as this example where we take a query parameter from the location and parse it as an integer:
map {
def u = parse location() to uri // Parse the URI out of the location
parse u.query().value('n') to int32 // Take the n query parameter and try to parse an int out of it
} onto 'intField'
In Groovy the last statement in a closure becomes the return value for the closure. So in the closure above, the value returned by the parse call is the result of the entire closure. This is in turn mapped onto the intField field of the Avro record.
Apart from mapping values onto fields, it is also possible to map a literal onto a field:
map 'string literal' onto 'stringField'
map true onto 'booleanField'
This is most often used in combination with Conditional mapping (when) as in this example:
when referer().isAbsent() apply { // Only apply this mapping when a referer is absent
map true onto 'directTraffic'
}
Not all values are present in each event. For example, when using a custom cookie value there could be incoming events where the cookie is not sent by the client. In this case the cookie value is said to absent. Similarly, events from a JSON source do not have a location value; this is specific to events from a browser source.
Divolte Collector will never actively set an absent value. Instead for absent values it does nothing at all: the mapped field is not set on the Avro record. When values that are absent are used in subsequent expressions the derived values will also be absent. In the following example the intField field will never be set because the incoming request has no referrer. This is not an error:
def u = parse referer() to uri // parse a URI out of the referer
def q = u.query() // parse the query string of the URI
def i = parse q.value('foo') to int32 // parse a int out of the query parameter 'foo'
map i onto 'intField' // map it onto the field 'intField'
Because absent values result in fields not being set your schema must have default values for all fields that are used for mappings where the value can be absent. In practice, it is recommended to always use default values for all fields in your schema.
Values in a mapping are typed and the value type must match the type of the Avro field that they are mapped onto. Divolte Collector checks for type compatibility during startup and will report an error if there is a mismatch. The type for a value can be found in the documentation below.
Below is a table of all types that can be produced in a mapping and the corresponding Avro types that match them:
Type | Avro type |
---|---|
String | { "name": "fieldName", "type": ["null","string"], "default": null }
|
Boolean | { "name": "fieldName", "type": ["null","boolean"], "default": null }
|
int | { "name": "fieldName", "type": ["null","int"], "default": null }
|
long | { "name": "fieldName", "type": ["null","long"], "default": null }
|
float | { "name": "fieldName", "type": ["null","float"], "default": null }
|
double | { "name": "fieldName", "type": ["null","double"], "default": null }
|
Map<String,List<String>> | {
"name": "fieldName",
"type": [
"null",
{
"type": "map",
"values": {
"type": "array",
"items": "string"
}
}
],
"default": null
}
|
List<String> | {
"name": "fieldName",
"type":
[
"null",
{
"type": "array",
"items": "int"
}
],
"default": null
}
|
JSON (JsonNode) | Must match the structure of the JSON fragment. See Mapping JSON (JsonNode) to Avro fields. |
Many of the simple values that can be extracted from an event are strings. Sometimes these values are not intended to be strings. Because type information about things like query parameters or path components is not present in a HTTP request, Divolte Collector can only treat these values as strings. It is, however, possible to parse a string to a primitive or other type in the mapping using this construct:
def i = parse stringValue to int32
In the example above, stringValue is a string value and the result value, assigned to i, will be of type int.
Note
This is not casting, but string parsing. If the string value cannot be parsed to an integer (because it is not a number) the resulting value will be absent, but no error occurs.
A more complete example is this:
def u = parse referer() to uri // u is of type URI (which is not mappable)
def q = u.query() // q is of type map<string,list<string>>
def s = q.value('foo') // s is of type string if query parameter foo contained a integer number
def i = parse s to int32 // i is of type int
map i onto 'intField' // map it onto the field 'intField'
Because int, long, Boolean, etc. are reserved words in Groovy, the mapping DSL uses aliases for parsing. The following table lists the types that can be used for parsing and the corresponding mapping types:
Parsing alias | Type |
---|---|
int32 | int |
int64 | long |
fp32 | float |
fp64 | double |
bool | Boolean |
uri | URI |
Some expressions, for example, eventParameters() (and its path() method), produce a JsonNode value that represents JSON supplied by a client. Because Avro doesn’t have a type for handling arbitrary JSON data, a compatible Avro type must be chosen to match the expected structure of the JSON from the client. The following table lists the rules for compatibility between JSON values and Avro types.
Avro type | JSON value |
---|---|
null
|
JSON’s null value |
boolean
|
A JSON boolean, or a string if it can be parsed as a boolean. |
int
long
|
A JSON number, or a string if it can be parsed as a number. Fractional components are truncated for float and double. |
float
double
|
A JSON number, or a string if it can be parsed as a number. Note that full floating-point precision may not be preserved. |
bytes
|
A JSON string, with BASE64 encoded binary data. |
string
|
A JSON string, number or boolean value. |
enum
|
A JSON string, so long as the it’s identical to one of the enumeration’s symbols. (If not, the value will be treated as null. |
record
|
A JSON object, with each property corresponding to a field in the record. (Extraneous properties are ignored.) The property values and field types must also be compatible. |
array
|
A JSON array. Each element of the JSON array must be compatible with the type declared for the Avro array. |
map
|
A JSON object, with each property being an entry in the map. Property names are used for keys, and the values must be compatible with the Avro type for the map values. |
union
|
Only trivial unions are supported of null with another type. The JSON value must either be null or compatible with the other union type. |
fixed
|
The same as bytes, as above. Data beyond the declared length will be truncated. |
In addition to these compatibility rules, trivial array wrapping and unwrapping will be performed if necessary:
For example, a shopping basket could be supplied as the following JSON:
{
"total_price": 184.91,
"items": [
{ "sku": "0886974140818", "count": 1, "price_per": 43.94 },
{ "sku": "0094638246817", "count": 1, "price_per": 22.99 },
{ "sku": "0093624979357", "count": 1, "price_per": 27.99 },
{ "sku": "8712837825207", "count": 1, "price_per": 89.99 }
]
}
This could be mapped using the following Avro schema:
{
"type": [
"null",
{
"name": "ShoppingBasket",
"type": "record",
"fields": [
{ "name": "total_price", "type": "float" },
{
"name": "items",
"type": {
"type": "array",
"items": {
"type": "record",
"name": "LineItem",
"fields": [
{ "name": "sku", "type": "string" },
{ "name": "count", "type": "int" },
{ "name": "price_per", "type": "double" }
]
}
}
}
]
}
],
"default": null
}
The Avro field will remain unchanged if mapping fails at runtime because the JSON value cannot be mapped onto the specified Avro type. (The complete record may subsequently be invalid if the field was mandatory.)
Note
Unlike most mappings, schema compatibility for JSON mappings cannot be checked on startup because compatibility depends on the JSON supplied with each individual event.
Not all incoming requests are the same and usually, different types of requests require different values to be extracted and different fields to be set. This can be achieved using conditional mapping. With conditional mapping any boolean value can be used to conditionally apply a part of the mapping script. This can be done using the following syntax:
when conditionBooleanValue apply {
// Conditional mapping go here
map 'value' onto 'fieldName'
}
A more concrete example of using this construct would be:
when referer().isAbsent() apply {
map true onto 'directTraffic'
}
Here we check whether the referer value is absent and if so, map a literal value onto a boolean field.
As an alternative syntax, it is possible to use a closure that produces the boolean value as well, just like in Mapping values onto fields (map). In this example we check if a query parameter called clientId is present in the location and on that condition perform a mapping:
when {
def u = parse location() to uri
u.query().value('clientId').isPresent()
} apply {
map true onto 'signedInUser'
}
Any boolean value can be used as a condition. In order to be able to create flexible conditional mappings, the mapping DSL provides a number of methods on values that return booleans useful in conditional mappings, such as equality comparisons and boolean logic:
Condition | Description |
---|---|
value.isPresent() | True if the value is present. See: Value presence |
value.isAbsent() | True if the value is absent. See: Value presence |
value.equalTo(otherValue) | True if both values are equal. Values must be of the same type. |
value.equalTo('literal') | True if the value is equal to the given literal. Non-string types are supported as well. |
booleanValue.and(otherBooleanValue) | True if both booleans are true. |
booleanValue.or(otherBooleanValue) | True if either or both of the boolean values are true. |
not booleanValue | True if the boolean value is false. |
regexMatcherValue.matches() | True if the regular expression matches the value. See: Regular expression matching. |
Sections are useful for grouping together parts of the mapping that form a logical subset of the entire mapping. In addition to grouping it is possible to conditionally stop processing a section prematurely. Sections are defined using the section keyword followed by a closure that contains the section:
section {
// Section's mappings go here
map 'value' onto 'field'
}
The exit() function will, at any point, break out of the enclosing section or, when no enclosing section can be found, break out of the entire mapping script. This can be used to conditionally break out of a section. For example to create a type of first-match-wins scenario:
section {
def u = parse location() to uri
when u.path().equalTo('/home.html') apply {
map 'homepage' onto 'pageType'
exit()
}
when u.path().equalTo('/contact.html') apply {
map 'contactpage' onto 'pageType'
exit()
}
map 'other' onto 'pageType'
}
// other mappings here
There is a optional shorthand syntax for conditionally exiting from a section which leaves out the apply keyword and closure:
when referer().isAbsent() exit()
The stop() function will, at any point, stop all further processing and break out of the entire mapping script. This is typically applied conditionally. Generally, it is safer to use sections and exit() instead. Use with care. The stop() function can also be used conditionally, just as anything else:
when referer().isAbsent() {
stop()
}
Or, using shorthand syntax:
when referer().isAbsent stop()
Groovy is a dynamic language for the JVM. This means, amongst other things, that you don’t have to specify the types of variables:
def i = 40
println i + 2
The above snippet will print out 42 as you would expect. Note two things: we never specified that variable i is an int and also, we are not using any parentheses in the println function call. Groovy allows to leave out the parentheses in most function and method calls. The code above is equivalent to this snippet:
def i = 42
println(i + 2)
This in turn is equivalent to this:
def i = 42
println(i.plus(2))
This works well when chaining single argument methods. However, this can be more problematic with nested method calls. Suppose we have a function called increment(x) which increments the x argument by 1, so increment(10) will return 11. The following will not compile:
println increment 10
However this will:
println(increment(10))
Yet this won’t:
println(increment 10)
In the Divolte Collector mapping DSL, it is sometimes required to chain method calls. For example when using the result of a casting operation in a mapping. We solve this by accepting a closure that produces a value as result:
map { parse cookie('customer_id') to int32 } onto 'customerId'
This way you don’t have to add parentheses to all intermediate method calls and we keep the syntax fluent. If you follow these general guidelines, you should be safe:
When calling methods that produce a value, always use parentheses. For example: location(), referer(), partyId()
When deriving a condition or other value from a method that produces a value, also use parentheses. For example:
when location().equalTo('http://www.example.com/') apply {
...
}
map cookie('example').isPresent() onto 'field'
map parsedUri.query().value('foo') onto 'field'
When parsing or matching on something, extract it to a variable before using it. This also improves readability:
def myUri = parse location() to uri
when myUri.query().value('foo').isPresent() apply { ... }
def myMatcher = match '^/foo/bar/([a-z]+)/' against myUri.path()
when myMatcher.matches() apply { ... }
When casting inline, use the closure syntax for mapping or conditionals:
map { parse cookie('example') to int32 } onto 'field'
Simple values are pieces of information that are directly extracted from the event without any processing. You can map simple values directly onto fields of the correct type or you can use them in further processing, such as matching againast a regular expression or URI parsing.
Usage: | map location() onto 'locationField'
|
---|---|
Sources: | browser |
Description: | The location URL of the page where the event was triggered: the full address in the address bar of the user’s browser. This includes the fragment part if this is present (the part after the #), which is different from server side request logs which do not contain the fragment part. |
Type: | string |
Usage: | map referer() onto 'refererField'
|
---|---|
Sources: | browser |
Description: | The referrer URL for the page-view that triggered the event. Unlike location(), the referer will not contain any fragment part. |
Type: | String |
Usage: | map firstInSession() onto 'first'
|
---|---|
Sources: | browser, json |
Description: | A boolean flag that is true if a new session ID was generated for this event and false otherwise. If true a new session has started. |
Type: | Boolean |
Usage: | map corrupt() onto 'detectedCorruption'
|
---|---|
Sources: | browser, json |
Description: | A boolean flag that is true if the source for the event detected corruption of the event data. Event corruption usually occurs when intermediate parties try to re-write HTTP requests or truncate long URLs. Real-world proxies and anti-virus software has been observed doing this. Note that although this field is available on events from all sources, only browser sources currently detect corruption and set this value accordingly. |
Type: | Boolean |
Usage: | map duplicate() onto 'detectedDuplicate'
|
---|---|
Sources: | browser, json |
Description: | A boolean flag that true when the event is believed to be a duplicate of an earlier one. Duplicate detection in Divolte Collector utilizes a probabilistic data structure that has a low false positive and false negative rate. Nonetheless classification mistakes can still occur. Duplicate events often arrive due to certain types of anti-virus software and certain proxies. Additionally, browsers sometimes go haywire and send the same request large numbers of times (in the tens of thousands). Duplicate detection can be used to mitigate the effects when this occurs. This is particularly handy in real-time processing where it is not practical to perform de-duplication of the data based on a full data scan. |
Type: | Boolean |
Usage: | map timestamp() onto 'timeField'
|
---|---|
Sources: | browser, json |
Description: | The timestamp of the time the the request was received by the server, in milliseconds since the UNIX epoch. |
Type: | long |
Usage: | map clientTimestamp() onto 'timeField'
|
---|---|
Sources: | browser, json |
Description: | The timestamp that was recorded on the client side immediately prior to sending the request, in milliseconds since the UNIX epoch. |
Type: | long |
Usage: | map remoteHost() onto 'ipAddressField'
|
---|---|
Sources: | browser, json |
Description: | The remote IP address of the request. Depending on configuration, Divolte Collector will use any X-Forwarded-For headers set by intermediate proxies or load balancers. |
Type: | String |
Usage: | map viewportPixelWidth() onto 'widthField'
|
---|---|
Sources: | browser |
Description: | The width of the client’s browser viewport in pixels. |
Type: | int |
Usage: | map viewportPixelHeight() onto 'widthField'
|
---|---|
Sources: | browser |
Description: | The height of the client’s browser viewport in pixels. |
Type: | int |
Usage: | map screenPixelWidth() onto 'widthField'
|
---|---|
Sources: | browser |
Description: | The width of the client’s screen in pixels. |
Type: | int |
Usage: | map screenPixelHeight() onto 'widthField'
|
---|---|
Sources: | browser |
Description: | The height of the client’s screen in pixels. |
Type: | int |
Usage: | map devicePixelRatio() onto 'ratioField'
|
---|---|
Sources: | browser |
Description: | The ratio of physical pixels to logical pixels on the client’s device. Some devices use a scaled resolution, meaning that the resolution and the actual available pixels are different. This is common on retina-type displays, with very high pixel density. |
Type: | int |
Usage: | map partyId() onto 'partyField'
|
---|---|
Sources: | browser, json |
Description: | A long-lived unique identifier stored by a client that is associated with each event they send. All events from the same client should have the same party identifier. For browser sources this value is stored in a cookie. |
Type: | String |
Usage: | map sessionId() onto 'sessionField'
|
---|---|
Sources: | browser, json |
Description: | A short-lived unique identifier stored by a client that is associated with each event from that source within a session of activity. All events from the same client within a session should have the same session identifier. For browser sources a session to expire when 30 minutes has elapsed without any events occurring. |
Type: | String |
Usage: | map pageViewId() onto 'pageviewField'
|
---|---|
Sources: | browser |
Description: | A unique identifier that is generated for each page-view. All events from a client within the same page-view will have the same page-view identifier. For browser sources a page-view starts when the user visits a page, and ends when the user navigates to a new page. Note that navigating within single-page web applications or links to anchors within the same page do not normally trigger a new page-view. |
Type: | String |
Usage: | map eventId() onto 'eventField'
|
---|---|
Sources: | browser, json |
Description: | A unique identifier that is associated with each event received from a source. (This identifier is assigned by the client, not by the server.) |
Type: | String |
Usage: | map userAgentString() onto 'uaField'
|
---|---|
Sources: | browser, json |
Description: | The full user agent identification string reported by the client HTTP headers when sending an event. See User agent parsing on how to extract more meaningful information from this string. |
Type: | String |
Usage: | map cookie('cookie_name') onto 'customCookieField'
|
---|---|
Sources: | browser, json |
Description: | The value of a cookie included in the client HTTP headers when sending an event. |
Type: | String |
Usage: | map eventType() onto 'eventTypeField'
|
---|---|
Sources: | browser, json |
Description: | The type of event being processed. The tracking tag used by sites integrating with browser sources automatically issue a pageView event by default when a page-view commences. Custom events may set this value to anything they like. |
Type: | String |
Complex values often return intermediate objects that you extract derived, simple values for mapping onto fields. The main exception to this is when working with event-parameters: the JsonNode results can be mapped directly to fields, so long as they are of the right ‘shape’; see Mapping JSON (JsonNode) to Avro fields for more details.
Usage: | When submitting custom events from a client:
In the mapping: map eventParameters() onto 'parametersField'
|
||||
---|---|---|---|---|---|
Sources: | browser, json |
||||
Description: | A JSON object or array (JsonNode) containing the custom parameters that were submitted with the event. See Mapping JSON (JsonNode) to Avro fields for an example on how to map this to a field. |
||||
Type: | JsonNode |
Usage: | When submitting custom events from a client:
In the mapping: map eventParameters().value('foo') onto 'fooField'
// Or with a cast:
map { parse eventParameters().value('bar') to int32 } onto 'barField'
|
||||
---|---|---|---|---|---|
Description: | The value for an event parameter that was sent as part of a custom event. Note that this is always a string, regardless of the type used on the client side. If you are certain a parameter has a specific format you can explicitly cast it as in the example above. |
||||
Type: | String |
Usage: | When submitting custom events from a client:
In the Avro schema: {
"name": "searchResults",
"type": [ "null", { "type": "array", "items": "string" } ],
"default": null
}
In the mapping: map eventParameters().path('$[*].sku') onto 'searchResults'
|
||||
---|---|---|---|---|---|
Description: | This can be used to extract parts of parameters supplied with the event using a JSON-path expression. (See http://goessner.net/articles/JsonPath/ for a description of JSON-path expressions.) If the expression does not match anything, the value is not considered to be present. (A when expression can test for this.) See Mapping JSON (JsonNode) to Avro fields for an example on how to map JSON values to a field. Expressions can return more than one result; these are presented as a JSON array for subsequent mapping. |
||||
Type: | JsonNode |
Usage: | def locationUri = parse location() to uri
|
---|---|
Description: | Attempts to parse a string as a URI. The most obvious candidates to use for this are the location() and referer() values, but you can equally do this same with custom event parameters or any other string value. If the parser fails to create a URI from a string, then the value will be absent. Note that the parsed URI itself is not directly mappable onto any Avro field. |
Type: | URI |
Usage: | def locationUri = parse location() to uri
map locationUri.path() onto 'locationPathField'
|
---|---|
Description: | The path component of a URI. Any URL encoded values in the path will be decoded. Keep in mind that if the path contains a encoded / character (%2F), this will also be decoded. Be careful when matching regular expressions against path parameters. |
Type: | String |
Usage: | def locationUri = parse location() to uri
map locationUri.rawPath() onto 'locationPathField'
|
---|---|
Description: | The path component of a URI. This value is not decoded in any way. |
Type: | String |
Usage: | def locationUri = parse location() to uri
map locationUri.scheme() onto 'locationSchemeField'
// or check for HTTPS and map onto a boolean field
map locationUri.scheme().equalTo('https') onto 'isSecure'
|
---|---|
Description: | The scheme component of a URI. This is the protocol part, such as http or https. |
Type: | String |
Usage: | def locationUri = parse location() to uri
map locationUri.host() onto 'locationHostField'
|
---|---|
Description: | The host component of a URI. For http://www.example.com/foo/bar this would be www.example.com. |
Type: | String |
Usage: | def locationUri = parse location() to uri
map locationUri.port() onto 'locationPortField'
|
---|---|
Description: | The port component of a URI. For http://www.example.com:8080/foo this would be 8080. Note that when no port is specified in the URI (e.g. http://www.example.com/foo) this value will be absent. Divolte Collector makes no assumptions about default ports for protocols. |
Type: | int |
Usage: | def locationUri = parse location() to uri
map locationUri.decodedQueryString() onto 'locationQS'
|
---|---|
Description: | The full, URL decoded query string of a URI. For http://www.example.com/foo/bar.html?q=hello+world&foo%2Fbar, this would be q=hello world&foo/bar. |
Type: | String |
Usage: | def locationUri = parse location() to uri
map locationUri.rawQueryString() onto 'locationQS'
|
---|---|
Description: | The full, query string of a URI without any decoding. For http://www.example.com/foo/bar.html?q=hello+world&foo%2Fbar this would be q=hello+world&foo%2Fbar. |
Type: | String |
Usage: | def locationUri = parse location() to uri
map locationUri.decodedFragment() onto 'locationFragment'
|
---|---|
Description: | The full, URL decoded fragment of a URI. For http://www.example.com/foo/#/localpath/?q=hello+world&foo%2Fbar this would be /localpath/?q=hello world&foo/bar. |
Type: | String |
Usage: | def locationUri = parse location() to uri
map locationUri.rawFragment() onto 'locationFragment'
|
---|---|
Description: | The full, fragment of a URI without any decoding. For http://www.example.com/foo/#/localpath/?q=hello+world&foo%2Fbar this would be /localpath/?q=hello+world&foo%2Fbar. In web applications with rich client side functionality written in JavaScript, it is a common pattern that the fragment of the location is written as a URI again, but without a scheme, host and port. Nonetheless, it is entirely possible to parse the raw fragment of a location into a separate URI again and use this for further mapping. As an example, consider the following: // If location() = 'http://www.example.com/foo/#/local/path/?q=hello+world'
// this would map '/local/path/' onto the field clientSidePath
def locationUri = parse location() to uri
def localUri = parse location().rawFragment() to uri
map localUri.path() onto 'clientSidePath'
|
Type: | String |
Usage: | def locationUri = parse location() to uri
def locationQuery = locationUri.query()
map locationQuery onto 'locationQueryParameters'
|
---|---|
Description: | The query string from a URI parsed into a map of value lists. In the resulting map, the keys are the parameter names of the query string and the values are lists of strings. Lists are required because a query parameter can have multiple values (by being present more than once). In order to map all the query parameters directly onto a Avro field, the field must be typed as a map of string lists, possibly a union with null, to have a sensible default when no query string is possible. In a Avro schema definition, the following field definition can be a target field for the query parameters: {
"name": "uriQuery",
"type": [
"null",
{
"type": "map",
"values": {
"type": "array",
"items": "string"
}
}
],
"default": null
}
|
Type: | Map<String,List<String>> |
Usage: | def locationUri = parse location() to uri
def locationQuery = locationUri.query()
map locationQuery.value('foo') onto 'fooQueryParameter'
|
---|---|
Description: | The first value found for a query parameter. This value is URL decoded. |
Type: | String |
Usage: | def locationUri = parse location() to uri
def locationQuery = locationUri.query()
map locationQuery.valueList('foo') onto 'fooQueryParameterValues'
|
---|---|
Description: | A list of all values found for a query parameter name. These values are URL decoded. |
Type: | List<String> |
Usage: | def matcher = match '/foo/bar/([a-z]+).html$' against location()
|
---|---|
Description: | Matches a regular expression against a string value; the entire value must match. The result of this can not be directly mapped onto a Avro field, but can be used to extract capture groups or conditionally perform a mapping if the pattern is a match. Often it is required to perform non-trivial partial extractions against strings that are taken from the requests. One example would be matching the path of the location with a wild card. It is not recommended to match patterns against the location() or referer() values directly; instead parse as an URI first and match against the relevant parts. In the following example, the matching is much more robust in the presence of unexpected query parameters or fragments compared to matching against the entire location string: def locationUri = parse location() to uri
def pathMatcher = match '^/foo/bar/([a-z]+).html$' against locationUri.path()
when pathMatcher.matches() apply {
map 'fooBarPage' onto 'pageTypeField'
map pathMatcher.group(1) onto 'pageNameField'
}
|
Type: | Matcher |
Usage: | def matcher = match '^/foo/bar/([a-z]+).html$' against location()
// use in conditional mapping
when matcher.matches() apply {
map 'fooBarPage' onto 'pageTypeField'
}
// or map directly onto a boolean field
map matcher.matches() onto 'isFooBarPage'
|
---|---|
Description: | True when the value is present and matches the regular expression or false otherwise. |
Type: | Boolean |
Usage: | // Using group number
def matcher = match '/foo/bar/([a-z]+).html$' against location()
map matcher.group(1) onto 'pageName'
// Using named capture groups
def matcher = match '/foo/bar/(?<pageName>[a-z]+).html$' against location()
map matcher.group('pageName') onto 'pageName'
|
---|---|
Description: | The value from a capture group in a regular expression pattern if the pattern matches, absent otherwise. Groups can be identified by their group number, starting from 1 as the first group or using named capture groups. |
Type: | String |
Usage: | map header('header-name') onto 'fieldName'
|
---|---|
Sources: | browser, json |
Description: | The list of all values associated with the given HTTP header from the incoming request. A HTTP header can be present in a request multiple times, yielding multiple values for the same header name; these are returned as a list. The Avro type of the target field for this mapping must be a list of string: {
"name": "headers",
"type":
[
"null",
{
"type": "array",
"items": ["string"]
}
],
"default": null
}
Note that the array field in Avro itself is nullable and has a default value of null, whereas the items in the array are not nullable. The latter is not required, because when the header is present the elements in the list are guaranteed to be non-null. |
Type: | List<String> |
Usage: | map header('header-name').first() onto 'fieldName'
|
---|---|
Description: | The first of all values associated with the given HTTP header from the incoming request. A HTTP header can be present in a request multiple times, yielding multiple values for the same header name. This returns the first value in that list. |
Type: | String |
Usage: | map header('header-name').last() onto 'fieldName'
|
---|---|
Description: | The last of all values associated with the given HTTP header from the incoming request. A HTTP header can be present in a request multiple times, yielding multiple values for the same header name. This returns the last value in that list. |
Type: | String |
Usage: | map header('header-name').commaSeparated() onto 'fieldName'
|
---|---|
Description: | The comma separated string of all values associated with the given HTTP header from the incoming request. A HTTP header can be present in a request multiple times, yielding multiple values for the same header name. This joins that list using a comma as separator. |
Type: | String |
Usage: | def ua = userAgent()
|
---|---|
Sources: | browser, json |
Description: | Attempts to parse a the result of userAgentString string into a user agent object. Note that this result is not directly mappable onto any Avro field. Instead, the subfields from this object, described below, can be mapped onto fields. When the parsing of the user agent string fails, either because the user agent is unknown or malformed, or because the user agent was not sent by the browser, this value and all subfield values are absent. |
Type: | ReadableUserAgent |
Usage: | map userAgent().name() onto 'uaNameField'
|
---|---|
Description: | The canonical name for the parsed user agent. E.g. ‘Chrome’ for Google Chrome browsers. |
Type: | String |
Usage: | map userAgent().family() onto 'uaFamilyField'
|
---|---|
Description: | The canonical name for the family of the parsed user agent. E.g. Mobile Safari for Apple’s mobile browser. |
Type: | String |
Usage: | map userAgent().vendor() onto 'uaVendorField'
|
---|---|
Description: | The name of the company or organisation that produces the user agent software. E.g. Google Inc. for Google Chrome browsers. |
Type: | String |
Usage: | map userAgent().type() onto 'uaTypeField'
|
---|---|
Description: | The type of user agent that was used. E.g. Browser for desktop browsers. |
Type: | String |
Usage: | map userAgent().version() onto 'uaVersionField'
|
---|---|
Description: | The version string of the user agent software. E.g. 39.0.2171.71 for Google Chrome 39. |
Type: | String |
Usage: | map userAgent().deviceCategory() onto 'uaDeviceCategoryField'
|
---|---|
Description: | The type of device that the user agent runs on. E.g. Tablet for a tablet based browser. |
Type: | String |
Usage: | map userAgent().osFamily() onto 'uaOSFamilyField'
|
---|---|
Description: | The operating system family that the user agent runs on. E.g. OS X for an Apple Mac OS X based desktop. |
Type: | String |
Derived simple value:
Usage: | map userAgent().osVersion() onto 'uaOSVersionField'
|
---|---|
Description: | The version string of the operating system that the user agent runs on. E.g. 10.10.1 for Mac OS X 10.10.1. |
Type: | String |
Usage: | map userAgent().osVendor() onto 'uaOSVendorField'
|
---|---|
Description: | The name of the company or organisation that produces the operating system that the user agent software runs on. E.g. Apple Computer, Inc. for Apple Mac OS X. |
Type: | String |
Usage: | // uses the remoteHost as IP address to lookup
def ua = ip2geo()
// If a load balancer sets custom headers for IP addresses, use like this
def ip = header('X-Custom-Header').first()
def myUa = ip2geo(ip)
|
---|---|
Sources: | browser, json |
Description: | Attempts to turn a IPv4 address into a geo location by performing a lookup into a configured MaxMind GeoIP City database. This database is not distributed with Divolte Collector, but must be provided separately. See the Configuration chapter for more details on this. Note that this result is not directly mappable onto any Avro field. Instead the subfields from this object, described below, can be mapped onto fields. When the lookup for a IP address fails or when the argument is not a IPv4 address, this value and all subfield values are absent. |
Type: | CityResponse |
Usage: | map ip2geo().cityId() onto 'cityIdField'
|
---|---|
Description: | The GeoNames City ID for the geolocation. |
Type: | int |
Usage: | map ip2geo().cityName() onto 'cityNameField'
|
---|---|
Description: | The city name for the geolocation in English. |
Type: | String |
Usage: | map ip2geo().continentCode() onto 'continentCodeField'
|
---|---|
Description: | The ISO continent code for the geolocation. |
Type: | String |
Usage: | map ip2geo().continentId() onto 'continentIdField'
|
---|---|
Description: | The GeoNames Continent Id for the geolocation. |
Type: | int |
Usage: | map ip2geo().continentName() onto 'continentNameField'
|
---|---|
Description: | The continent name for the geolocation in English. |
Type: | String |
Usage: | map ip2geo().countryCode() onto 'countryCodeField'
|
---|---|
Description: | The ISO country code for the geolocation. |
Type: | String |
Usage: | map ip2geo().countryId() onto 'countryIdField'
|
---|---|
Description: | The GeoNames Country Id for the geolocation. |
Type: | int |
Usage: | map ip2geo().countryName() onto 'countryNameField'
|
---|---|
Description: | The country name for the geolocation in English. |
Type: | String |
Usage: | map ip2geo().latitude() onto 'latitudeField'
|
---|---|
Description: | The latitude for the geolocation. |
Type: | double |
Usage: | map ip2geo().longitude() onto 'longitudeField'
|
---|---|
Description: | The longitude for the geolocation. |
Type: | double |
Usage: | map ip2geo().metroCode() onto 'metroCodeField'
|
---|---|
Description: | The Metro Code for the geolocation. |
Type: | String |
Usage: | map ip2geo().timeZone() onto 'timeZoneField'
|
---|---|
Description: | The name of the time zone for the geolocation as found in the IANA Time Zone Database. |
Type: | String |
Usage: | map ip2geo().mostSpecificSubdivisionCode() onto 'mostSpecificSubdivisionCodeField'
|
---|---|
Description: | The ISO code for the most specific subdivision known for the geolocation. |
Type: | String |
Usage: | map ip2geo().mostSpecificSubdivisionId() onto 'mostSpecificSubdivisionIdField'
|
---|---|
Description: | The GeoNames ID for the most specific subdivision known for the geolocation. |
Type: | int |
Usage: | map ip2geo().mostSpecificSubdivisionName() onto 'mostSpecificSubdivisionNameField'
|
---|---|
Description: | The name for the most specific subdivision known for the geolocation in English. |
Type: | String |
Usage: | map ip2geo().postalCode() onto 'postalCodeField'
|
---|---|
Description: | The postal code for the geolocation. |
Type: | String |
Usage: | map ip2geo().subdivisionCodes() onto 'subdivisionCodesField'
|
---|---|
Description: | The ISO codes for all subdivisions for the geolocation in order from least to most specific. |
Type: | List<String> |
Usage: | map ip2geo().subdivisionIds() onto 'subdivisionIdsFields'
|
---|---|
Description: | The GeoNames IDs for all subdivisions for the geolocation in order from least to most specific. |
Type: | List<String> |
Usage: | map ip2geo().subdivisionNames() onto 'subdivisionNames'
|
---|---|
Description: | The names in English for all subdivisions for the geolocation in order from least to most specific. |
Type: | List<String> |