Interesting Splunk MLTK Features for Machine Learning (ML) Development

The Splunk Machine Learning Toolkit is packed with machine learning algorithms, new visualizations, web assistants and much more. This blog sheds light on some lesser-known features and commands in the Splunk Machine Learning Toolkit (MLTK) and core Splunk Enterprise that will assist you in various steps of model creation and development. Each new release of Splunk or the Splunk MLTK makes a catalog of new commands available. In this blog I attempt to highlight commands that have helped in some data science or analytical use cases.


Make Your Dashboards Smile! 😀

Recently a customer was reviewing asset information in Aura Asset Intelligence, our premium application for Splunk, and some interesting data showed up. Users had mobile devices with emojis in their device names.

It was a bit surprising at first as it’s not what you would normally expect in a corporate IT environment, but after thinking about it, it’s perfectly normal to see – especially with companies fully adopting BYOD programs these days.

If you weren’t already aware, Splunk can handle different character sets. You can work with non-ASCII characters, including emojis, in various ways: when indexing data and in searches, alerts, and dashboards. Once you get into the world of non-ASCII, you are dealing with Unicode. Unicode is a complex topic, with many different concepts and terms to keep straight. But that’s not really the point of this blog 😉 . For more information on Unicode you can start here.

It certainly gets you thinking 🤔 , where could emojis be used in Splunk to inject a bit of fun. Why not give your searches and dashboards a little ❤️ ?

To start, you can use them in searches:

index=main sourcetype=access_combined | eval alt_status = if(status==200,"👍","👎") | stats count by alt_status


You can use them in dashboards:

Response Time single-value panel:
index=main sourcetype=access_combined | stats avg(response) as avg_response | eval avg_response=round(avg_response,1) | eval avg_response = avg_response." ".if(avg_response < 30," 👍  "," 👎 ")

Errors single-value panel:
index=main sourcetype=access_combined | stats count(eval(status >= 500)) as errors count as total | eval error_rate=round((errors/total)*100,1) | eval alt_status = if(error_rate >= 3, "😕","😄")| fields alt_status

Status Codes table panel:
index=main sourcetype=access_combined | stats count by status | eval alt_status = case(status >= 500, "😠",status >=400, "😕", status >= 200, "😄", 1==1,"❓")


Or even use them in alerts (results will vary depending on whether the target of the alert can handle Unicode). Here’s an email example with the results embedded inline:


Maybe you can live on the wild side and even ask your developers to start using emojis in their logs….


Ok, that’s fun and all, but is there a practical use for emojis in Splunk? Sure! Why not give your dashboards some more visual eye candy when it comes to location data? You can easily create a lookup that maps each country name to its emoji flag.

Top Country single-value panel:
index=main sourcetype="access_combined" | top limit=1 clientip | iplocation clientip | eval Country = if(Country=="", "Unknown", Country) | lookup emoji_flags name as Country OUTPUT emoji | fillnull value="❓" emoji | eval top_country= Country." ".emoji | fields top_country

Requests By Country table panel:
index=main sourcetype="access_combined" | stats count by clientip | iplocation clientip | eval Country = if(Country=="", "Unknown", Country) | stats sum(count) as total by Country | lookup emoji_flags name as Country OUTPUT emoji | fillnull value="❓" emoji | sort - total

You can download the flag to emoji lookup CSV here to use in your own searches.
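If you would rather generate such a lookup yourself, here is a minimal Python sketch. It relies on the fact that a flag emoji is simply the two letters of an ISO 3166-1 country code written as Unicode regional indicator symbols. The country rows and helper names are made up for illustration; only the `name` and `emoji` column names come from the searches above.

```python
import csv
import io

def flag_emoji(iso_alpha2: str) -> str:
    """Turn a two-letter ISO 3166-1 code into its flag emoji.

    Flags are encoded as two regional indicator symbols, which start
    at U+1F1E6 for the letter 'A'.
    """
    code = iso_alpha2.upper()
    if len(code) != 2 or not code.isalpha() or not code.isascii():
        raise ValueError(f"expected a two-letter code, got {iso_alpha2!r}")
    return "".join(chr(0x1F1E6 + ord(ch) - ord("A")) for ch in code)

# Hypothetical sample rows; a real lookup needs every country name
# that iplocation can return in your data.
COUNTRIES = {"Canada": "CA", "United States": "US", "Iceland": "IS"}

def build_lookup_csv(countries: dict) -> str:
    """Return CSV text with the 'name' and 'emoji' columns."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["name", "emoji"])
    for name, code in countries.items():
        writer.writerow([name, flag_emoji(code)])
    return buf.getvalue()

print(build_lookup_csv(COUNTRIES))
```

Save the output as a UTF-8 encoded CSV before uploading it as a lookup, otherwise the emoji characters may not survive.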

The possibilities are endless! So have some fun with emojis in your dashboards, and let’s just hope that at no point do your dashboards or data go to 💩 …


Looking to expedite your success with Splunk? Click here to view our Splunk Professional Service offerings.

© Discovered Intelligence Inc., 2020. Unauthorised use and/or duplication of this material without express and written permission from this site’s owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Discovered Intelligence, with appropriate and specific direction (i.e. a linked URL) to this original content.

Quick Guide to Outlier Detection in Splunk

There are many methods of outlier detection. In this blog I will highlight a few common and simple methods that do not require the Splunk MLTK (Machine Learning Toolkit) and discuss visuals (which do require the MLTK) that complement the presentation of outliers in any scenario. This blog covers the widely accepted method of using averages and standard deviation for outlier detection. We will then compare the timeline visual against the custom Outliers Chart and the custom PunchCard visual to show how each presents these outliers.

Some Key Concepts

Understanding some key concepts is essential to any outlier detection framework. Before we jump into Splunk SPL (Search Processing Language), there are some basic ‘need-to-know’ math terms and definitions we need to highlight:

  • Outlier Detection Definition:  Outlier detection is a method of finding events or data that are different from the norm.
  • Average: The central value of a set of data.
  • Standard Deviation: A measure of the spread of data. The higher the standard deviation, the larger the difference between data points. We will use the concept of standard deviation substantially in today’s blog. To view the manual method of standard deviation calculation click here.
  • Time Series: Data ingested at regular intervals of time. Data ingested in Splunk with a timestamp and the correct ‘props.conf’ can be considered “Time Series” data.
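To make the standard deviation definition concrete, here is a small Python sketch of the manual calculation. It computes the sample standard deviation (dividing by n-1), which is what Splunk’s `stdev` function returns; `stdevp` is the population version. The numbers are made up.

```python
import math
import statistics

def sample_stdev(values):
    """Manual sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    mean = sum(values) / len(values)
    squared_diffs = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_diffs / (len(values) - 1))

# Made-up sample data
data = [2, 4, 4, 4, 5, 5, 7, 9]

print("average:", sum(data) / len(data))
print("stdev  :", round(sample_stdev(data), 3))
# Cross-check against the standard library implementation
print("matches statistics.stdev:",
      math.isclose(sample_stdev(data), statistics.stdev(data)))
```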

Additionally, we will leverage aggregate and statistical Splunk commands in this blog. The 4 important commands to remember are:

  • Bin: The ‘bin’ command puts numeric values (including time) into buckets. The ‘timechart’ and ‘chart’ commands use the bin command under the hood
  • Eventstats: Generates statistics (such as avg, max etc.) and adds them in a new field. It is great for generating statistics on ‘ALL’ events
  • Streamstats: Similar to ‘stats’, streamstats calculates statistics at the time the event is seen (as the name implies). This feature is undoubtedly useful for calculating a ‘Moving Average’, in addition to ordering events
  • Stats: Calculates Aggregate Statistics such as count, distinct count, sum, avg over all the data points in a particular field(s)

Data Requirements

The data used in this blog is Splunk’s open sourced “Bots 2.0” dataset from 2017. To gain access to this data please click here. Downloading this dataset is not required; any sample time series data that we would like to measure for outliers is valid for the purposes of this blog. For instance, we could measure outliers in megabytes going out of a network or the number of logins to an application using the same type of Splunk query. The logic used to determine outliers is highly reusable.

Using SPL

There are four methods commonly applied in the industry for basic outlier detection. They are covered in the sections below:

1. Using Static Values

The first commonly used method of determining an outlier is by constructing a flat threshold line. This is achieved by creating a static value and then using logic to determine if the value is above or below the threshold. The Splunk query to create this threshold is below :

<your spl base search> … | timechart span=6h sum(mb_out) as mb_out
| eval threshold=100 
| eval isOutlier=if('mb_out' > threshold, 1, 0)
Static threshold timeline visual

2. Average with Static Multiplier

In addition to an arbitrary static value, another commonly used method of determining outliers is a multiplier of the average. We first calculate the average of the data and then select a multiplier. This creates an upper boundary for your data. The Splunk query to create this threshold is below:

<your spl base search> …  
| timechart span=12h sum(mb_out) as mb_out 
| eventstats avg("mb_out") as average 
| eval threshold=average*2 
| eval isOutlier=if('mb_out' > threshold, 1, 0)
Average + Static threshold timeline visual
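The first two methods boil down to comparing each bucket against a single number. As a quick way to sanity-check the logic outside Splunk, here is a minimal Python sketch. The function names and the `mb_out` values are made up for illustration, and the multiplier of 2 mirrors the query above.

```python
def flag_above(values, threshold):
    """Return 1 for each value above the threshold, else 0 (like isOutlier)."""
    return [1 if v > threshold else 0 for v in values]

def average_threshold(values, multiplier=2):
    """Upper boundary defined as the average times a chosen multiplier."""
    return (sum(values) / len(values)) * multiplier

# Made-up mb_out totals, one per time bucket
mb_out = [40, 55, 48, 300, 52, 47]

# Method 1: static value
print(flag_above(mb_out, 100))

# Method 2: average with a static multiplier
threshold = average_threshold(mb_out, multiplier=2)
print(round(threshold, 2), flag_above(mb_out, threshold))
```

Note that a single very large value also raises the average, and therefore the threshold, so the multiplier usually needs some tuning against your own data.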

3. Average with Standard Deviation

Similar to the previous methods, now we use a multiplier of standard deviation to calculate outliers. This will result in a fixed upper and lower boundary for the duration of the timespan selected. The Splunk query to create this threshold is below:

<your spl base search> ... | timechart span=12h sum(mb_out) as mb_out 
 | eventstats avg("mb_out") as avg stdev("mb_out") as stdev 
 | eval lowerBound=(avg-stdev*exact(2)), upperBound=(avg+stdev*exact(2))
 | eval isOutlier=if('mb_out' < lowerBound OR 'mb_out' > upperBound, 1, 0) 
2*Standard Deviation timeline visual

Notice that with the addition of the lower and upper boundary lines the timeline chart becomes cluttered.
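For readers who want to see the arithmetic behind the fixed boundaries, here is a short Python sketch of the same idea: one average and one sample standard deviation over the whole time range (the role eventstats plays above), then a band of 2 standard deviations on either side. The data and function names are made up for illustration.

```python
import statistics

def stdev_bounds(values, multiplier=2):
    """Return (lowerBound, upperBound) as average -/+ multiplier * stdev."""
    avg = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    return avg - multiplier * stdev, avg + multiplier * stdev

def flag_outliers(values, multiplier=2):
    """Flag values that fall outside the fixed lower and upper bounds."""
    lower, upper = stdev_bounds(values, multiplier)
    return [1 if v < lower or v > upper else 0 for v in values]

# Made-up mb_out totals, one per time bucket
mb_out = [40, 55, 48, 300, 52, 47, 51, 44, 50, 49]

lower, upper = stdev_bounds(mb_out)
print(round(lower, 1), round(upper, 1))
print(flag_outliers(mb_out))
```

In this sample the lower bound comes out negative, which is common for data that cannot go below zero; in that case only the upper bound is really doing any work.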

4. Moving Averages with Standard Deviation

In contrast to the previous methods, the fourth common method uses a moving average. In short, we calculate the average of data points in groups and move in increments to calculate an average for the next group. Therefore, the resulting boundaries will be dynamic. The Splunk search to calculate this is below:

<your spl base search> ... | timechart span=12h sum(mb_out) as mb_out 
 | streamstats window=5 current=true avg("mb_out") as avg stdev("mb_out") as stdev
 | eval lowerBound=(avg-stdev*exact(2)), upperBound=(avg+stdev*exact(2)) 
 | eval isOutlier=if('mb_out' < lowerBound OR 'mb_out' > upperBound, 1, 0) 
Moving Average with Standard Deviation timeline chart

Tip: Notice the “isOutlier” line in the timeline chart. To make smaller values more visible, format the visual by changing the scale from linear to log.
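The moving window is worth experimenting with outside Splunk as well. Below is a minimal Python sketch of the same logic, with made-up data and a hypothetical include_current switch that plays the role of the streamstats ‘current’ option. One thing to check on your own data: when the current point is part of a small window, it pulls the average and standard deviation toward itself. Mathematically, a point can sit at most (n-1)/sqrt(n) sample standard deviations away from the average of its own window, which is about 1.79 for a window of 5, so a multiplier of 2 cannot flag anything in that setup. Excluding the current point (or using a larger window) avoids this. The handling of windows with fewer than two points is my own assumption, not Splunk behaviour.

```python
import statistics

def moving_outliers(values, window=5, multiplier=2, include_current=True):
    """Flag points outside a moving average -/+ multiplier * stdev band."""
    flags = []
    for i, value in enumerate(values):
        if include_current:
            recent = values[max(0, i - window + 1): i + 1]
        else:
            recent = values[max(0, i - window): i]
        # Assumption: with fewer than two points there is no spread to
        # measure, so the point is simply not flagged.
        if len(recent) < 2:
            flags.append(0)
            continue
        avg = statistics.mean(recent)
        stdev = statistics.stdev(recent)
        lower = avg - multiplier * stdev
        upper = avg + multiplier * stdev
        flags.append(1 if value < lower or value > upper else 0)
    return flags

# Made-up mb_out totals with one obvious spike
mb_out = [50, 52, 49, 51, 50, 300, 51]

print(moving_outliers(mb_out, include_current=True))
print(moving_outliers(mb_out, include_current=False))
```

With this sample, the spike of 300 is only flagged when the current point is left out of its own window.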

Using the MLTK Outlier Visualization

Splunk’s Machine Learning Toolkit (MLTK) contains many custom visualizations that we can use to represent data in a meaningful way. Information on all MLTK visuals is detailed in Splunk Docs. We will look specifically at the ‘Outliers Chart’. At a minimum, the outliers chart requires 3 additional fields on top of your ‘_time’ and ‘field_value’. First, you need to create a binary field ‘isOutlier’ which carries the value 1 or 0, indicating whether the data point is an outlier or not. The second and third fields are ‘lowerBound’ and ‘upperBound’, indicating the lower and upper thresholds of your data. Because the outliers chart trims down your data by displaying only the value of each data point and your thresholds, it presents outliers in a clearer and easier to understand manner. We recommend incorporating it in your outlier detection analytics and visuals when available.

Continuing from the previous paragraph, take a look at the snippets below to see how impactful the outliers chart is in comparison to the timeline chart. We reused the same SPL, in the same order, but applied the ‘Outliers Chart’ instead of the timeline visual:

Using an outliers chart to display outliers
Static threshold with outliers chart
Using an outliers chart to display a static threshold (average * multiplier)
Average + Static threshold timeline visual
Using an outliers chart to display 2*Standard Deviation outliers
2*Standard Deviation outliers chart
Using an outliers chart for moving averages
Moving Average with Standard Deviation outliers chart

Advantages:

  • Cleaner presentation and less clutter
  • Easier to understand, as determining the boundaries becomes intuitive vs figuring out which line is the upper or lower threshold

Disadvantages:

  • You need to install Splunk MLTK (and its pre-requisites) to take advantage of the outliers chart
  • Unable to append additional fields in the Outliers chart

Adding Depth to your Outlier Detection

Determining the best technique of outlier detection can become a cumbersome task. Hence, having the right tools and knowledge will free up time for a Splunk Engineer to focus on other activities. Creating static thresholds over time for the past 24 hours, 7 days or 30 days may not be the best approach to finding outliers. A different way to measure outliers could be by looking at the trend on every Monday for the past month, or at 12 noon every day for the past 30 days. We accomplish this by using two simple and useful eval functions:

| eval HourOfDay=strftime(_time, "%H") 
| eval DayOfWeek=strftime(_time, "%A") 
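If you want to double-check what these two format strings produce for a given timestamp, Python’s strftime uses the same %H and %A specifiers. The sketch below is only an illustration: it uses UTC for repeatability, whereas Splunk renders _time in the timezone of the user running the search, and the helper name is made up.

```python
from datetime import datetime, timezone

def hour_and_day(epoch_seconds, tz=timezone.utc):
    """Return (HourOfDay, DayOfWeek) for an epoch timestamp."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=tz)
    return ts.strftime("%H"), ts.strftime("%A")

# 2019-12-09 12:00:00 UTC
print(hour_and_day(1575892800))
```

Keep in mind that %H yields a zero-padded string such as "09", so it sorts correctly as text but needs tonumber() in SPL if you want to do arithmetic on it.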

Using Eval Functions in SPL

Continuing from the previous section, we incorporate the two highlighted eval functions in our SPL to calculate the average ‘mb_out’. However, this time the average is based on the day of the week and the hour of the day. There are a handful of advantages of this method:

  • Extra depth of analysis by adding 2 additional fields you can split the data by
  • Intuitive method of understanding trends

Some use cases of using the eval functions are as follows:

  • Network activity analysis
  • User behaviour analysis
Calculate averages based on day of week and hour of day
Tables representing averages by DayOfWeek & HourOfDay

Visualizing the Data!

We will focus on two visualizations to complement our analysis when utilizing the eval functions. The first visual, discussed before, is the ‘Outliers Chart’, which is a custom visualization in Splunk MLTK. The second visual is another custom visualization, ‘PunchCard’, which can be downloaded from Splunkbase here (https://splunkbase.splunk.com/app/3129/).

The outliers chart has a feature which results in a ‘swim lane’ view of a selected field/dimension and your data points while highlighting points that are outliers. To take advantage of this feature, we will use a Macro “splitby” which creates a hidden field(s) “_<Field(s) you want data to split by>”. The rest of the SPL is shown below

< your base SPL search >  ...  | eventstats avg("mb_out") as avg stdev("mb_out") as stdev  by "HourOfDay" 
| eval avg=round(avg,2) 
| eval stdev=round(stdev,2)
| eval lowerBound=(avg-stdev*exact(2)), upperBound=(avg+stdev*exact(2)) 
| eval isOutlier=if('mb_out' < lowerBound OR 'mb_out' > upperBound, 1, 0) 
| `splitby("HourOfDay")` 
| fields _time, "mb_out", lowerBound, upperBound, isOutlier, * 
| fields - _raw source kb* byt* 
| table _time "mb_out" lowerBound upperBound isOutlier *

This search results in an Outlier Chart that looks like this:

Outliers Chart split by hour of day

The Outliers Chart has the capability to split by multiple fields, however in our example splitting it by a single dimension “HourOfDay” is sufficient to show its usefulness.

The PunchCard visual is the second feature we will use to visualize outliers. It displays cyclical trends in our data by representing aggregated values of your data points over two dimensions or fields. In our example, I’ve calculated the sum of outliers over a month based on “DayOfWeek” as my first dimension and “HourOfDay” as my second dimension. I’ve summed the outliers across these two fields and displayed the result using the PunchCard visual. The SPL and image for this visual are shown below:

< your base SPL search > ... | streamstats window=10 current=true avg("mb_out") as avg stdev("mb_out") as stdev by "DayOfWeek" "HourOfDay"
| eval avg=round(avg,2)
| eval stdev=round(stdev,4)
| eval lowerBound=(avg-stdev*exact(2)), upperBound=(avg+stdev*exact(2))
| eval isOutlier=if('mb_out' < lowerBound OR 'mb_out' > upperBound, 1, 0)
| `splitby("DayOfWeek","HourOfDay")`
| stats sum(isOutlier) as mb_out by DayOfWeek HourOfDay
| table HourOfDay DayOfWeek mb_out
PunchCard Visual
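The final stats step is what feeds the PunchCard: it collapses the flagged events into one count per (DayOfWeek, HourOfDay) cell. Here is a minimal Python sketch of that aggregation. The event tuples and the function name are made up, and UTC is assumed for the timestamps.

```python
from collections import Counter
from datetime import datetime, timezone

def punchcard_counts(events):
    """Sum isOutlier flags per (DayOfWeek, HourOfDay) cell.

    events: iterable of (epoch_seconds, is_outlier) pairs.
    """
    counts = Counter()
    for epoch, is_outlier in events:
        ts = datetime.fromtimestamp(epoch, tz=timezone.utc)
        counts[(ts.strftime("%A"), ts.strftime("%H"))] += is_outlier
    return counts

# Made-up events: two outliers on Monday at 12:00, none at 13:00
events = [
    (1575892800, 1),         # Monday 12:00 UTC
    (1575892800 + 60, 1),    # Monday 12:01 UTC
    (1575892800 + 3600, 0),  # Monday 13:00 UTC
]

for (day, hour), total in punchcard_counts(events).items():
    print(day, hour, total)
```

Cells with a high count are the day and hour combinations where your data most often breaks its own pattern.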

Summary and Wrap Up

Trying to find outliers using Machine Learning techniques can be a daunting task. However, I hope that this blog gives an introduction to how you can accomplish that without using advanced algorithms. Using basic SPL and built-in statistical functions can result in visuals and analysis that are easier for stakeholders to understand and for the analyst to explain. To summarize what we have learnt so far:

  1. One solution does not fit all. There are multiple methods of visualizing your analysis and exploring your result through different visual features should be encouraged
  2. Use Eval functions to calculate “DayOfWeek” and “HourOfDay” wherever and whenever possible. Adding these two functions provides a simple yet powerful tool for the analyst to explore the data with additional depth
  3. Trim or minimize the noise in your Outliers visual by using the Outliers Chart. The chart is beneficial in displaying only your boundaries and outliers in your data while shaving all other unnecessary lines
  4. Use “log” scale over “linear” scale when displaying data with extremely large ranges



Harnessing Ingest-Time Eval Fields

Anyone who is familiar with writing search queries in Splunk would admit that eval is one of the most regularly used commands in their SPL toolkit. It’s up there in the league of stats, timechart, and table.

For the uninitiated, eval, just like in any other programming context, evaluates an expression and returns the result. In Splunk, especially when searching, it holds the same meaning. It is arguably the Swiss Army knife among SPL commands as it lets you use an array of operations like mathematical, statistical, conditional, cryptographic, and text formatting operations to name a few.

Read more about eval here and eval functions here.

What is an Ingest-time Eval?

Until Splunk v7.1, eval was limited to search-time operations. Since the release of 7.2, eval has also been made available at index time. What this means is that all the eval functions can now be used to create fields when the data is being indexed – otherwise known as indexed fields. Indexed fields have always been around in Splunk but didn’t have the breadth of capabilities for populating them until now.

Ingest-time eval doesn’t overlap with other common index-time configurations such as data filtering and routing; it complements them. It lets you enrich the event with fields that can be derived by applying the eval functions on existing data/fields in the event.

One key thing to note is that it doesn’t let you apply any transformation to the raw event data, like masking.

When to use Ingest-time eval

Ingest-time eval can be used in many different ways, such as:

  • Adding data enrichment such as a data center field based on a host naming convention
  • Normalizing fields, such as adding a field with a FQDN when the data only contains a hostname
  • Adding fields used for filtering data before indexing
  • Performing common calculations such as adding a GB field when there is only a MB field or the length of a field with a string

Ingest-time eval can also be used with metrics. Read more here.

When not to use Ingest-time eval

Ingest-time eval, like index-time field extractions, adds a performance overhead on the indexers or heavy forwarders (whichever is handling the parsing of data based on your architecture), as the expressions will be evaluated on all events of the specific sourcetypes you define them for. Since the new fields are going to be permanently added to the data as they are indexed, the increase in disk space utilization needs to be accounted for as well. Also, there is no reverting these new fields, as they are persisted in the index. To remove them, the ingest-time eval configurations need to be disabled or deleted and the affected data left to age out.

When using Ingest-time eval also consider the following:

  • Validate if the requirement is something that can be met by having an eval function at search time – usually this should be yes!
  • Always use a new field name that’s not part of the event data. There should be no conflict with the field name that Splunk automatically extracts with the `KV_MODE=auto` extraction.
  • Always ensure you are applying eval on _raw data unless you have some index time field extraction that’s configured ahead of it in the transforms.conf.

Always ensure that your indexers or heavy forwarders have adequate hardware provisioned to handle the extra load. If they are already performing at full throttle, adding an extra step of processing might be that final straw. Evaluate and upgrade your indexing tier specs first if needed.

Now, let’s see it in action!

Here is an Example…

Let’s assume for a brief moment you are working in Hollywood, with the tiny exception that you don’t get to have coffee with the stars but just work with their “PCI data”. Here’s a sample of the data we are working with. It’s a sample of purchase details that some of my favorite stars made overseas (Disclaimer: The PCI data is fake in case you get any ideas 😉):

2019-12-09 23:46:44,283 - name=Tom Hardy, amount=2620.08063223, currency=USD, dest_country=Tanzania, cc=8888192373782645, cvc=151
2019-12-09 23:46:45,284 - name=Ryan Reynolds, amount=4229.66241228, currency=USD, dest_country=Canada, cc=9999047123456789, cvc=101
2019-12-09 23:46:48,288 - name=Frances McDormund, amount=6033.83328530, currency=USD, dest_country=Budapest, cc=9999513562353615, cvc=856
2019-12-09 23:47:11,320 - name=Daniel Day-Lewis, amount=5603.00466255, currency=USD, dest_country=Iceland, cc=9999463984323578, cvc=029
2019-12-09 23:47:21,333 - name=Clint Eastwood, amount=8321.50139290, currency=USD, dest_country=Sri Lanka, cc=8888847290573791, cvc=347
2019-12-09 23:47:22,335 - name=Tom Hardy, amount=3773.86328145, currency=USD, dest_country=Tanzania, cc=8888192373782645, cvc=151
2019-12-09 23:47:23,336 - name=Jeff Goldblum, amount=9475.63602049, currency=USD, dest_country=Sri Lanka, cc=8888485176493782, cvc=730

Now we are going to create some ingest-time fields:

  1. Converting the name to upper case (just for the sake of it)
  2. Rounding off the amount to two decimal places
  3. Applying a bank field based on the starting four digit of the card number
  4. Applying md5 hashing on the card number
  5. Applying a mask to the card number

First things first, let’s set up our props.conf for the data with all the recommended attributes defined. What really matters in our case here is the TRANSFORMS attribute.

[finlog]
SHOULD_LINEMERGE=false
LINE_BREAKER=([\r\n]+)
TRUNCATE=10000
TIME_FORMAT=%Y-%m-%d %H:%M:%S,%f
MAX_TIMESTAMP_LOOKAHEAD=25
TIME_PREFIX=^
# The order of the values in TRANSFORMS matters. Keep comments on their own line in .conf files.
TRANSFORMS = fineval1, fldext1, fineval2

Now let’s define what the transforms.conf should look like. This essentially is the place where we define all our eval expressions. Each expression is comma separated.

[fineval1]
INGEST_EVAL= uname=upper(replace(_raw, ".+name=([\w\s'-]+),\samount.*","\1")), purchase_amount=round(tonumber(replace(_raw, ".+amount=([\d\.]+),\scurrency.*","\1")),2)
# notice how in each case we have to operate on _raw as name and amount fields are not index-time extracted.

[fldext1]
REGEX = .+cc=(\d{15,16})
FORMAT = cc::"$1"
WRITE_META = true

[fineval2]
# INGEST_EVAL= cc=md5(replace(_raw, ".+cc=(\d{15,16})","\1"))
# have commented above as we need not apply the eval to the _raw data. fldext1 here does index time field extraction so we can apply directly on the extracted field as below...
INGEST_EVAL= cc1=md5(cc), bank=case(substr(cc,0,4)=="9999","BNC",substr(cc,0,4)=="8888","XBS",1=1,"Others"), cc2=replace(cc, "(\d{4})\d{11,12}","\1xxxxxxxxxxxx")

All the above settings should be deployed to the indexer tier or heavy forwarders if that’s where the data is originating from.
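Because these fields are permanent once indexed, it can help to dry-run the expressions against a sample line before deploying anything. The sketch below mirrors the five derived fields in plain Python, using the first sample event from above. It is only an approximation for checking the regular expressions and expected values, not how Splunk evaluates INGEST_EVAL, and the helper name is made up.

```python
import hashlib
import re

# First sample event from the data above
LINE = ("2019-12-09 23:46:44,283 - name=Tom Hardy, amount=2620.08063223, "
        "currency=USD, dest_country=Tanzania, cc=8888192373782645, cvc=151")

# Card prefixes and bank names as used in the 'bank' case() expression
BANKS = {"9999": "BNC", "8888": "XBS"}

def derive_fields(raw):
    """Derive uname, purchase_amount, cc1, bank and cc2 from a raw event."""
    name = re.search(r"name=([\w\s'-]+),\samount", raw).group(1)
    amount = re.search(r"amount=([\d.]+),\scurrency", raw).group(1)
    cc = re.search(r"cc=(\d{15,16})", raw).group(1)
    return {
        "uname": name.upper(),
        "purchase_amount": round(float(amount), 2),
        "cc1": hashlib.md5(cc.encode()).hexdigest(),
        "bank": BANKS.get(cc[:4], "Others"),
        "cc2": re.sub(r"(\d{4})\d{11,12}", r"\1xxxxxxxxxxxx", cc),
    }

for field, value in derive_fields(LINE).items():
    print(f"{field}={value}")
```

Note that an MD5 hash of a card number is easy to brute force, since the space of valid card numbers is small, so treat cc1 as a demonstration of the function rather than a protection measure.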

A couple of things to note: you can define your ingest-time evals in separate stanzas and reference them separately in props.conf. The configuration above is a use case for that. Here I have defined an index-time field extraction to extract the value of the card number. Then, in a separate stanza, I used another ingest-time eval to process that extracted field. This is a good example of reusing a regex (instead of applying it on _raw repeatedly) in case you need to do more than one operation on a specific set of fields.

Now we need to do a little extra work that’s not common with a search time transforms setting. We have to add all the new fields created above to fields.conf with the attribute INDEXED=true denoting these are index time fields. This should be done in the Search Head tier.

[cc1]
INDEXED=true

[cc2]
INDEXED=true

[uname]
INDEXED=true

[purchase_amount]
INDEXED=true

[bank]
INDEXED=true

The result looks like this:

One important note about implementing Ingest-time eval configurations, is that they require manual edits to .conf files as there is no Splunk web option for it. If you are a Splunk Cloud customer, you will need to work with Splunk support to deploy them to the correct locations depending on your architecture.

OK so that’s a quick overview of Ingest-time eval. Hope you now have a pretty fair understanding of how to use them.


Forecasting Time Series Data Using Splunk Machine Learning Toolkit – Part II

Part II of the Forecasting Time Series blog provides a step by step guide for fitting an ARIMA model using Splunk’s Machine Learning Toolkit. ARIMA models can be used in a variety of business use cases. Here are a few examples of where we can use them:

  • Detecting anomalies and their impact on the data
  • Predicting seasonal patterns in sales/revenue
  • Streamlining short-term forecasting by determining confidence intervals

From Part 1 of the blog series, we identified how you can use a Kalman Filter for forecasting. The resulting graphs demonstrated how it was also useful in reducing/filtering noise (which is how it gets its name ‘Filter’). ARIMA, on the other hand, belongs to a different class of models. In comparison to a Kalman filter, ARIMA models work on data that has moving averages over time or where the value of a data point is linearly dependent on its previous value(s). In these two scenarios it makes more sense to use ARIMA over a Kalman Filter. However, good judgement, an understanding of the dataset and the objective of the forecast should always be the primary means of choosing the algorithm.

Objective

Part II of this blog series aims to familiarize a Splunk user with using the MLTK Assistant to forecast their time series data, particularly with the ARIMA option. This blog is intended as a guide to determining the parameters and steps for using ARIMA on your data. In fact, it is a generalized template that can be used with any processed data to forecast with ARIMA in Splunk’s MLTK. An advantage of using Splunk for forecasting is that you can observe the raw data side by side with the predicted data and, once the analysis is complete, create alerts or other actions based on a future prediction. We will talk more about creating alerts based on predicted or forecasted data in a future blog (see what I predicted there ;)?)

If you have read part I of our blog, we will reuse the same dataset process_time.csv for this part. If not, click here to navigate to part I to understand the dataset.

Fundamental Concept for ARIMA Forecasting

A fundamental concept to understand before we move ahead with ARIMA is that the model works best with stationary data. Stationary data has statistical properties that do not change over time; in particular, its average value is independent of time.

A simple example of non-stationary data is shown in the two graphs below: the first without a trendline, the second with a yellow trendline to show an average increase in the value of our data points. The data needs to be transformed into stationary data to remove the increasing trend.

Using Splunk’s autoregress command we can apply differencing to our data. The results are immediately visible through line chart visual! The below command can be used on any time series data set to demonstrate differencing.

… | autoregress value | eval new_value=value-value_p1 | fields _time new_value

Without creating a trendline for the graph below, we can see that the data fluctuates around a constant mean value of ‘0’, which tells us that differencing has been applied. Differencing to make the data stationary can increase the accuracy and fit of our ARIMA forecast. To read more about differencing and other rules that apply to ARIMA, navigate to the Duke URL provided in the useful links section:

Differencing is simply subtracting the previous data point from the current one. In our example we are only applying differencing of order 1, meaning we subtract from each data point the one immediately before it. There are different types of non-stationary graphs, which require in-depth domain knowledge of ARIMA; however, we simplify it in this blog and use differencing to remove the non-constant trend in this example 😊!
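To see what differencing does to a trending series, here is a tiny Python sketch with made-up numbers. The function name is hypothetical, and the lag of 1 corresponds to the value_p1 field that autoregress creates by default.

```python
def difference(values, lag=1):
    """Subtract from each point the point 'lag' steps before it."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]

# Made-up series with an accelerating upward trend
series = [10, 12, 15, 19, 24]

first_order = difference(series)
print(first_order)              # still trending upward

second_order = difference(first_order)
print(second_order)             # fluctuates around a constant value
```

The differenced series is one point shorter than the original for each pass. In the SPL version this shows up as an empty new_value on the first event, since it has no previous value to subtract.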

From part 1 of this blog series we can see that our data does not have a constant trend, as a result we apply differencing to our dataset. The step to apply differencing from the MLTK Assistant is detailed in the ‘Determining Starting Points’ section. Differencing in ARIMA allows the user to see spikes or drops (outliers) in a different perspective in comparison to Kalman Filter.

Walkthrough of MLTK Assistant for ARIMA

ARIMA is a popular and robust method of forecasting long-term data. From blog 1 we can describe the Kalman Filter’s forecasting capabilities as extending the existing pattern/spikes, sort of a copy-paste method which may be advantageous when forecasting short-term data. ARIMA has an advantage in predicting data points when we are uncertain about the future trend of the data points in the long term. Now that we have got you excited about ARIMA, let’s see how we can use it in Splunk’s MLTK!

We use the Machine Learning Toolkit Assistant for forecasting timeseries data in Splunk. Navigate to the Forecast Time Series Assistant page (Under the Classic Menu option) and use the Splunk ‘inputlookup’ command to view the process_time.csv file.

|inputlookup process_time.csv

Once we add the dataset click on Algorithm and select ‘ARIMA’ (Autoregressive Integrated Moving Average), and ‘value’ as your field to forecast. You will notice that the ARIMA arguments will appear.

There are three arguments that make up the ARIMA model:

Argument | Definition
AutoRegressive (p) | The autoregressive (AR) component refers to the use of past values in the regression equation. The higher the value, the more past terms are used in the equation. This concept is also called ‘lags’. Another way of describing it is that the value of a data point depends on its previous values, e.g. the process time right now depends on the process time 30 seconds before (from our data set).
Integrated (d) | The d represents the degree of differencing, as discussed in the previous section. This makes up the integrated component of the ARIMA model and is needed to satisfy the stationarity assumption on the data.
Moving Average (q) | Moving average in ARIMA refers to the use of past errors in the equation. It is the use of lagging (like AR), but for the error terms.

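
For reference, the three arguments fit together as follows in the standard textbook form of the model. The symbols c, φ, θ and ε are notation assumed here for illustration; they do not come from the MLTK assistant.

```latex
y'_t = c + \sum_{i=1}^{p} \phi_i \, y'_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t
```

Here y'_t is the series after differencing d times (for d = 1, y'_t = y_t - y_{t-1}), the first sum is the AR part using p past values, and the second sum is the MA part using q past errors.
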
Determine Starting Points

Identify the Order of Differencing (d)

As a refresher, we use the same dataset we worked with in part 1 of the blog series regarding the Kalman filter. After inputting my process_time.csv file in the assistant, I enter the future_timespan variable as 20 and the holdback as 20. I’ve kept the confidence interval at its default value of ‘95’. Once the argument values are populated, click on ‘Forecast’ to see the resulting graphs.

As a note, my ARIMA arguments described above are ARIMA(0,0,0), which can be represented as a mathematical function ARIMA(p,d,q), where p, d and q are all 0. We use this functional representation of the variables frequently in this blog for consistency with commonly used mathematical notation.

When we click on forecast, observe the line chart in the results. The graph confirms that the data is non-stationary, so we will apply differencing to make it stationary. We can accomplish this by increasing the value of our ‘d’ argument from ‘0’ to ‘1’ in the forecasting assistant and clicking on forecast again. This step is essential to meet one of the main criteria for using ARIMA discussed in the ‘Fundamental Concept for ARIMA’ section.

Identifying AR(p) and MA(q)

After we apply differencing to our data, our next step is to determine the AR or MA terms that mitigate any autocorrelation in our data. There are two popular methods of estimating these two parameters. We will expand on one of the methods in this blog.

Method 1

The first method for estimating the values of ‘p’ and ‘q’ is to use the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC); however, using them is outside the scope of this blog, as we will use a different method given the tools we have at hand in the MLTK. For the curious mind, the following blog contains detailed information on using AIC and BIC to determine our ‘p’ and ‘q’ values:

Method 2

After we have applied differencing to our time series data, we review the PACF and the ACF plots to determine an order for AR(p) or MA(q). We will apply ARIMA(0,1,0) in our ARIMA MLTK assistant and then click on ‘Forecast’ to view the resulting graphs. The below image shows the values that we entered in the assistant:

Once we click on forecast, we view the PACF plot to estimate a value for the AR(p) term. Similarly, we use the ACF plot to estimate a value for the MA(q) term. The graphs are shown in the screenshot below.

We examine the PACF plot for a suggestion for our AR value, by counting the prominent high spikes. From the plot below I’ve circled the prominent spikes in the PACF graph. The value of AR (p) that we pick is 4.

We examine the ACF plot for a suggestion for our MA value, by counting the prominent high spikes. From the plot below I’ve circled the prominent spikes in the ACF graph. The value of MA (q) that we pick is 5.
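
For readers who want to see what the two plots are measuring, below is a rough plain-Python sketch of the sample ACF and PACF (the latter via the Durbin-Levinson recursion). It is an approximation for illustration only: the function names and the sample series are hypothetical, and the ±1.96/√n band used to call a spike ‘prominent’ is a common rule of thumb that I am assuming here, not necessarily the band the assistant draws.

```python
import math


def acf(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


def pacf(values, max_lag):
    """Partial autocorrelation for lags 0..max_lag (Durbin-Levinson)."""
    rho = acf(values, max_lag)
    result = [1.0]
    phi_prev = []  # coefficients of the previous (k-1)-lag fit
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        phi_curr = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_curr.append(phi_kk)
        result.append(phi_kk)
        phi_prev = phi_curr
    return result


def prominent_lags(correlations, n):
    """Lags (excluding lag 0) whose correlation lies outside +/-1.96/sqrt(n)."""
    bound = 1.96 / math.sqrt(n)
    return [lag for lag, r in enumerate(correlations) if lag > 0 and abs(r) > bound]


# Hypothetical differenced series, used only to exercise the functions.
diffed = [2, -1, 3, -2, 4, -3, 2, -1, 3, -2]
print(prominent_lags(pacf(diffed, 4), len(diffed)))  # candidate lags for AR (p)
print(prominent_lags(acf(diffed, 4), len(diffed)))   # candidate lags for MA (q)
```

Counting the lags returned for the PACF and the ACF mirrors the ‘count the prominent spikes’ step described above, with the same caveat that the cut-off is a judgement call.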

We can now add the value for the integrated parameter (d = 1) and our estimates for AR (p = 4) and MA (q = 5) in the Splunk MLTK. Once they are added in the assistant, click on ‘Forecast’.

For this particular combination of values we can see that once we click on ‘Forecast’, we get an error regarding the ‘invertibility’ of the dataset, as shown in the screenshot below. Without going too deep into the mathematics, it means that our model does not converge when it forecasts. I’ve added a link in the Useful Links section at the end for your interest! This error can be resolved by adjusting the values of the model, using the ‘trial and error’ approach explained in the next section.

Optimize Your P and Q Values

This method of estimating AR and MA is subjective, since it depends on what is considered a ‘prominent spike’, and it can result in values of ‘p’ and ‘q’ that are not an optimal fit for the data. To resolve this, we constructed a table displaying the R-squared and Root Mean Square Error (RMSE) values from the model error statistics in the MLTK assistant for each combination of ‘p’ and ‘q’. An empty cell indicates an invertibility error, while the other cells contain the values of R-squared and RMSE.

A higher R-squared indicates that the model fits the data better. R-squared is the proportion of the variability in the process time data points that the model can explain.

On the other hand, the lower the RMSE, the better the fit of the model. RMSE measures the typical size of the difference between the data points the model predicted and our holdback points from the raw data.
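
As a quick illustration of the two statistics, here is a minimal Python sketch using their usual textbook definitions. I am assuming the assistant computes them the same way; the function names and sample numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared difference between actual and predicted."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical holdback points and the model's predictions for them.
holdback = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
print(round(rmse(holdback, predicted), 4))       # 2.5495
print(round(r_squared(holdback, predicted), 4))  # 0.948
```

With this definition R-squared can be negative, which happens when the model predicts the holdback points worse than simply using their mean.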

We pick the values of ‘p’ and ‘q’ that minimize RMSE and maximize R-squared as the best fit to our data. From the table below we can see that q=5 and p=5 optimize the prediction for us.

Integrated (d) = 0 | AutoRegressive (p): 0, 1, 2, 3, 4, 5
Moving Average (q) = 0 | R2 Stat: -0.0015, RMSE: 19.31 | R2 Stat: 0.1976, RMSE: 16.35 | R2 Stat: 0.1977, RMSE: 16.34 | R2 Stat: 0.2699, RMSE: 15.60 | R2 Stat: 0.2696, RMSE: 15.60 | R2 Stat: 0.3114, RMSE: 15.14
Moving Average (q) = 1 | R2 Stat: 0.2401, RMSE: 15.91 | R2 Stat: 0.2486, RMSE: 15.82 | R2 Stat: 0.2780, RMSE: 15.51 | R2 Stat: 0.2329, RMSE: 15.98 | R2 Stat: 0.4053, RMSE: 14.07
Moving Average (q) = 2 | R2 Stat: 0.2452, RMSE: 15.85 | R2 Stat: 0.3017, RMSE: 15.25 | R2 Stat: 0.3214, RMSE: 15.03
Moving Average (q) = 3 | R2 Stat: 0.2872, RMSE: 15.41 | R2 Stat: 0.4185, RMSE: 13.92 | R2 Stat: 0.4428, RMSE: 13.62 | (empty) | R2 Stat: 0.4343, RMSE: 13.72 | R2 Stat: 0.4456, RMSE: 13.58
Moving Average (q) = 4 | R2 Stat: 0.2826, RMSE: 15.46 | R2 Stat: 0.4185, RMSE: 13.92 | R2 Stat: 0.3241, RMSE: 15.00
Moving Average (q) = 5 | R2 Stat: 0.2826, RMSE: 15.46 | R2 Stat: 0.3133, RMSE: 15.99 | R2 Stat: 0.4385, RMSE: 13.67 | R2 Stat: 0.4515, RMSE: 13.52

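
The selection step can also be scripted. The sketch below is an illustration only: it copies a subset of the cells from the table (the q = 0 and q = 3 rows plus the p = 5, q = 5 cell), uses `None` for the empty cell, and picks the lowest RMSE. The function name and the tie-break on R-squared are my own assumptions.

```python
# (p, q) -> (R-squared, RMSE); None marks an invertibility error.
results = {
    (0, 0): (-0.0015, 19.31), (1, 0): (0.1976, 16.35), (2, 0): (0.1977, 16.34),
    (3, 0): (0.2699, 15.60), (4, 0): (0.2696, 15.60), (5, 0): (0.3114, 15.14),
    (0, 3): (0.2872, 15.41), (1, 3): (0.4185, 13.92), (2, 3): (0.4428, 13.62),
    (3, 3): None, (4, 3): (0.4343, 13.72), (5, 3): (0.4456, 13.58),
    (5, 5): (0.4515, 13.52),
}


def best_order(results):
    """Return the (p, q) pair with the lowest RMSE, skipping failed fits."""
    valid = {order: stats for order, stats in results.items() if stats is not None}
    # Lowest RMSE wins; ties are broken by the higher R-squared.
    return min(valid, key=lambda order: (valid[order][1], -valid[order][0]))


print(best_order(results))  # (5, 5)
```
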
Viewing Your Results

Once we have picked the values of p and q that optimize our model, we can go ahead and plug the numbers into our assistant and click on forecast to display the forecasted graph. The values to plug into the assistant are as follows: p = 5, d = 1, q = 5, holdback = 20, forecast = 20. The screenshots below show the values entered in the assistant and the resulting forecast graph.

At this point many would be satisfied with the forecast, as the visual of the data itself is enough to analyse, assess and then make a judgement on the action(s) to take. The next step details how you can view the data and lists some ideas for alerts that can be constructed.

Next Step

We can view the SPL powering the graph by clicking on either ‘Open in Search’ or ‘Show SPL’. I prefer the ‘Open in Search’ option as it automatically opens a new tab, allowing me to further understand how the SPL is constructed in the forecast and to view the data. Once the new browser tab opens, click on the ‘Statistics’ option to view the raw data points, predicted data points and the confidence intervals created by our model. I have added the SPL from the image below for your convenience:

| inputlookup process_time.csv | fit ARIMA _time value holdback=20 conf_interval=95 order=5-1-5 forecast_k=40 as prediction | `forecastviz(40, 20, "value", 95)`

I added another filter to my SPL to only view the forecasted process data from the ARIMA model as shown below:

| inputlookup process_time.csv | fit ARIMA _time value holdback=20 conf_interval=95 order=5-1-5 forecast_k=40 as prediction | `forecastviz(40, 20, "value", 95)` | search "lower95(prediction)"=*

The resulting table lists all the necessary data in a clean tabular format (that we are all familiar with) for creating alerts based on our predicted process time. Here are some ideas on creating alerts based on the data we worked with:

  1. Create an alert when the predicted value of the process time goes above a certain threshold
  2. Create an alert when the average process time over a timespan is predicted to stay above normal limits
  3. Create an alert based on outlier detection, when the predicted data is outside the lower or upper boundaries

Creating alerts based on our predicted data allows us to be proactive about potential increases or decreases in our input variable.
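
To show the logic behind the three alert ideas, here is a small Python sketch that works on rows shaped like the statistics table. In Splunk itself you would express these conditions in SPL; this is only an illustration. The field name "lower95(prediction)" comes from the search above, while "upper95(prediction)", the function names, the thresholds and the sample rows are assumptions, and idea 3 is interpreted here as an observed value falling outside the band.

```python
def exceeds_threshold(rows, threshold):
    """Idea 1: forecast rows whose predicted value is above the threshold."""
    return [r for r in rows if r["prediction"] > threshold]


def average_above(rows, limit):
    """Idea 2: True when the mean predicted value over the rows exceeds the limit."""
    mean = sum(r["prediction"] for r in rows) / len(rows)
    return mean > limit


def outside_band(rows):
    """Idea 3: rows whose observed value falls outside the confidence band."""
    return [r for r in rows
            if r.get("value") is not None
            and not (r["lower95(prediction)"] <= r["value"] <= r["upper95(prediction)"])]


# Hypothetical rows; "value" is None for purely forecasted points.
rows = [
    {"prediction": 40.0, "lower95(prediction)": 30.0, "upper95(prediction)": 50.0, "value": 55.0},
    {"prediction": 62.0, "lower95(prediction)": 45.0, "upper95(prediction)": 79.0, "value": None},
    {"prediction": 48.0, "lower95(prediction)": 38.0, "upper95(prediction)": 58.0, "value": 47.0},
]
print(len(exceeds_threshold(rows, 60)))  # 1
print(average_above(rows, 45))           # True
print(len(outside_band(rows)))           # 1
```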

Summarizing ARIMA Forecasting in MLTK

Let’s summarize what we have discussed so far in this blog:

  1. Understand the mathematical prerequisites of the model
  2. Determine the differencing requirement
  3. Determine starting values for AR() and MA()
  4. Optimize your AR() and MA() values based on error statistics
  5. Forecast your data based on the values decided in Step 4
  6. View the data and determine any alert conditions

Prior to the above steps, we need to ensure that our data has been pre-processed or transformed in an MLTK-friendly manner. The pre-processing steps include, but are not limited to: ensuring there are no gaps in the time series data, determining the relevance of the data to forecasting, and grouping the data in time intervals (30 seconds, 1 minute, etc.). The pre-processing steps are important to create uniformity in the data input, allowing Splunk’s MLTK to analyse and forecast your data.
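
Two of these pre-processing steps, grouping into time intervals and checking for gaps, can be sketched in a few lines of Python. This is an illustration of the idea rather than how Splunk does it; the function names, the use of epoch seconds and the sample events are assumptions.

```python
def bucket_average(events, span=30):
    """Group (epoch_seconds, value) pairs into span-second buckets and average each."""
    buckets = {}
    for ts, value in events:
        start = ts - ts % span  # start of the bucket this event falls into
        buckets.setdefault(start, []).append(value)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}


def missing_buckets(bucketed, span=30):
    """Bucket start times with no data between the first and last bucket."""
    starts = sorted(bucketed)
    expected = range(starts[0], starts[-1] + span, span)
    return [s for s in expected if s not in bucketed]


# Hypothetical events: nothing arrives between second 60 and second 89.
events = [(0, 10), (10, 20), (35, 30), (95, 50)]
bucketed = bucket_average(events)
print(bucketed)                   # {0: 15.0, 30: 30.0, 90: 50.0}
print(missing_buckets(bucketed))  # [60]
```

Any bucket reported as missing would need to be filled or otherwise handled before forecasting.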

Hopefully this blog streamlines the process of forecasting using ARIMA in Splunk’s MLTK. As with any algorithm, there are limitations to forecasting using this method. As understanding them involves more theoretical knowledge of mathematics, I’ve added two links in the Useful Links section (one on ‘datascienceplus.com’ and the other on ‘emeraldinsight.com’) for further reading.


Looking to expedite your success with Splunk? Click here to view our Splunk Professional Service offerings.

© Discovered Intelligence Inc., 2019. Unauthorised use and/or duplication of this material without express and written permission from this site’s owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Discovered Intelligence, with appropriate and specific direction (i.e. a linked URL) to this original content.

Useful Links
https://www.datascience.com/blog/introduction-to-forecasting-with-arima-in-r-learn-data-science-tutorials
http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit
https://datascienceplus.com/forecasting-with-arima-part-i/
https://www.emeraldinsight.com/doi/full/10.1108/14635780710746902
https://people.duke.edu/~rnau/411arim2.htm

FloCon 2019 and the Data Challenges Faced by Security Teams

In early January of this year I was able to attend FloCon 2019 in New Orleans. In this posting, I will provide a little bit of insight into this security conference, some of the sessions that I attended, and detail some of the major data challenges facing security teams.

It was not hard to convince me to go to New Orleans, for obvious reasons: the weather is far nicer than Toronto’s during the winter, there is Cajun food, and there was the chance to watch Anthony Davis play at the Smoothie King Center. I also decided to stay at a hotel off of Bourbon Street, which happened to be a great decision. Others attending FloCon who arranged accommodations on Bourbon Street ended up needing ear plugs to get a good night’s rest! Enough about that though; let’s talk about the conference itself.

About FloCon

FloCon is geared towards researchers, developers, security experts and analysts looking for techniques to analyze and visualize data for protection and defense of networked systems. The conference is quite small, with a few hundred attendees rather than the thousands that attend conferences I have attended in the past, like Splunk Conf and the MIT Sloan Conference. However, the smaller number of attendees in no way translated to a worse experience. FloCon was mostly single track (other than the first day), which meant I did not have to reserve my spot for popular sessions. The smaller number also resulted in greater audience participation.

The first day was split between two tracks: (1) How to be an Analyst and (2) BRO training. I chose to attend the “How to be an Analyst” track. For the first half of the day, participants of the Analyst track were given a hypothetical situation, which was followed by discussions on hypothesis testing and what kind of data would be of interest to an analyst in order to determine a positive threat. The hypothetical situation in this case was potential vote tampering (remember this is a hypothetical situation). The second half of the day was supposed to be a team game which involved questions and scoring based on multiple-choice answers. However, the game itself could not be scaled out to the number of participants, therefore the game was completed with all participants working together, which led to some interesting discussions. The game needs some work, but it was interesting to see how different participants thought through the scenarios and how individuals would go about investigating Indicators Of Compromise (IOCs).

The remaining three days saw different speakers present their research on machine learning, applying different algorithms to network traffic data, previous work experience as penetration testers, key performance indicators, etc. The most notable of the speakers was Jason Chan, VP of Cloud Security at Netflix. Despite some of the sessions being heavily research based and with a lot of graphs (some of which I’m sure went over the heads of some of us in attendance), common themes kept arising about the challenges faced by organizations – all of which Discovered Intelligence has encountered on projects. I have identified some of these challenges below.

Challenge: Lack of Data Standardization Breaks Algorithms

I think everyone knows that scrubbing data is a pain. It does not help that companies often change the log format with new releases of software. I have seen new versions break queries because a simple change has cascading effects on the logic built into the SIEM. Changing the way information is presented can also break machine learning algorithms, because the required inputs are no longer available.

Challenge: Under Investing in Fine-Tuning & Re-iterations

Organizations tend to underestimate the amount of time needed to fine-tune queries intended for threat hunting and anomaly detection. The first iteration of a query is rarely perfect and although it may trigger an IOC, analysts will start to realize there are exceptions and false positives. Therefore, over time teams must fine-tune the original queries to be more accurate. The security team for the Bank of England spends approximately 80% of their time developing and fine-tuning use cases! The primary goal is to eliminate alert fatigue and to keep everything up-to-date in an ever-changing technological world. I do not think there is a team out there that gets “too few” security alerts. For most organizations, the reality is that there are too many alerts and not enough resources to investigate. Since there are not enough resources, fine-tuning efforts never happen and analysts will begin to ignore alerts which trigger too often.

An example of the first iteration of an alert which can generate high volumes is failed authentications to a cloud infrastructure. If the organization utilizes AWS or Microsoft Cloud, they may see a huge number of authentication failures for their users. Why and how? Bad actors are able to identify sign-in names and emails from social media sites, such as LinkedIn or company websites. Given the frequently used standards, there is a good chance that bad actors can guess usernames just based off an individual’s first and last name. Can you stop bots from trying to access your Cloud environment? Unlikely, and if you could, the options are limited. After all, the whole point of Cloud is the ability to access it anywhere. All you can really do is minimize risk by requiring internal employees to use things like multi-factor authentication, biometric data or VPN. At least this way, even if a password was obtained, a bad actor will have difficulty with the next layer of security. In this type of situation though, alerting on failed authentications alone is not the best approach and creates a lot of noise. Instead, what teams might start to do is correlate authentication failures with users who do not have multi-factor enabled, thereby paying more attention to those who are at greater risk of a compromised account. These queries evolve through re-iteration and fine-tuning, something which many organizations continue to underinvest in.
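
As a rough illustration of that second iteration, the Python sketch below counts failed logins per user and keeps only the users without multi-factor enabled. In practice this would be a SIEM query; the function name, the event field "user", the lookup of MFA status and the threshold of 5 are all hypothetical.

```python
from collections import Counter


def risky_failures(failed_logins, mfa_enabled, min_failures=5):
    """Users without MFA whose failed-login count reaches min_failures."""
    counts = Counter(event["user"] for event in failed_logins)
    # Users missing from the MFA lookup are treated as not having MFA.
    return {user: n for user, n in counts.items()
            if n >= min_failures and not mfa_enabled.get(user, False)}


# Hypothetical failed-login events and MFA enrolment lookup.
failed = [{"user": "alice"}] * 6 + [{"user": "bob"}] * 7 + [{"user": "carol"}] * 2
mfa = {"alice": True, "bob": False}
print(risky_failures(failed, mfa))  # {'bob': 7}
```

Alice fails just as often as Bob in this sample, but only Bob is surfaced, because a guessed password alone would be enough to compromise his account.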

Challenge: The Need to Understand Data & Prioritize

Before threats and anomalies can be detected accurately and efforts divided appropriately, teams have to understand their data. For example, if the organization uses multi-factor, how does that impact authentication logs? Which event codes are teams interested in for failed authentications on domain controllers? Is there a list of assets and identities, so teams can at least prioritize investigations for critical assets and personnel with privileged access?

A good example of the need to understand data is multi-factor and authentication events. Let’s say an individual is based out of Seattle and accessing AWS infrastructure requiring Okta multi-factor authentication. The first login attempt will come from an IP in Seattle, but the multi-factor authentication event is generated in Virginia. These two authentication events happen within seconds of each other. A SIEM may trigger an alert for this user because it is impossible for the user to be in both Seattle and Virginia in the given timeframe. Therefore, logic has to be built into the SIEM, so this type of activity is taken into consideration and teams are not alerted.
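
One way to picture that logic is the simplified Python sketch below. It flags consecutive logins by the same user from different cities within a short window, but ignores events that come from a multi-factor provider. The function name, the event fields ("user", "time", "city", "source"), the source labels and the 60-second window are all hypothetical; a real SIEM rule would be considerably more involved.

```python
def impossible_travel_alerts(events, window=60, mfa_sources=("okta_mfa",)):
    """Pairs of logins by one user from different cities within `window` seconds."""
    alerts = []
    last_seen = {}
    for event in sorted(events, key=lambda e: e["time"]):
        if event["source"] in mfa_sources:
            # MFA verification is logged at the provider's location, so skip it.
            continue
        prev = last_seen.get(event["user"])
        if prev is not None:
            close_in_time = event["time"] - prev["time"] <= window
            moved = event["city"] != prev["city"]
            if close_in_time and moved:
                alerts.append((prev, event))
        last_seen[event["user"]] = event
    return alerts


# Hypothetical events matching the Seattle / Virginia example.
events = [
    {"user": "sam", "time": 0, "city": "Seattle", "source": "aws_signin"},
    {"user": "sam", "time": 5, "city": "Virginia", "source": "okta_mfa"},
]
print(len(impossible_travel_alerts(events)))                  # 0
print(len(impossible_travel_alerts(events, mfa_sources=())))  # 1
```

The second call shows the false positive that appears when the rule does not know which sources are multi-factor events.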

Challenge: The Security Analyst Skills Gap

Have you ever met an IT, security or dev ops team with too little work or spare time? I personally have not. Most of the time there is too much work and not enough of the right people. Without the right people, projects and tasks get prolonged. As a result, the costs and risks only rise over time. Finding the right people is a common problem and not one just faced by the security industry, but there is clearly a gap between the positions available and the skills in the workforce.

Challenge: Marketing Hype Has Taken Over

We hear the words all the time. Machine Learning. Artificial Intelligence. Data Scientists. How many true data scientists have you met? How many organizations are utilizing machine learning outside of manufacturing, telematics and smart buildings? Success stories are presented to us everywhere, but the amount of effort to get to that level of maturity is immense and there is still a lot of work to be done for high levels of automation to become the norm in the security realm.

In most cases, organizations are looking at data for the first time and leveraging new platforms for the first time. They still do not know what normal behaviour looks like in order to determine an event as an anomaly. Even then, how many organizations can efficiently go through a year’s worth of data to baseline behaviour? Do they have the processing power? Can it scale out to the entire organization? Although there is some success, a turnkey solution really does not exist. Each organization is unique. It takes time, the right culture, roadmap planning and the right leadership to get to the next level.

Challenge: How Do You Centralize Logs? Understanding the Complete Picture

In order to accomplish sophisticated threat hunting and anomaly detection, a number of different data sources must be correlated to understand the complete picture. These sources include AD logs, firewall logs, authentication, VPN, DHCP, network flow, etc. Many of these are high volume data sources, so how will people analyze the information efficiently? Organizations have turned to SIEMs to accomplish this. Although SIEMs work well in smaller environments, scaling out appropriately is a significant challenge due to data volumes, a lack of resources (both people and infrastructure) and the lack of training and education for users and senior management.

In most cases, a security investigation begins and analysts start to realize there are missing pieces and missing data sets to get the complete picture of what is happening. At that point, additional data sources must be on-boarded and the fine-tuning process starts again.

Wrap Up and FloCon Presentations

This posting highlights some of the data challenges that are facing security teams today. These challenges are present in all industry verticals, but with the right people and direction companies can begin to mature and automate processes to identify threats and anomalies efficiently. Oh, and did I mention, with our industry-leading security data and Splunk expertise, Discovered Intelligence can help with this!

Overall FloCon was a great learning experience and I hope to be able to attend again some time in the future. The FloCon 2019 presentations are available for review and download here: https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=540074




Splunk Enterprise 7.2 New Features

Splunk Enterprise 7.2 is the latest release from Splunk and was made available during Splunk .conf18 in Orlando. Many new features were added which will improve Splunk Enterprise from administration and user experience, to analytics and data onboarding.

Help Getting Started with Splunk

Splunk is a great data intelligence platform when used effectively. With a full understanding of Splunk’s functionality and capabilities, it should totally consume you with its awesomeness and you will find yourself preaching its benefits to your entire company! Our customers are always asking for recommendations on how to better grasp the fundamentals of the platform and the following article should provide this guidance.

Forecasting Time Series Data Using Splunk Machine Learning Toolkit – Part I

In this blog we will begin to show how Splunk and the Machine Learning Toolkit can be used with time series data, which is generally the most common type of data that is kept in Splunk!

Splunk Operational Intelligence Cookbook – Third Edition – Now Available

Looking to master your operational data? Authored by leading experts from Discovered Intelligence, the Third Edition of the Splunk Operational Intelligence Cookbook has been completely refreshed for Splunk 7.1 and provides hands-on, easy to follow recipes that will have you mastering Splunk and discovering new insights from your operational data in no time. Leveraging our years of expertise, the book is filled with best practices and packed with content that will get you hands-on with Splunk right from the first chapter.

The book is published and you can order your copy right now!

 
