Analyzing Syslog files can be easy...
Reading in syslog files is easy; many scripting languages provide the means to do that. But can you also run a quick analysis with these languages?
Your task: Extract the IP addresses out of the text portion and get the access count of every single address.
Dedicated Time: None. Your Boss stands behind you...
The whole solution, only a handful of operators.
The first step was to read in the compressed logfile and convert it into a table.
You can inspect the first part of the resulting table by hovering the mouse over the output connector 'o1' of 'strexplode1'. It is a vector (a table without a column header) with 100001 rows.
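For comparison, here is how the same first step might look in a scripting language. This is only a minimal Python sketch: it assumes the logfile is gzip-compressed (the compression format is not stated above), and the function name `read_log_lines` is made up for illustration.

```python
import gzip


def read_log_lines(path):
    """Read a gzip-compressed text log and return its lines as a list.

    Assumption: the file is gzip-compressed. Undecodable bytes are
    replaced so that a single damaged line does not abort the read.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        # One list entry per log line, without the trailing newline.
        return [line.rstrip("\n") for line in handle]
```

The returned list plays the role of the vector above: one string per row, no column header.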
Our next job is to split the strings into columns.
We route the vector into a macro 'GetPriority'. You may notice the lock icon; it marks this macro as an operator class. Operator classes are a great way to create 'reusable code'. If you need to change the behaviour of an operator class, let's say because of an error, you can do that without having to alter every instance of this macro in this or other FlowSheets. You only have to update the FlowSheet once...
Because we can expect four 'spaces' as delimiters, the best fit here is a 'strexplode' operator. It gets the remaining part of the message :
and delivers back this table :
The single parts of the message are now joined together into a new table.
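In a scripting language, the split at four space delimiters could be sketched as follows. The sample line layout (month, day, time, host, text) is an assumption made for illustration, since the actual message format is not shown here, and `split_fields` is a hypothetical helper name.

```python
def split_fields(message, delimiters=4):
    """Split a message at the first `delimiters` spaces.

    Everything after the last split point stays together in the final
    column, so spaces inside the free-text part are preserved.
    """
    return message.split(" ", delimiters)


# Assumed layout for illustration: month, day, time, host, text.
sample = "Oct 11 22:14:15 myhost sshd[123]: Failed password from 192.168.0.1"
columns = split_fields(sample)
```

Limiting the number of splits is what keeps the text portion in one piece; an unlimited split would scatter it over many columns.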
Now we extract the IP address out of the text column. This is also a combination of string splits, no magic.
We attach the IP address to the table.
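A script would have to do the same extraction somehow. The FlowSheet uses a chain of string splits; the Python sketch below swaps in a regular expression instead, which is a different technique chosen for brevity. It assumes IPv4 addresses in dotted notation, and `extract_ip` is a hypothetical name.

```python
import re

# Four groups of 1-3 digits separated by dots. This matches the dotted
# shape only; it does not check that each octet is in the range 0-255.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ip(text):
    """Return the first IPv4-looking token in `text`, or None if absent."""
    match = IPV4_PATTERN.search(text)
    return match.group(0) if match else None
```

Applying `extract_ip` to every entry of the text column yields the extra IP column that gets attached to the table.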
Now the fun part. We now have ~100000 IPs. In a traditional scripting language you might start programming loops here; we use the operator dcCompressWizard from the DataCube library instead.
We just need to set two checkboxes and select the accumulation method.
Result: a new table containing the unique IP addresses along with their access counts.
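To compare with the scripting route: the grouping and counting step can be sketched in Python with `collections.Counter` from the standard library. This is not how dcCompressWizard works internally, just an equivalent result under the assumption that the IP column is a plain list of strings; `count_accesses` is a hypothetical name.

```python
from collections import Counter


def count_accesses(ips):
    """Count how often each IP address occurs, skipping missing values."""
    return Counter(ip for ip in ips if ip is not None)


# Tiny illustrative input; a real run would pass the full IP column.
counts = count_accesses(["10.0.0.1", "10.0.0.2", "10.0.0.1", None])
busiest = counts.most_common(1)  # address with the highest access count
```

`Counter.most_common()` returns the addresses sorted by access count, which is usually the view the boss asks for next.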
Your boss is happy! (It took only 5:35 minutes.)
A big advantage is that you can inspect the results after each step in the processing chain.
You develop WITH and not FOR the data...
Feel your data flow...
Download FlowSheet (ZIP archive)