I have a fixed-width (column-based) file with no delimiter separating the fields. Each field has its own start position and length. Here is an example of the data:
520140914191193386---------7661705508623855646---1595852965---133437--the lazy fox jumping over-----------------------212.75.12.85---
While I used dashes (-) in the sample above, the actual file pads with spaces when a field value is shorter than its allotted width.
The schema in this case is:
UsedID (start position 1, length 27)
SystemID (start position 28, length 22)
SampleID (start position 50, length 13)
LineID (start position 63, length 8)
Text (start position 71, length 48)
IP (start position 119, length 15)
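As a quick sanity check on the schema, here is a minimal plain-Python sketch (not part of any Logstash setup) that pads each value to its schema width and slices it back out by position. It uses dash padding to mirror the sample above; the real file pads with spaces.

```python
# (field name, 1-based start position, width, sample value) from the schema above
schema = [
    ("UsedID",     1, 27, "520140914191193386"),
    ("SystemID",  28, 22, "7661705508623855646"),
    ("SampleID",  50, 13, "1595852965"),
    ("LineID",    63,  8, "133437"),
    ("Text",      71, 48, "the lazy fox jumping over"),
    ("IP",       119, 15, "212.75.12.85"),
]

# Build the 133-character line by left-justifying each value to its width
line = "".join(value.ljust(width, "-") for _, _, width, value in schema)

# Slice each field back out, using (start - 1) as the 0-based offset
fields = {name: line[start - 1:start - 1 + width].rstrip("-")
          for name, start, width, _ in schema}

print(fields["IP"])    # 212.75.12.85
print(fields["Text"])  # the lazy fox jumping over
```

The start positions and widths are consistent: they tile the line exactly (27 + 22 + 13 + 8 + 48 + 15 = 133 characters).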
Ideally, I would get the following field values in Logstash (without trailing spaces):
UsedID:520140914191193386
SystemID:7661705508623855646
SampleID:1595852965
LineID:133437
Text:the lazy fox jumping over
IP:212.75.12.85
How do I parse this kind of file with grok?
I'd go for a two-step process:
Since each field has a known length, you can match it with a length-bounded regex such as .{27}
In grok, you can name a field like so: (?<user_id>.{27})
You can test a full pattern in the grok debugger, but something like this should achieve a length-based split:
(?<user_id>.{27})(?<system_id>.{22})(?<sample_id>.{13})(?<line_id>.{8})(?<text>.{48})(?<ip>.{15})
You mentioned that your extra characters are all whitespace, so you can clean that up using the mutate filter with a strip option.
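As a sketch of what the grok match plus the strip step does, here is the equivalent in plain Python. Grok's Oniguruma named group (?<name>...) corresponds to Python's (?P<name>...), and the per-field strip mirrors the mutate filter:

```python
import re

# Same length-based pattern as the grok expression, in Python named-group syntax
pattern = re.compile(
    r"(?P<user_id>.{27})(?P<system_id>.{22})(?P<sample_id>.{13})"
    r"(?P<line_id>.{8})(?P<text>.{48})(?P<ip>.{15})"
)

# Rebuild the sample line with space padding, as in the real file
parts = [("520140914191193386", 27), ("7661705508623855646", 22),
         ("1595852965", 13), ("133437", 8),
         ("the lazy fox jumping over", 48), ("212.75.12.85", 15)]
line = "".join(value.ljust(width) for value, width in parts)

match = pattern.match(line)
# Strip the padding from every captured field, like mutate's strip option
fields = {name: value.strip() for name, value in match.groupdict().items()}

print(fields["ip"])    # 212.75.12.85
print(fields["text"])  # the lazy fox jumping over
```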
All together, that might look something like this:
filter {
  grok {
    match => ["message", "(?<user_id>.{27})(?<system_id>.{22})(?<sample_id>.{13})(?<line_id>.{8})(?<text>.{48})(?<ip>.{15})"]
  }
  mutate {
    strip => [
      "user_id",
      "system_id",
      "sample_id",
      "line_id",
      "text",
      "ip"
    ]
  }
}