Ingest log files from edge nodes to Hadoop

I am looking for a way to stream entire log files from edge nodes to Hadoop. To sum up the use case:

  • We have applications that produce log files ranging from a few MB to hundreds of MB per file.
  • We do not want to stream all the log events as they occur.
  • What we are looking for is pushing the log files in their entirety once they have been written completely (written completely = moved into another folder, for example; detecting this is not a problem for us).
  • This should be handled by some kind of lightweight agent on the edge nodes that writes to HDFS directly or, if necessary, to an intermediate "sink" that pushes the data to HDFS afterwards.
  • Centralized Pipeline Management (= configuring all edge nodes in a centralized manner) would be great

I came up with the following evaluation:

  • Elastic's Logstash and Filebeat
    • Centralized pipeline management for edge nodes is available, e.g. one centralized configuration for all edge nodes (requires a license)
    • Configuration is easy, and a WebHDFS output sink exists for Logstash (using Filebeat would require an intermediate setup of Filebeat + Logstash, with Logstash writing to WebHDFS)
    • Both tools are proven to be stable in production-level environments
    • Both tools are built for tailing logs and streaming individual events as they occur, rather than ingesting a complete file
  • Apache NiFi w/ MiNiFi
    • The use case of collecting logs and sending entire files to another location, with a large number of edge nodes that all run the same "jobs", looks predestined for NiFi and MiNiFi
    • MiNiFi running on the edge node is lightweight (Logstash on the other hand is not so lightweight)
    • Logs can be streamed from MiNiFi agents to a NiFi cluster and then ingested into HDFS
    • Centralized pipeline management within the NiFi UI
    • Writing to an HDFS sink is available out of the box
    • Community looks active, development is led by Hortonworks (?)
    • We have had good experiences with NiFi in the past
  • Apache Flume
    • Writing to an HDFS sink is available out of the box
    • Flume looks like more of an event-based solution than a solution for streaming entire log files
    • No centralized pipeline management?
  • Apache Gobblin
    • Writing to an HDFS sink is available out of the box
    • No centralized pipeline management?
    • No lightweight edge node "agents"?
  • Fluentd
    • Maybe another tool to look at? Looking for your comments on this one...

I'd love to get some comments about which of the options to choose. The NiFi/MiNiFi option looks the most promising to me - and is free to use as well.

Have I forgotten any broadly used tool that is able to solve this use case?

asked Nov 08 '22 by j9dy


1 Answer

I have experienced similar pain when choosing open-source big data solutions, simply because there are so many roads to Rome. Although "asking for technology recommendations is off-topic for Stack Overflow", I still want to share my opinions.

  1. I assume you already have a Hadoop cluster to land the log files in. If you are using an enterprise-ready distribution, e.g. HDP, stay with its selection of data ingestion solutions. This approach always saves you a lot of effort in installation, setting up central management and monitoring, and implementing security and system integration whenever there is a new release.

  2. You didn't mention how you would like to use the log files once they land in HDFS. I assume you just want an exact copy, i.e. data cleansing or transformation to a normalized format is NOT required during ingestion. Now I wonder why you didn't mention the simplest approach: scheduled hdfs dfs -put commands that copy the log files from the edge node into HDFS (a minimal sketch follows after this list).

  3. Now I can share one production setup I was involved in. In that setup, log files are pushed to or pulled by a commercial mediation system that does data cleansing, normalization, enrichment, etc. The data volume is above 100 billion log records per day. There are 6 edge nodes behind a load balancer. Logs first land on one of the edge nodes and are then put into HDFS with the hdfs command. Flume was used initially but was replaced by this approach due to performance issues (it may very well be that the engineers lacked experience in optimizing Flume). Worth mentioning, though: the mediation system has a management UI for scheduling the ingestion scripts. In your case, I would start with a cron job for the PoC and then move to e.g. Airflow (an Airflow sketch also follows after this list).
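
To make the scheduled-put idea from point 2 concrete, here is a minimal sketch of a push script that could run on each edge node. It assumes the hdfs client is installed on the edge node and that your applications move finished files into a "done" folder; all paths and names below are made up for illustration.

    #!/usr/bin/env python3
    """Hypothetical edge-node push script: upload completed log files to HDFS."""
    import subprocess
    from pathlib import Path

    DONE_DIR = Path("/var/log/myapp/done")          # files that are fully written
    UPLOADED_DIR = Path("/var/log/myapp/uploaded")  # local archive after upload
    HDFS_TARGET = "/data/raw/myapp/"                # HDFS landing directory

    def push_completed_files() -> None:
        UPLOADED_DIR.mkdir(parents=True, exist_ok=True)
        for log_file in sorted(DONE_DIR.glob("*.log")):
            # hdfs dfs -put fails if the target already exists, which protects
            # against accidentally ingesting the same file name twice.
            result = subprocess.run(
                ["hdfs", "dfs", "-put", str(log_file), HDFS_TARGET],
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                # Move the local copy aside so the next run skips it.
                log_file.rename(UPLOADED_DIR / log_file.name)
            else:
                print(f"upload failed for {log_file}: {result.stderr.strip()}")

    if __name__ == "__main__":
        push_completed_files()

For a PoC, a crontab entry such as */10 * * * * python3 /usr/local/bin/push_logs.py on each edge node would be enough.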
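
And here is a minimal sketch of how the same push could later be scheduled with Airflow instead of cron, as suggested in point 3. The DAG id, schedule, and script path are hypothetical.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Hypothetical DAG: run the edge-node push script every 15 minutes.
    with DAG(
        dag_id="push_logs_to_hdfs",
        start_date=datetime(2023, 1, 1),
        schedule_interval="*/15 * * * *",
        catchup=False,
    ) as dag:
        push_logs = BashOperator(
            task_id="hdfs_put_completed_logs",
            # Same command the cron job would run; Airflow adds logging and
            # a central place to manage the schedule and retry settings.
            bash_command="python3 /usr/local/bin/push_logs.py",
        )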

Hope this helps! I would be glad to hear about your final choice and your implementation.

answered Dec 10 '22 by wyc