I am new to Hadoop Streaming. I have a few filter conditions in my reduce code, and I would like to know how many records pass these conditions. I understand this can be done with custom counters. Can somebody show me how to write a custom counter?
I am emitting three columns in my mapper code, say a, b, c.
The key is a, and the value is a list like [b, c].
For example, a record from the mapper looks like 'I'^['C','P'].
Here is my reduce code.
import sys
import pandas as pd

labels = ["a", "b", "c"]
records = []
for line in sys.stdin:
    l = line.strip().split("^")
    key = l[0]
    value = l[1:]            # the remaining fields, e.g. ['C', 'P']
    record = [key] + value
    records.append(record)
df = pd.DataFrame.from_records(records, columns=labels)
df = df[(df['a'] == 'I') & (df['b'] == 'C')]
I would like to know how many records df contains at the reducer level.
Thank you.
Counters in Hadoop are used to keep track of occurrences of events. Whenever a job is executed, the Hadoop framework initializes counters to track job statistics such as the number of bytes read, the number of records read, the number of records written, etc.
The Hadoop framework is written in Java; however, Hadoop programs can be written in other languages. With Hadoop Streaming we can write MapReduce programs in Python without having to translate the code into Java jar files.
Hadoop Streaming supports almost any programming language, such as Python, C++, Ruby, or Perl. The streaming framework itself runs on Java, but the mapper and reducer code can be written in any of these languages, as mentioned above.
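For context, a streaming job is typically launched along these lines (a sketch only: the jar path and the script names `mapper.py` and `reducer.py` are illustrative and vary by installation and distribution):

```shell
# Launch a streaming job whose mapper and reducer are Python scripts.
# The -file options ship the scripts to the cluster nodes.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input  /user/me/input \
    -output /user/me/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```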
You can simply print to stderr (Python 2 syntax shown; in Python 3 use sys.stderr.write):
print >> sys.stderr, "reporter:counter:CUSTOM,NbRecords,1"
This increments the counter "NbRecords" in the counter group "CUSTOM" by 1. The expected line format is reporter:counter:&lt;group&gt;,&lt;counter&gt;,&lt;amount&gt;, with no extra spaces.