I have a Firebase Cloud Function that is triggered by an onCreate
at a path in my Realtime Database. Yesterday, after doing some basic testing, I began to get "quota exceeded" errors in the Cloud Functions log. What was troubling, though, was that the error kept coming: first every 2 seconds or so, then ramping up to every 8 seconds or so, for about 1.5 hours.
Here's a small segment of the trouble:
I took a look at the documentation covering Retrying Asynchronous Functions and it seems clear that in most cases a function will stop executing and the event will be discarded if an error occurs. The triggering event did not seem to be getting cleared in my case, perhaps because the error I was getting was coming from outside of my function. Here's the full error text:
Error: quota exceeded (DNS resolutions : per 100 seconds); to increase quotas, enable billing in your project at https://console.cloud.google.com/billing?project=myproject. Function cannot be executed.
Yes, I realize that a quota issue is easily solved by moving to the Blaze plan, which I will be doing soon, but before I do, I'd like to understand how to handle this case should it happen in the future. I had to re-deploy my functions to get the error to stop happening. After studying the docs, it sounds as though I also could have deleted the function to stop the error, but neither re-deploying nor deleting functions seems like a great path to take once my app reaches production.
My thought is that I must have been stuck in a retry loop internal to the Firebase Cloud Functions service. None of my logging was being hit (some of which would happen on function start, and some during a function run), plus the message says "Function cannot be executed", which to me means that the error happened before my code was ever executed. Another function triggered by the same event was logging an identical error.
So, to anyone in the know, my questions are as follows:

1. For cases where functions seem to be stuck in a loop, is redeployment or deletion the ONLY path to recovery? What other possible approaches might I have taken?
2. Is there a way I can handle an Error that is coming from outside my implementation? Can an Error like this be raised by the service, or do Errors ONLY originate in my code?
Finally, a few further notes to address follow-up questions some may have:
Yes, the repeating Error did prevent any new invocations of the function in question from taking place. Other functions could still be executed, except for the other one that is triggered by the same event.
Yes, I can recreate the behavior to an extent. If I set the DNS Resolutions quota per 100 seconds to a low value (5, for example), and do a few executions, I get the same error, repeating every 8 seconds or so. Oddly though, when I recreate in this manner, it recovers on its own after about 10-20 throws, seemingly around the time the 100 seconds has elapsed, which makes sense. In the case of the original incident, the errors repeated for more than an hour, at which point I decided to re-deploy, stopping them.
Yes, I do see the most recent "Function execution took ### ms, finished with status 'ok'" message in my logs before the Errors start rolling in. After the Errors are done repeating, I do see the last-in triggering data get processed by the function successfully. This is what caused me to wonder if it was the triggering event that was causing Cloud Functions to keep trying.
No, my function does not write to a location that would retrigger itself.
Yes, I realize I could easily resolve this by moving to the Blaze plan to get a much higher quota. I will be doing that. First though, I'd like to understand the mechanics of what has gone wrong so I can make any available improvements.
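To make the "does not retrigger itself" note above concrete, here is a minimal sketch of the check involved. The helper `wouldRetrigger` is hypothetical (not part of any Firebase SDK); it just illustrates that a write retriggers an onCreate handler only when the write path falls under the handler's trigger path.

```javascript
// Hypothetical helper (not a Firebase SDK function): returns true if a
// write at writePath lands under triggerPath, which would fire an
// onCreate trigger registered at triggerPath all over again.
function wouldRetrigger(triggerPath, writePath) {
    // Normalize away trailing slashes, then test prefix containment.
    const norm = p => p.replace(/\/+$/, "");
    return (norm(writePath) + "/").startsWith(norm(triggerPath) + "/");
}

// The function in question listens at /requests/{pushKey} but writes to
// /responses/{pushKey}, so it cannot loop on its own output:
console.log(wouldRetrigger("/requests", "/responses/abc123"));      // false
console.log(wouldRetrigger("/requests", "/requests/abc123/echo"));  // true
```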
After going back and forth with Firebase support (who were excellent, btw) for the past month, I finally have some answers to my questions. I thought I'd go ahead and answer my own question, hopefully to benefit others who may encounter a similar issue.
First, an important tidbit that I've learned along the way:
In the case of quota errors, the triggering event of an asynchronous Firebase Cloud Function is NOT removed from the event queue. This is different behaviour than other types of errors which DO remove the triggering event from the queue. According to the Firebase engineers, this approach to handling quota errors is by design, primarily to address the case where it is a per 100 second quota that is being hit. In these cases, it may often make sense to retry as a successful run may be < 100 seconds away.
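One practical consequence of this design: an event held in the queue during a quota error will eventually be delivered, so a handler should tolerate redelivery. A sketch of the idea, using a plain object as a stand-in for the /responses/ path (the names here are illustrative, not from the Firebase API):

```javascript
// Stand-in for the /responses/ location in the Realtime Database.
const responses = {};

// Writing with a set() keyed by the event's pushKey is naturally
// idempotent: if the same queued event is delivered more than once,
// the second write simply overwrites the first with identical data.
function handleRequest(pushKey, value) {
    responses[pushKey] = { name: value.name };
    return responses[pushKey];
}

handleRequest("abc123", { name: "Ford Prefect" });
handleRequest("abc123", { name: "Ford Prefect" }); // redelivery: no harm
console.log(Object.keys(responses).length); // 1
```

Appending with push() instead of set() would not have this property: a redelivered event would create a duplicate response.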
So, to answer my specific questions...
For cases where functions seem to be stuck in a loop, is redeployment or deletion the ONLY path to recovery? What other possible approaches might I have taken?
Redeployment or deletion is the only option. If a function is stuck in an Error loop, redeploying the function will clear the event queue, resolving the issue. I was able to test this resolution approach, and it worked just fine. The Firebase engineers were also able to confirm that a partial deploy is perfectly acceptable if only a subset of the functions are encountering quota errors.
This resolution approach strikes me as being a bit kludgy, so hopefully future improvements to the Cloud Functions platform will provide an option for developers to manually clear a function's event queue. Fingers crossed.
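For reference, the Firebase CLI supports both recovery paths directly, so neither requires redeploying the whole project (the function name here matches the MCVE below):

```shell
# Partial deploy: redeploy only the stuck function. Other deployed
# functions are left untouched, and the stuck function's pending
# event queue is cleared.
firebase deploy --only functions:helloWorld

# Or, as a last resort, delete the function outright (also clears
# its event queue).
firebase functions:delete helloWorld
```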
Is there a way I can handle an Error that is coming from outside my implementation? Can an Error like this be raised by the service or do Errors ONLY originate in my code?
Basically, NO. The quota errors originate on the service, not in the deployed code, so they can't be caught and handled in code. For cases like mine, the best readily available option is to make use of the platform's excellent error reporting features to identify issues early and take corrective action.
Some other notes...
For anyone wanting to see this behaviour themselves, here's a copy of the MCVE I provided to Firebase Support which can be used to replicate the behaviour.
index.js
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp(functions.config().firebase);
const database = admin.database();

exports.helloWorld = functions.database.ref("/requests/{pushKey}")
    .onCreate(event => {
        console.log("helloWorld: Triggered with pushKey: ", event.params.pushKey);

        // Say hello
        console.log("Hello, world! Don't panic!");

        let result = {};
        result["name"] = event.data.val().name;

        return database.ref(`/responses/${event.params.pushKey}/`).set(result)
            .then(() => {
                console.log(`Response written to: ${event.params.pushKey}/`);
            })
            .catch((writeError) => {
                console.error(`Response write failed at: ${event.params.pushKey}/`, writeError);
            });
    });
package.json
{
  "name": "functions",
  "description": "Cloud Functions for Firebase",
  "scripts": {
    "lint": "./node_modules/.bin/eslint .",
    "serve": "firebase serve --only functions",
    "shell": "firebase experimental:functions:shell",
    "start": "npm run shell",
    "deploy": "firebase deploy --only functions",
    "logs": "firebase functions:log"
  },
  "dependencies": {
    "firebase-admin": "~5.8.1",
    "firebase-functions": "^0.8.1"
  },
  "devDependencies": {
    "eslint": "^4.12.0",
    "eslint-plugin-promise": "^3.6.0"
  },
  "private": true
}
Input data
{
  "requests" : [ null, {
    "name" : "Ford Prefect"
  }, {
    "name" : "Arthur Dent"
  }, {
    "name" : "Trillian"
  }, {
    "name" : "Zaphod Beeblebrox"
  }, {
    "name" : "Marvin"
  }, {
    "name" : "Slartibartfast"
  }, {
    "name" : "Deep Thought"
  }, {
    "name" : "Zarniwoop"
  }, {
    "name" : "Fenchurch"
  }, {
    "name" : "Babel Fish"
  }, {
    "name" : "Tricia McMillan"
  } ]
}
To test, just use the GCP Console to turn down the quota limit for function invocations per 100 seconds for your Firebase project. Set it to a low value, like 4. Then import the test data and watch the Errors roll in. Several invocations will succeed (not always 4 of them), then a batch of repeating errors will come in until the 100 second quota rolls over, then you'll get more successful results. Interestingly, Google doesn't seem to use a FIFO queue for the events, so if you enter the records manually, don't expect them to necessarily be processed in the order they were entered.
All said, under a paid plan (which I've already upgraded to), the quotas are quite high so it's unlikely I'll crash into this issue again. However, it was still good to learn about how quota errors are handled in case I need to engineer items that may hit a quota in the future.
Thanks again to the great folks at Firebase Support for all the help.