 

Keycloak oidc authentication issue on K8s having replica of application server

I am facing an issue with the authorization_code grant type on a replicated setup in a K8s cluster and am seeking advice and help. My setup is as follows:

  1. 1 instance of Keycloak server running on 1 pod on 1 node.
  2. 2 instances of the backend server running on 2 pods on 2 different nodes (say api1 and api2).

Basically, the problem is this: suppose api1 initiates a code verification challenge with Keycloak during the authentication workflow. After the user successfully authenticates with Keycloak using a valid username and password, Keycloak invokes the redirectURI of the backend server. However, the redirectURI, instead of hitting api1, hits the other backend instance, api2. Because of this, the session state of the Request object on api2 does not have the code_verifier property, so we are unable to call the /protocol/openid-connect/token API to get the access token.

What I am trying to achieve is either to have the redirectURI always hit the same backend instance that initiated the request, OR to find a way for the backend servers (api1 and api2) to share sessions so that, irrespective of which instance initiates the request, the session always holds the code_verifier value upon successful authentication with Keycloak. I know this is not a Keycloak-specific issue, but rather more of a K8s thing (I suppose), but if anyone has encountered this situation before and managed a proper resolution (without compromising HA), kindly share your knowledge here.

I tried to check whether I could attach a sticky session between Keycloak and the backend server so that the redirectURI always hits the same backend server that started the auth request, but unfortunately I couldn't find any leads, nor any similar problem posted in the community.

Any help or advice is much appreciated. Thanks

asked Oct 27 '25 by sjgo
2 Answers

Your API seems to be playing the role of a backend for frontend (BFF) for a browser-based app. This type of solution is not specific to either Keycloak or Kubernetes. Consider a login for a web app at a URL of https://www.product.com.

CODE FLOW MESSAGES

The code flow begins with a request like this, with the backend producing two random values, for state and code_verifier. It stores both in a temporary encrypted HTTP-only cookie, then forms an authorization request URL. The frontend then uses this to redirect the browser to the authorization server:

GET https://login.example.com/oauth/v2/authorize?
   client_id=my-web-client&
   redirect_uri=https://www.product.com/callback&
   response_type=code&
   scope=openid profile&
   code_challenge=WhmRaP18B9z2zk...&
   code_challenge_method=S256&
   state=CfDJ8Nxa-YhPzjpBilDQz2C...
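
As a rough sketch of that first step, here is how a Node/Express BFF might produce these values. The route path, cookie name, and the encryptCookie helper are illustrative assumptions, not part of the question; this is not a definitive implementation:

import crypto from 'crypto';
import express from 'express';

const app = express();

// Base64url-encode without padding, as PKCE values require
const base64url = (buf: Buffer) =>
  buf.toString('base64').replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');

app.get('/login/start', (req, res) => {
  // Two random values: one for CSRF protection, one for PKCE
  const state = base64url(crypto.randomBytes(32));
  const codeVerifier = base64url(crypto.randomBytes(32));
  const codeChallenge = base64url(crypto.createHash('sha256').update(codeVerifier).digest());

  // Store both in a temporary, encrypted, HTTP-only cookie
  // (encryptCookie is a placeholder, sketched further below)
  res.cookie('login', encryptCookie({ state, codeVerifier }), {
    httpOnly: true,
    secure: true,
    sameSite: 'lax',
    path: '/callback',
  });

  // Form the authorization request URL and return it,
  // so that the frontend can redirect the browser
  const authorizeUrl = new URL('https://login.example.com/oauth/v2/authorize');
  authorizeUrl.searchParams.set('client_id', 'my-web-client');
  authorizeUrl.searchParams.set('redirect_uri', 'https://www.product.com/callback');
  authorizeUrl.searchParams.set('response_type', 'code');
  authorizeUrl.searchParams.set('scope', 'openid profile');
  authorizeUrl.searchParams.set('code_challenge', codeChallenge);
  authorizeUrl.searchParams.set('code_challenge_method', 'S256');
  authorizeUrl.searchParams.set('state', state);
  res.json({ authorizationRequestUrl: authorizeUrl.toString() });
});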

The user signs in at the authorization server, which can be done in many possible ways. Then, a response is returned to the frontend, and the backend processes the URL.

GET https://www.product.com/callback?
  code=I9xL9DY9jAYHPuHSiW2OpWUaNRW4otei&
  state=CfDJ8Nxa-YhPzjpBilDQz2C...

At this point the BFF reads the temporary cookie and gets back the two random values generated earlier. It validates the response state, then sends an authorization code grant request, to exchange the code for tokens:

POST https://login.example.com/oauth/v2/token

client_id=my-web-client&
client_secret=***************&
code=I9xL9DY9jAYHPuHSiW2OpWUaNRW4otei&
grant_type=authorization_code&
redirect_uri=https://www.product.com/callback&
code_verifier=HlfffYlGy7SIX3pYHOMJfhnO5AhUW1eOIKfjR42ue28
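
Continuing the same illustrative Node/Express sketch, the callback handling and code exchange might look like this. Here decryptCookie is the counterpart placeholder, cookie-parser is used to read the cookie, and Node's built-in fetch is assumed:

import cookieParser from 'cookie-parser';

app.use(cookieParser());

app.get('/callback', async (req, res) => {
  // Recover the two random values from the temporary encrypted cookie
  const { state, codeVerifier } = decryptCookie(req.cookies['login']);

  // Validate the response state before using the authorization code
  if (req.query.state !== state) {
    res.status(400).send('Invalid state');
    return;
  }

  // Exchange the authorization code for tokens
  const tokenResponse = await fetch('https://login.example.com/oauth/v2/token', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({
      client_id: 'my-web-client',
      client_secret: process.env.CLIENT_SECRET ?? '',
      code: String(req.query.code),
      grant_type: 'authorization_code',
      redirect_uri: 'https://www.product.com/callback',
      code_verifier: codeVerifier,
    }),
  });
  const tokens = await tokenResponse.json();

  // Clear the temporary cookie; the BFF would then issue its own
  // HTTP-only session cookies before redirecting back to the app
  res.clearCookie('login', { path: '/callback' });
  res.redirect('/');
});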

COOKIE HANDLING

By using cookies, the BFF handles multiple concurrent callers. It is also stateless and easy to manage. There are no inconvenient hosting restrictions such as requiring sticky sessions, which could lead to availability problems if a server fails.

The only requirement is to deploy all instances of the BFF with the same cookie encryption key, so that instance api1 can decrypt cookies issued by instance api2. This might be a value generated as follows, and used with an encryption algorithm such as AES256-GCM:

openssl rand 32 | xxd -p -c 64
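
For illustration, one possible implementation of the encryptCookie / decryptCookie placeholders sketched above, using Node's crypto module with AES-256-GCM and that shared key. The cookie wire format here is an assumption:

import crypto from 'crypto';

// The same COOKIE_ENCRYPTION_KEY (64 hex characters) is deployed to every BFF instance
const key = Buffer.from(process.env.COOKIE_ENCRYPTION_KEY ?? '', 'hex');

export function encryptCookie(payload: object): string {
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(JSON.stringify(payload), 'utf8'), cipher.final()]);
  // Cookie value is iv.ciphertext.tag, each base64url-encoded
  return [iv, ciphertext, cipher.getAuthTag()].map(b => b.toString('base64url')).join('.');
}

export function decryptCookie(cookieValue: string): any {
  const [iv, ciphertext, tag] = cookieValue.split('.').map(p => Buffer.from(p, 'base64url'));
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  const plaintext = Buffer.concat([decipher.update(ciphertext), decipher.final()]);
  return JSON.parse(plaintext.toString('utf8'));
}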

COMPONENT ROLES

I'd recommend describing these components differently in an OAuth architecture:

  • APIs are usually resource servers, whose only security responsibilities are to validate access tokens and implement claims-based authorization (a minimal validation sketch follows this list).

  • BFFs act as OAuth clients, to serve browser-based apps and keep tokens out of the browser. The usual result is that the browser-based app sends each API request with an HTTP-only cookie credential. BFFs are not resource servers, but can be implemented as a utility API. They require a same-parent-domain relationship with the browser-based app in order for cookies to be considered first-party and not dropped because of browser cookie restrictions. E.g. the BFF might run in a domain such as https://api.product.com.
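
To illustrate the resource server role from the first bullet, a minimal token validation sketch, assuming JWT access tokens and the jose npm package. The JWKS URI, issuer, and audience values are placeholders:

import { createRemoteJWKSet, jwtVerify } from 'jose';

// Download and cache the authorization server's token signing keys
const jwks = createRemoteJWKSet(new URL('https://login.example.com/oauth/v2/jwks'));

// Express-style middleware: validate the access token,
// then make its claims available for claims-based authorization downstream
async function requireAccessToken(req: any, res: any, next: any) {
  try {
    const token = (req.headers.authorization ?? '').replace(/^Bearer /, '');
    const { payload } = await jwtVerify(token, jwks, {
      issuer: 'https://login.example.com/oauth/v2',
      audience: 'api.product.com',
    });
    req.claims = payload;
    next();
  } catch {
    res.status(401).send('Invalid or missing access token');
  }
}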

Please post back if I am misunderstanding anything about the question ...

answered Oct 29 '25 by Gary Archer

So for those who are facing this same issue, here's how I fixed it.

Since requests to the backend server come through an Ingress, I used the cookie-based session affinity provided by the Ingress. I used the Ingress-NGINX Controller for Kubernetes.

Below is the configuration I added to the ingress annotations in the Helm chart's values.yaml.

nginx.ingress.kubernetes.io/affinity: 'cookie'
nginx.ingress.kubernetes.io/session-cookie-path: '/'

And since the backend is fully stateless, except for the one moment during the OIDC authorization_code authentication stage where it stores the code_verifier value in the session, we didn't have to worry about the limitations of this approach (node restarts, container restarts, auto-scaling, resource starvation, new rollouts, load balancing), which would have impacted logged-in users if the backend had been maintaining the rest of the authentication state. We manage all of that through cookies, so even if pods are destroyed or replaced, the new pods can still handle the active sessions.

Here is a link to an article by Paul Dally from which I took some reference for the above problem.

answered Oct 29 '25 by sjgo