Portkey provides a robust and secure gateway for integrating Large Language Models (LLMs) and embedding models, including those on Google Vertex AI, into your apps.
With Portkey, you can take advantage of features like fast AI gateway access, observability, and prompt management, while managing your Vertex AI credentials securely through a virtual key system.
To integrate Vertex AI with Portkey, you’ll need your Vertex Project ID or Service Account JSON, plus your Vertex Region. You use these to set up the virtual key.
import Portkey from 'portkey-ai'

const portkey = new Portkey({
    apiKey: "PORTKEY_API_KEY", // defaults to process.env["PORTKEY_API_KEY"]
    virtualKey: "VERTEX_VIRTUAL_KEY", // Your Vertex AI Virtual Key
})
from portkey_ai import Portkey

portkey = Portkey(
    api_key="PORTKEY_API_KEY",  # Replace with your Portkey API key
    virtual_key="VERTEX_VIRTUAL_KEY"  # Replace with your virtual key for Google
)
If you do not want to add your Vertex AI details to Portkey vault, you can directly pass them while instantiating the Portkey client. More on that here.
3. Invoke Chat Completions with Vertex AI and Gemini
Use the Portkey instance to send requests to Gemini models hosted on Vertex AI. You can also override the virtual key directly in the API call if needed.
Vertex AI authenticates requests with OAuth2, so you also need to send an access token along with each request.
const chatCompletion = await portkey.chat.completions.create({
    messages: [{ role: 'user', content: 'Say this is a test' }],
    model: 'gemini-1.5-pro-latest',
}, {Authorization: "Bearer $YOUR_VERTEX_ACCESS_TOKEN"});

console.log(chatCompletion.choices);
completion = portkey.with_options(Authorization="Bearer $YOUR_VERTEX_ACCESS_TOKEN").chat.completions.create(
    messages=[{ "role": 'user', "content": 'Say this is a test' }],
    model='gemini-1.5-pro-latest'
)

print(completion)
To use Anthropic models on Vertex AI, prepend anthropic. to the model name.
Example: anthropic.claude-3-5-sonnet@20240620
Similarly, for Meta models, prepend meta. to the model name.
Example: meta.llama-3-8b-8192
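The prefix rules above can be sketched as a small helper. This is only an illustration: the function name and the prefix table are hypothetical and not part of the Portkey SDK, and it covers only the two publishers named above.

```python
from typing import Optional

# Hypothetical lookup table: only the two prefixes described above.
PUBLISHER_PREFIXES = {"anthropic": "anthropic.", "meta": "meta."}


def vertex_model_name(model: str, publisher: Optional[str] = None) -> str:
    """Return the model name with the publisher prefix prepended."""
    if publisher is None:
        # No publisher given: pass the name through unchanged,
        # as with 'gemini-1.5-pro-latest' in the earlier example.
        return model
    prefix = PUBLISHER_PREFIXES[publisher]
    # Avoid double-prefixing if the caller already added the prefix.
    return model if model.startswith(prefix) else prefix + model


print(vertex_model_name("claude-3-5-sonnet@20240620", "anthropic"))
print(vertex_model_name("llama-3-8b-8192", "meta"))
```

The returned string is what you would pass as the model argument in the chat completion calls shown earlier.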
Using Self-Deployed Models on Vertex AI (Hugging Face, Custom Models)
Portkey supports connecting to self-deployed models on Vertex AI, including models from Hugging Face or any custom models you’ve deployed to a Vertex AI endpoint.
Requirements for Self-Deployed Models
To use self-deployed models on Vertex AI through Portkey:
Model Naming Convention: When making requests to your self-deployed model, you must prefix the model name with endpoints.
endpoints.my_endpoint_name
Required Permissions: The Google Cloud service account used in your Portkey virtual key must have the aiplatform.endpoints.predict permission.
const chatCompletion = await portkey.chat.completions.create({
    messages: [{ role: 'user', content: 'Say this is a test' }],
    model: 'endpoints.my_custom_llm', // Notice the 'endpoints.' prefix
}, {Authorization: "Bearer $YOUR_VERTEX_ACCESS_TOKEN"});

console.log(chatCompletion.choices);
completion = portkey.with_options(Authorization="Bearer $YOUR_VERTEX_ACCESS_TOKEN").chat.completions.create(
    messages=[{ "role": 'user', "content": 'Say this is a test' }],
    model='endpoints.my_huggingface_model'  # Notice the 'endpoints.' prefix
)

print(completion)
Why the prefix? Vertex AI’s product offering for self-deployed models is called “Endpoints.” This naming convention indicates to Portkey that it should route requests to your custom endpoint rather than a standard Vertex AI model.
This approach works for all models you can self-deploy on Vertex AI Model Garden, including Hugging Face models and your own custom models.
The assistant’s thinking response is returned in the response_chunk.choices[0].delta.content_blocks array, not in the response.choices[0].message.content string.
Gemini models do not support passing the reasoning back into multi-turn conversations, so you don’t need to send the thinking message back to the model.
Models like google.gemini-2.5-flash-preview-04-17 and anthropic.claude-3-7-sonnet@20250219 support extended thinking.
This is similar to OpenAI thinking, but you also get the model’s reasoning as it processes the request.
from portkey_ai import Portkey

# Initialize the Portkey client
portkey = Portkey(
    api_key="PORTKEY_API_KEY",  # Replace with your Portkey API key
    virtual_key="VIRTUAL_KEY",  # Add your provider's virtual key
    strict_open_ai_compliance=False
)

# Create the request
response = portkey.chat.completions.create(
    model="anthropic.claude-3-7-sonnet@20250219",
    max_tokens=3000,
    thinking={
        "type": "enabled",
        "budget_tokens": 2030
    },
    stream=True,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "when does the flight from new york to bengaluru land tomorrow, what time, what is its flight number, and what is its baggage belt?"
                }
            ]
        }
    ]
)
print(response)

# in case of streaming responses you'd have to parse the response_chunk.choices[0].delta.content_blocks array
# response = portkey.chat.completions.create(
#     ...same config as above but with stream: true
# )
# for chunk in response:
#     if chunk.choices[0].delta:
#         content_blocks = chunk.choices[0].delta.get("content_blocks")
#         if content_blocks is not None:
#             for content_block in content_blocks:
#                 print(content_block)
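To separate thinking from answer text while streaming, one option is to accumulate the content blocks as they arrive. The sketch below runs on plain dictionaries rather than a live stream. The shape of each block (a nested "delta" holding either a "thinking" or a "text" key) is an assumption for illustration, as is the helper name; inspect the printed blocks from your own stream before relying on it.

```python
from typing import Iterable, Tuple


def collect_content_blocks(deltas: Iterable[dict]) -> Tuple[str, str]:
    """Concatenate thinking and text fragments from streamed deltas.

    Each item stands in for chunk.choices[0].delta. The block layout
    used here is an assumed shape, not a documented guarantee.
    """
    thinking_parts, text_parts = [], []
    for delta in deltas:
        # A delta may have no content_blocks at all (or None).
        for block in delta.get("content_blocks") or []:
            inner = block.get("delta", {})
            if "thinking" in inner:
                thinking_parts.append(inner["thinking"])
            elif "text" in inner:
                text_parts.append(inner["text"])
    return "".join(thinking_parts), "".join(text_parts)


# Simulated deltas in the assumed shape, in place of a real stream.
sample_deltas = [
    {"content_blocks": [{"index": 0, "delta": {"thinking": "The user wants "}}]},
    {"content_blocks": [{"index": 0, "delta": {"thinking": "flight details."}}]},
    {"content_blocks": None},
    {"content_blocks": [{"index": 1, "delta": {"text": "I can't look that up."}}]},
]

thinking, answer = collect_content_blocks(sample_deltas)
print("thinking:", thinking)
print("answer:", answer)
```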
To disable thinking for Gemini models like google.gemini-2.5-flash-preview-04-17, you must explicitly set budget_tokens to 0.
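As a minimal sketch of that setting, the helper below builds the thinking parameter for a request. The helper name is hypothetical, and keeping "type": "enabled" alongside a zero budget is an assumption carried over from the example above. The text only states that budget_tokens must be 0.

```python
def thinking_param(budget_tokens: int) -> dict:
    """Build a thinking config; a budget of 0 is meant to disable thinking."""
    if budget_tokens < 0:
        raise ValueError("budget_tokens must be non-negative")
    # Assumption: the "type" value mirrors the earlier example.
    return {"type": "enabled", "budget_tokens": budget_tokens}


# Request arguments only; no network call is made in this sketch.
request_kwargs = {
    "model": "google.gemini-2.5-flash-preview-04-17",
    "thinking": thinking_param(0),
    "messages": [{"role": "user", "content": "Say this is a test"}],
}
print(request_kwargs["thinking"])
```

These arguments could then be unpacked into portkey.chat.completions.create(**request_kwargs).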
from portkey_ai import Portkey

# Initialize the Portkey client
portkey = Portkey(
    api_key="PORTKEY_API_KEY",  # Replace with your Portkey API key
    virtual_key="VIRTUAL_KEY",  # Add your provider's virtual key
    strict_open_ai_compliance=False
)

# Create the request
response = portkey.chat.completions.create(
    model="anthropic.claude-3-7-sonnet@20250219",
    max_tokens=3000,
    thinking={
        "type": "enabled",
        "budget_tokens": 2030
    },
    stream=True,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "when does the flight from baroda to bangalore land tomorrow, what time, what is its flight number, and what is its baggage belt?"
                }
            ]
        },
        {
            "role": "assistant",
            "content": [
                {
                    "type": "thinking",
                    "thinking": "The user is asking several questions about a flight from Baroda (also known as Vadodara) to Bangalore:\n1. When does the flight land tomorrow\n2. What time does it land\n3. What is the flight number\n4. What is the baggage belt number at the arrival airport\n\nTo properly answer these questions, I would need access to airline flight schedules and airport information systems. However, I don't have:\n- Real-time or scheduled flight information\n- Access to airport baggage claim allocation systems\n- Information about specific flights between these cities\n- The ability to look up tomorrow's specific flight schedules\n\nThis question requires current, specific flight information that I don't have access to. Instead of guessing or providing potentially incorrect information, I should explain this limitation and suggest ways the user could find this information.",
                    "signature": "EqoBCkgIARABGAIiQBVA7FBNLRtWarDSy9TAjwtOpcTSYHJ+2GYEoaorq3V+d3eapde04bvEfykD/66xZXjJ5yyqogJ8DEkNMotspRsSDKzuUJ9FKhSNt/3PdxoMaFZuH+1z1aLF8OeQIjCrA1+T2lsErrbgrve6eDWeMvP+1sqVqv/JcIn1jOmuzrPi2tNz5M0oqkOO9txJf7QqEPPw6RG3JLO2h7nV1BMN6wE="
                }
            ]
        },
        {
            "role": "user",
            "content": "thanks that's good to know, how about to chennai?"
        }
    ]
)
print(response)
This same message format also works for all other media types. Send your media file in the url field, for example "url": "gs://cloud-samples-data/video/animals.mp4" for Google Cloud URLs and "url": "https://download.samplelib.com/mp3/sample-3s.mp3" for public URLs.
Your URL should include the file extension. It is used to infer the MIME_TYPE, which is a required parameter for prompting Gemini models with files.
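To see why the extension matters, the sketch below infers a MIME type from a URL using Python's standard mimetypes module. It illustrates the general idea only and is not Portkey's actual inference logic; the function name is hypothetical.

```python
import mimetypes
import posixpath
from typing import Optional
from urllib.parse import urlparse


def infer_mime_type(url: str) -> Optional[str]:
    """Guess a MIME type from the file extension in a URL path."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1]
    if not extension:
        # Without an extension there is nothing to infer from.
        return None
    mime_type, _ = mimetypes.guess_type("file" + extension)
    return mime_type


for url in (
    "gs://cloud-samples-data/video/animals.mp4",
    "https://download.samplelib.com/mp3/sample-3s.mp3",
    "https://example.com/media/clip",  # hypothetical URL with no extension
):
    print(url, "->", infer_mime_type(url))
```

A check like this can run client-side before a request is sent, so that URLs without a recognizable extension are caught early.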
You can manage all prompts to Google Gemini in the Prompt Library. All the models in the model garden are supported and you can easily start testing different prompts.
Once you’re ready with your prompt, you can use the portkey.prompts.completions.create interface to use the prompt in your application.
Vertex AI supports grounding with Google Search. This is a feature that allows you to ground your LLM responses with real-time search results.
Grounding is invoked by passing either the google_search tool (for newer models like gemini-2.0-flash-001) or the google_search_retrieval tool (for older models like gemini-1.5-flash) in the tools array.
"tools": [
    {
        "type": "function",
        "function": {
            "name": "google_search" // or google_search_retrieval for older models
        }
    }
]
If you mix regular tools with grounding tools, Vertex AI might throw an error saying only one tool can be used at a time.
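The tool selection and the mixing caveat can be sketched as two small helpers. Both function names are hypothetical. Treating every model whose name starts with gemini-1. as "older" is an assumption for illustration, since the text gives only one example on each side.

```python
from typing import List

GROUNDING_TOOL_NAMES = {"google_search", "google_search_retrieval"}


def grounding_tool(model: str) -> dict:
    """Return a grounding tool entry in the shape shown above."""
    # Assumption: 'gemini-1.*' models are the "older" ones.
    older = model.startswith("gemini-1.")
    name = "google_search_retrieval" if older else "google_search"
    return {"type": "function", "function": {"name": name}}


def mixes_grounding_and_regular(tools: List[dict]) -> bool:
    """True if the list holds both grounding and regular tools."""
    names = [tool["function"]["name"] for tool in tools]
    has_grounding = any(name in GROUNDING_TOOL_NAMES for name in names)
    has_regular = any(name not in GROUNDING_TOOL_NAMES for name in names)
    return has_grounding and has_regular


tools = [grounding_tool("gemini-2.0-flash-001")]
print(tools)
print(mixes_grounding_and_regular(tools))
```

Checking the tools list before sending a request lets you catch a mixed list locally instead of waiting for an error from Vertex AI.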
gemini-2.0-flash-thinking-exp and other thinking/reasoning models
gemini-2.0-flash-thinking-exp models return a Chain of Thought response along with the actual inference text. This is not OpenAI compatible; however, Portkey supports it by joining the two responses with a \r\n\r\n separator.
You can split the response along this pattern to get the Chain of Thought response and the actual inference text.
If you require the Chain of Thought response along with the actual inference text, set the strict_open_ai_compliance flag to false in the request.
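Splitting on that pattern might look like the sketch below. It assumes the Chain of Thought comes first and the inference text second, and it splits only on the first occurrence of the separator so that blank lines inside the answer are preserved. The function name is hypothetical.

```python
from typing import Optional, Tuple

SEPARATOR = "\r\n\r\n"


def split_chain_of_thought(content: str) -> Tuple[Optional[str], str]:
    """Split combined content into (chain_of_thought, answer).

    Returns (None, content) when the separator is absent.
    """
    thought, separator, answer = content.partition(SEPARATOR)
    if not separator:
        return None, content
    return thought, answer


combined = "First, recall the capital of France.\r\n\r\nThe capital is Paris."
thought, answer = split_chain_of_thought(combined)
print("thought:", thought)
print("answer:", answer)
```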
You can also pass your Vertex AI details & secrets directly without using the Virtual Keys in Portkey.
Vertex AI expects a region, a project ID and the access token in the request for a successful completion request. This is how you can specify these fields directly in your requests:
When selecting Service Account File as your authentication method, you’ll need to:
Upload your Google Cloud service account JSON file
Specify the Vertex Region
This method is particularly important for using self-deployed models, as your service account must have the aiplatform.endpoints.predict permission to access custom endpoints.
Learn more about permissions on your Vertex IAM key here.
For Self-Deployed Models: Your service account must have the aiplatform.endpoints.predict permission in Google Cloud IAM. Without this specific permission, requests to custom endpoints will fail.