Get Multimodal Embeddings
Generates semantic embeddings for text, images, or both using Google Vertex AI's multimodal embedding models for cross-modal search and similarity.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
info
If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- Connection Id - Vertex AI client session identifier from Connect node (optional if credentials provided directly).
- Credentials - Google Cloud service account credentials (optional if using Connection ID).
- Project Id - Google Cloud Project ID (required if using direct credentials).
- Text - Text content to generate embeddings for (optional if image provided).
- Target Image Path - Local file path to the image for embedding generation (optional if text provided).
Options
Model Configuration
- Model - Multimodal embedding model to use. Default is "multimodalembedding@001".
Endpoint Configuration
- Locations - Google Cloud region for the Vertex AI endpoint. Default is "us-central1".
- Publishers - Model publisher (typically "google"). Default is "google".
Output
- Response - Full API response object containing embedding vectors.
Response structure:
```json
{
  "predictions": [
    {
      "imageEmbedding": [0.123, -0.456, 0.789, ...],
      "textEmbedding": [0.234, -0.567, 0.890, ...]
    }
  ]
}
```
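For reference, a minimal sketch of pulling the vectors out of the Response object; the sample values are placeholders and real vectors contain many more values:

```javascript
// Illustrative response shaped like the structure above; real vectors are much longer.
const response = {
  predictions: [
    { imageEmbedding: [0.123, -0.456, 0.789], textEmbedding: [0.234, -0.567, 0.890] },
  ],
};

const prediction = response.predictions[0];
const imageVector = prediction.imageEmbedding || null; // present when an image was supplied
const textVector = prediction.textEmbedding || null;   // present when text was supplied
console.log((imageVector || []).length, (textVector || []).length); // vector dimensionality
```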
How It Works
The Get Multimodal Embeddings node generates embeddings that can represent text, images, or both in a unified vector space. When executed:
- Validates connection (either via Connection ID or direct credentials)
- Retrieves authentication token and project ID
- Validates that at least one input (text or image) is provided
- If an image path is provided:
  - Reads the image file from the local filesystem
  - Encodes the image as a base64 string
- Constructs the request payload based on the inputs:
  - Text + Image: both embeddings in a unified space
  - Image only: image embedding only
  - Text only: returns an error (use the Get Text Embeddings node instead)
- Sends a POST request to the Vertex AI predict endpoint
- Processes the response and returns the complete response object with the embedding vectors
The multimodal model projects text and images into the same vector space, enabling cross-modal similarity search (e.g., find images similar to a text query).
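For context, a raw REST call equivalent to what the node performs might look like the sketch below (Node.js 18+ for the built-in fetch). The endpoint and payload shape follow the publicly documented Vertex AI prediction API for multimodalembedding@001; PROJECT_ID, LOCATION, and ACCESS_TOKEN are placeholders, and in practice the node assembles and sends this request for you.

```javascript
// Sketch of the equivalent raw predict call; placeholders must be replaced with real values.
const fs = require("fs");

async function getMultimodalEmbedding(text, imagePath) {
  const PROJECT_ID = "my-project";                 // placeholder: your Google Cloud project ID
  const LOCATION = "us-central1";                  // matches the Locations option
  const ACCESS_TOKEN = "REPLACE_WITH_ACCESS_TOKEN"; // placeholder: token from the service account

  // Build the instance from whichever inputs are present
  const instance = {};
  if (text) instance.text = text;
  if (imagePath) {
    // Read the image and base64-encode it, as described in the steps above
    instance.image = { bytesBase64Encoded: fs.readFileSync(imagePath).toString("base64") };
  }

  const url =
    `https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}` +
    `/locations/${LOCATION}/publishers/google/models/multimodalembedding@001:predict`;

  const res = await fetch(url, {
    method: "POST",
    headers: { Authorization: `Bearer ${ACCESS_TOKEN}`, "Content-Type": "application/json" },
    body: JSON.stringify({ instances: [instance] }),
  });
  return res.json(); // { predictions: [{ imageEmbedding, textEmbedding }] }
}
```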
Requirements
- Either:
  - Connection ID from Connect node, OR
  - Direct credentials + Project ID
- At least one of:
  - Text input, OR
  - Valid image file path
- Image requirements (if provided):
  - Supported formats: JPEG, PNG, GIF, BMP, WebP
  - File must be accessible on local filesystem
  - Recommended max size: 5MB
- Vertex AI API enabled in Google Cloud project
- IAM permissions:
  - aiplatform.endpoints.predict
Error Handling
Common errors and solutions:
| Error | Cause | Solution |
|---|---|---|
| ErrInvalidArg | Both text and image empty | Provide at least one input (text or image) |
| ErrEmptyArg | Text provided without image | Use Get Text Embeddings node for text-only |
| File not found | Invalid image path | Verify image file path exists and is accessible |
| ErrInvalidArg | Connection ID or credentials missing | Use Connect node or provide credentials |
| ErrNotFound | Connection not found | Verify Connection ID from Connect node |
| ErrStatus | API error (quota, permissions) | Check Google Cloud Console for API status |
| Invalid image format | Unsupported image type | Convert to JPEG, PNG, GIF, BMP, or WebP |
Example Use Cases
Image-to-Text Search
Scenario: Search product descriptions by uploading a photo
1. Connect to Vertex AI
2. Pre-generate text embeddings for all product descriptions
3. User uploads product photo
4. Get Multimodal Embeddings (image only)
5. Compare image embedding with stored text embeddings
6. Return products with highest similarity
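A minimal ranking sketch for steps 5 and 6, assuming the catalog's text embeddings are already stored and the query image embedding has just been generated (the short vectors shown are illustrative):

```javascript
// Rank stored product-description embeddings against a query image embedding.
const catalog = [
  { sku: "A1", textEmbedding: [0.1, 0.7, 0.2] },
  { sku: "B2", textEmbedding: [0.6, 0.1, 0.3] },
];
const queryImageEmbedding = [0.2, 0.6, 0.1];

// Dot product works as cosine similarity because the returned embeddings are normalized
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

const ranked = catalog
  .map(p => ({ sku: p.sku, score: dot(queryImageEmbedding, p.textEmbedding) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked.slice(0, 3)); // products with the highest similarity
```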
Visual Product Catalog
Use Case: Find similar products by image + description
Setup:
1. For each product:
   - Text: Product title + description
   - Image: Product photo
   - Generate multimodal embedding
   - Store in vector database
Search:
1. User provides text query or image
2. Generate embedding for query
3. Find nearest neighbors in catalog
4. Return similar products
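A compact sketch of the setup phase above, assuming a hypothetical getMultimodalEmbedding helper that wraps the node call and returns the Response object described earlier:

```javascript
// Build one index record per product and keep both returned vectors with the metadata.
// getMultimodalEmbedding is a hypothetical stand-in for invoking this node.
async function buildCatalogIndex(products, getMultimodalEmbedding) {
  const index = [];
  for (const p of products) {
    const response = await getMultimodalEmbedding(`${p.title}. ${p.description}`, p.imagePath);
    const pred = response.predictions[0];
    index.push({ sku: p.sku, textEmbedding: pred.textEmbedding, imageEmbedding: pred.imageEmbedding });
  }
  return index; // in practice, write these records to a vector database
}
```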
Content Moderation
Process: Flag inappropriate image-text pairs
1. Get Multimodal Embeddings for flagged content examples
2. For new user-generated content:
   - Generate multimodal embedding (image + caption)
   - Calculate similarity with flagged examples
   - Flag if similarity exceeds threshold
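A small thresholding sketch for the comparison step; the flagged-example vectors and the 0.85 cut-off are illustrative and should be tuned against your own data:

```javascript
// Flag content whose embedding is too close to any known-flagged example.
const flaggedExamples = [
  [0.9, 0.1, 0.0],
  [0.2, 0.8, 0.1],
];
const THRESHOLD = 0.85; // tune for your use case

const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

function shouldFlag(contentEmbedding) {
  const maxSimilarity = Math.max(...flaggedExamples.map(f => dot(contentEmbedding, f)));
  return maxSimilarity >= THRESHOLD;
}

console.log(shouldFlag([0.88, 0.15, 0.02])); // decision for one piece of new content
```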
Cross-Modal Recommendations
Scenario: "Find images like this text description"
1. Generate text embedding for description
2. Compare with pre-computed image embeddings
3. Return most similar images
Or the reverse: "Find descriptions for this image"
Duplicate Detection
Use Case: Find duplicate listings with different images/text
1. Generate multimodal embeddings for all listings
2. Calculate pairwise similarities
3. Identify high-similarity pairs (>90%)
4. Flag potential duplicates for review
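A pairwise-comparison sketch for steps 2 and 3 (illustrative vectors); for large catalogs, replace the O(n²) loop with a nearest-neighbor query against a vector database:

```javascript
// Compare every pair of listing embeddings and collect pairs above the threshold.
const listings = [
  { id: 1, embedding: [0.57, 0.57, 0.59] },
  { id: 2, embedding: [0.58, 0.56, 0.59] },
  { id: 3, embedding: [0.95, 0.20, 0.24] },
];
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

const duplicates = [];
for (let i = 0; i < listings.length; i++) {
  for (let j = i + 1; j < listings.length; j++) {
    const similarity = dot(listings[i].embedding, listings[j].embedding);
    if (similarity > 0.9) duplicates.push({ pair: [listings[i].id, listings[j].id], similarity });
  }
}
console.log(duplicates); // candidate duplicates for manual review
```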
Tips
- Text + Image: Use both for best semantic representation
- Image Only: Best for visual similarity search
- Text Only: Use Get Text Embeddings node instead (better optimized)
- Preprocessing: Resize large images before embedding to reduce latency
- Batch Processing: Reuse connection for multiple requests
- Vector Database: Store embeddings in specialized vector DB (Pinecone, Weaviate)
- Similarity Metric: Use cosine similarity for comparing embeddings
- Normalization: Embeddings are normalized for efficient comparison
- Cross-Modal: Text and image embeddings share the same vector space
Common Patterns
Cross-Modal Similarity Search
```javascript
// Compare text query with image embeddings
const textEmbedding = textResponse.predictions[0].textEmbedding;
const imageEmbedding = imageResponse.predictions[0].imageEmbedding;

// Cosine similarity (embeddings are pre-normalized)
const similarity = dotProduct(textEmbedding, imageEmbedding);

function dotProduct(vec1, vec2) {
  return vec1.reduce((sum, val, i) => sum + val * vec2[i], 0);
}
```
Hybrid Search Implementation
Combine text and image search:
1. Generate embeddings for query (text + optional image)
2. Split into text and image components
3. Search text embeddings (if query has text)
4. Search image embeddings (if query has image)
5. Merge and rank results by combined similarity
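A merging sketch for step 5, assuming the text-side and image-side searches each return a map of item ID to similarity score; the 0.5 weighting is an assumption to adjust per use case:

```javascript
// Combine text and image similarity scores into a single ranked result list.
function mergeResults(textScores, imageScores, textWeight = 0.5) {
  const ids = new Set([...textScores.keys(), ...imageScores.keys()]);
  return [...ids]
    .map(id => ({
      id,
      score: textWeight * (textScores.get(id) || 0) + (1 - textWeight) * (imageScores.get(id) || 0),
    }))
    .sort((a, b) => b.score - a.score);
}

// Illustrative scores from the two searches
const byText = new Map([["sku-1", 0.82], ["sku-2", 0.40]]);
const byImage = new Map([["sku-2", 0.91], ["sku-3", 0.55]]);
console.log(mergeResults(byText, byImage));
```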
Image Specifications
Supported Formats
- JPEG/JPG: Most common, good compression
- PNG: Lossless, good for graphics
- GIF: Animated support (first frame used)
- BMP: Uncompressed
- WebP: Modern format, good compression
Recommended Settings
- Resolution: 224x224 to 1024x1024 pixels
- File Size: Under 5MB for best performance
- Aspect Ratio: Any (model handles resizing)
- Color Mode: RGB or grayscale
Preprocessing Tips
- Crop/resize oversized images
- Convert unsupported formats to JPEG
- Remove image metadata to reduce size
- Compress high-resolution images
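A preprocessing sketch using the sharp npm package (an assumption; any image library that can resize and re-encode to JPEG works). sharp drops metadata by default, which also covers the size tip above:

```javascript
// Downscale oversized images and re-encode as compressed JPEG before embedding.
const sharp = require("sharp"); // assumption: npm install sharp

async function prepareImage(inputPath, outputPath) {
  await sharp(inputPath)
    .resize(1024, 1024, { fit: "inside", withoutEnlargement: true }) // cap the longer side at 1024px
    .jpeg({ quality: 85 })                                           // convert/compress to JPEG
    .toFile(outputPath);                                             // metadata is stripped by default
  return outputPath; // use this path as Target Image Path
}
```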
Performance Optimization
- Image Optimization: Compress and resize before API call
- Caching: Cache embeddings for frequently accessed content
- Connection Reuse: One connection for multiple requests
- Regional Endpoints: Use closest region to reduce latency
- Async Processing: Process embeddings in parallel when possible
- Batch Strategy: Process similar items together
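A small caching sketch for the second point: key the cache on a hash of the inputs so repeated content never triggers a second API call. getEmbedding is a hypothetical stand-in for the node call:

```javascript
// Cache embeddings by a hash of the inputs (text plus base64-encoded image).
const crypto = require("crypto");
const cache = new Map();

async function cachedEmbedding(text, imageBase64, getEmbedding) {
  const key = crypto
    .createHash("sha256")
    .update(text || "")
    .update(imageBase64 || "")
    .digest("hex");
  if (!cache.has(key)) cache.set(key, await getEmbedding(text, imageBase64));
  return cache.get(key);
}
```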
Best Practices
- Validate image file exists before calling API
- Use try-catch for file read operations
- Store model version with embeddings
- Implement retry logic for transient errors
- Monitor API usage and costs
- Test with sample data first
- Document similarity thresholds for your use case
- Use appropriate embedding type (text vs multimodal) for your needs
- Consider embedding dimensionality in storage planning
- Implement proper error handling for file operations
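A combined sketch of two of these practices (validating the image path up front and retrying transient errors with exponential backoff); callNode is a hypothetical stand-in for invoking Get Multimodal Embeddings:

```javascript
// Fail fast on a missing file, then retry transient API errors with backoff.
const fs = require("fs");

async function embedWithRetry(text, imagePath, callNode, maxAttempts = 3) {
  if (imagePath && !fs.existsSync(imagePath)) {
    throw new Error(`Image not found: ${imagePath}`); // avoid a doomed API call
  }
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await callNode(text, imagePath);
    } catch (err) {
      if (attempt === maxAttempts) throw err;                            // give up after the last attempt
      await new Promise(r => setTimeout(r, 1000 * 2 ** (attempt - 1))); // wait 1s, 2s, 4s, ...
    }
  }
}
```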