
Get Multimodal Embeddings

Generates semantic embeddings for text, images, or both using Google Vertex AI's multimodal embedding models for cross-modal search and similarity.

Common Properties

  • Name - The custom name of the node.
  • Color - The custom color of the node.
  • Delay Before (sec) - Waits in seconds before executing the node.
  • Delay After (sec) - Waits in seconds after executing the node.
  • Continue On Error - The automation continues regardless of any error. The default value is false.
info

If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.

Inputs

  • Connection Id - Vertex AI client session identifier from Connect node (optional if credentials provided directly).
  • Credentials - Google Cloud service account credentials (optional if using Connection ID).
  • Project Id - Google Cloud Project ID (required if using direct credentials).
  • Text - Text content to generate embeddings for (optional if image provided).
  • Target Image Path - Local file path to the image for embedding generation (optional if text provided).

Options

Model Configuration

  • Model - Multimodal embedding model to use. Default is "multimodalembedding@001".

Endpoint Configuration

  • Locations - Google Cloud region for the Vertex AI endpoint. Default is "us-central1".
  • Publishers - Model publisher (typically "google"). Default is "google".

Output

  • Response - Full API response object containing embedding vectors.

Response structure:

{
  "predictions": [
    {
      "imageEmbedding": [0.123, -0.456, 0.789, ...],
      "textEmbedding": [0.234, -0.567, 0.890, ...]
    }
  ]
}
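
Reading the embedding vectors out of the Response output might look like the sketch below. The field names follow the structure above; either field can be absent when only one modality was supplied.

// Hypothetical helper for unpacking the node's Response output.
function extractVectors(response) {
  const prediction = response.predictions[0];
  return {
    imageVector: prediction.imageEmbedding, // undefined if no image was sent
    textVector: prediction.textEmbedding,   // undefined if no text was sent
  };
}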

How It Works

The Get Multimodal Embeddings node generates embeddings that can represent text, images, or both in a unified vector space. When executed:

  1. Validates connection (either via Connection ID or direct credentials)
  2. Retrieves authentication token and project ID
  3. Validates that at least one input (text or image) is provided
  4. If an image path is provided:
    • Reads the image file from the local filesystem
    • Encodes the image as a base64 string
  5. Constructs the request payload based on the inputs:
    • Text + Image: Both embeddings in a unified space
    • Image only: Image embedding
    • Text only: Returns an error (use the Get Text Embeddings node instead)
  6. Sends POST request to Vertex AI predict endpoint
  7. Processes response and returns embedding vectors
  8. Returns complete response object with embeddings

The multimodal model projects text and images into the same vector space, enabling cross-modal similarity search (e.g., find images similar to a text query).
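
As a rough sketch, the request sent in step 6 likely resembles the standard Vertex AI predict call for multimodalembedding@001. The endpoint and payload shape below follow the public REST API; the node's exact internals may differ, and the project ID, image path, and text are placeholders.

// Sketch only: builds the predict endpoint URL and payload from the node's
// inputs and options. Placeholder values are used throughout.
const fs = require("fs");

const location = "us-central1";            // Locations option
const publisher = "google";                // Publishers option
const model = "multimodalembedding@001";   // Model option
const projectId = "my-gcp-project";        // Project Id input (placeholder)

const endpoint =
  `https://${location}-aiplatform.googleapis.com/v1/projects/${projectId}` +
  `/locations/${location}/publishers/${publisher}/models/${model}:predict`;

// Steps 4-5: read the image, base64-encode it, and build the instance.
const imageBase64 = fs.readFileSync("product.jpg").toString("base64");

const payload = {
  instances: [
    {
      text: "red leather handbag",                 // Text input (optional)
      image: { bytesBase64Encoded: imageBase64 },  // from Target Image Path
    },
  ],
};

// Step 6: POST `payload` to `endpoint` with an OAuth 2.0 bearer token.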

Requirements

  • Either:
    • Connection ID from Connect node, OR
    • Direct credentials + Project ID
  • At least one of:
    • Text input, OR
    • Valid image file path
  • Image requirements (if provided):
    • Supported formats: JPEG, PNG, GIF, BMP, WebP
    • File must be accessible on local filesystem
    • Recommended max size: 5MB
  • Vertex AI API enabled in Google Cloud project
  • IAM permissions: aiplatform.endpoints.predict

Error Handling

Common errors and solutions:

Error | Cause | Solution
--- | --- | ---
ErrInvalidArg | Both text and image empty | Provide at least one input (text or image)
ErrEmptyArg | Text provided without image | Use the Get Text Embeddings node for text-only input
File not found | Invalid image path | Verify the image file path exists and is accessible
ErrInvalidArg | Connection ID or credentials missing | Use a Connect node or provide credentials
ErrNotFound | Connection not found | Verify the Connection ID from the Connect node
ErrStatus | API error (quota, permissions) | Check the Google Cloud Console for API status
Invalid image format | Unsupported image type | Convert to JPEG, PNG, GIF, BMP, or WebP

Example Use Cases

Image-to-Text Product Search

Scenario: Search product descriptions by uploading a photo
1. Connect to Vertex AI
2. Pre-generate text embeddings for all product descriptions
3. The user uploads a product photo
4. Get Multimodal Embeddings (image only)
5. Compare the image embedding with the stored text embeddings (see the sketch after this list)
6. Return the products with the highest similarity
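
A minimal sketch of step 5, assuming the stored text embeddings are kept as an array of { id, textEmbedding } objects (illustrative shape, not something the node provides):

// Rank stored product text embeddings against the image embedding returned
// by Get Multimodal Embeddings. The vectors are pre-normalized, so the dot
// product equals the cosine similarity.
function dotProduct(a, b) {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

function rankBySimilarity(imageEmbedding, products) {
  return products
    .map((p) => ({ id: p.id, score: dotProduct(imageEmbedding, p.textEmbedding) }))
    .sort((a, b) => b.score - a.score); // highest similarity first
}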

Visual Product Catalog

Use Case: Find similar products by image + description

Setup:
1. For each product:
   - Text: Product title + description
   - Image: Product photo
   - Generate a multimodal embedding
   - Store it in a vector database

Search:
1. User provides a text query or an image
2. Generate an embedding for the query
3. Find the nearest neighbors in the catalog
4. Return similar products

Content Moderation

Process: Flag inappropriate image-text pairs
1. Get Multimodal Embeddings for flagged content examples
2. For each piece of new user-generated content:
   - Generate a multimodal embedding (image + caption)
   - Calculate its similarity with the flagged examples (see the sketch after this list)
   - Flag it if the similarity exceeds a threshold
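
One possible shape for the similarity check, assuming the flagged examples' embeddings are already in memory; the 0.8 threshold is purely illustrative and should be tuned for your data:

// Flag content if its multimodal embedding is close to any flagged example.
function cosine(a, b) {
  // Pre-normalized vectors: the dot product equals the cosine similarity.
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

function shouldFlag(contentEmbedding, flaggedEmbeddings, threshold = 0.8) {
  return flaggedEmbeddings.some((f) => cosine(contentEmbedding, f) >= threshold);
}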

Cross-Modal Recommendations

Scenario: "Find images like this text description"
1. Generate text embedding for description
2. Compare with pre-computed image embeddings
3. Return most similar images
Or the reverse: "Find descriptions for this image"

Duplicate Detection

Use Case: Find duplicate listings with different images/text
1. Generate multimodal embeddings for all listings
2. Calculate pairwise similarities (see the sketch after this list)
3. Identify high-similarity pairs (>90%)
4. Flag potential duplicates for review
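
A sketch of steps 2-4 using a naive O(n²) pairwise comparison (fine for small catalogs; at scale, a vector database is preferable). The listing shape { id, embedding } is an assumption for illustration:

function dot(a, b) {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Return listing pairs whose cosine similarity exceeds the threshold.
function findLikelyDuplicates(listings, threshold = 0.9) {
  const pairs = [];
  for (let i = 0; i < listings.length; i++) {
    for (let j = i + 1; j < listings.length; j++) {
      const score = dot(listings[i].embedding, listings[j].embedding);
      if (score > threshold) {
        pairs.push({ a: listings[i].id, b: listings[j].id, score });
      }
    }
  }
  return pairs;
}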

Tips

  • Text + Image: Use both for best semantic representation
  • Image Only: Best for visual similarity search
  • Text Only: Use Get Text Embeddings node instead (better optimized)
  • Preprocessing: Resize large images before embedding to reduce latency
  • Batch Processing: Reuse connection for multiple requests
  • Vector Database: Store embeddings in specialized vector DB (Pinecone, Weaviate)
  • Similarity Metric: Use cosine similarity for comparing embeddings
  • Normalization: Embeddings are normalized for efficient comparison
  • Cross-Modal: Text and image embeddings share the same vector space

Common Patterns

// Compare text query with image embeddings
const textEmbedding = textResponse.predictions[0].textEmbedding;
const imageEmbedding = imageResponse.predictions[0].imageEmbedding;

// Cosine similarity (embeddings are pre-normalized)
const similarity = dotProduct(textEmbedding, imageEmbedding);

function dotProduct(vec1, vec2) {
  return vec1.reduce((sum, val, i) => sum + val * vec2[i], 0);
}

Hybrid Search Implementation

Combine text and image search:
1. Generate embeddings for query (text + optional image)
2. Split into text and image components
3. Search text embeddings (if query has text)
4. Search image embeddings (if query has image)
5. Merge and rank results by combined similarity (see the sketch after this list)
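
A hedged sketch of step 5, merging text and image similarity scores with a simple weighted sum; the 0.5/0.5 weights and the Map-based score shape are assumptions made to illustrate the idea:

// Merge per-item scores from the text search and the image search,
// then rank by the combined score.
function mergeResults(textScores, imageScores, wText = 0.5, wImage = 0.5) {
  // textScores / imageScores: Map of item id -> similarity score.
  const combined = new Map();
  for (const [id, s] of textScores) {
    combined.set(id, (combined.get(id) || 0) + wText * s);
  }
  for (const [id, s] of imageScores) {
    combined.set(id, (combined.get(id) || 0) + wImage * s);
  }
  return [...combined.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}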

Image Specifications

Supported Formats

  • JPEG/JPG: Most common, good compression
  • PNG: Lossless, good for graphics
  • GIF: Animated support (first frame used)
  • BMP: Uncompressed
  • WebP: Modern format, good compression

Image Requirements

  • Resolution: 224x224 to 1024x1024 pixels
  • File Size: Under 5MB for best performance
  • Aspect Ratio: Any (the model handles resizing)
  • Color Mode: RGB or grayscale

Preprocessing Tips

  • Crop/resize oversized images (see the sketch after this list)
  • Convert unsupported formats to JPEG
  • Remove image metadata to reduce size
  • Compress high-resolution images
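
If your automation runs on Node.js, a preprocessing pass might look like the sketch below. It assumes the third-party sharp library is installed (npm install sharp), which is not part of this node:

// Resize oversized images and re-encode them as JPEG before sending them
// to the embedding endpoint. sharp drops metadata by default, which also
// reduces file size.
const sharp = require("sharp");

async function preprocessImage(inputPath, outputPath) {
  await sharp(inputPath)
    .resize(1024, 1024, { fit: "inside", withoutEnlargement: true }) // cap resolution
    .jpeg({ quality: 80 })                                           // compress
    .toFile(outputPath);
  return outputPath;
}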

Performance Optimization

  • Image Optimization: Compress and resize before API call
  • Caching: Cache embeddings for frequently accessed content
  • Connection Reuse: One connection for multiple requests
  • Regional Endpoints: Use closest region to reduce latency
  • Async Processing: Process embeddings in parallel when possible (see the sketch after this list)
  • Batch Strategy: Process similar items together
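
A sketch combining the caching and async tips. It assumes a hypothetical getMultimodalEmbedding(imagePath) helper that wraps this node; that helper is not an API the node actually exposes.

// Cache embeddings by image path and process uncached items in parallel.
const embeddingCache = new Map();

async function embedAll(imagePaths, getMultimodalEmbedding) {
  return Promise.all(
    imagePaths.map(async (path) => {
      if (!embeddingCache.has(path)) {
        // Store the promise so duplicate paths do not trigger duplicate calls.
        embeddingCache.set(path, getMultimodalEmbedding(path));
      }
      return { path, embedding: await embeddingCache.get(path) };
    })
  );
}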

Best Practices

  • Validate that the image file exists before calling the API (see the sketch after this list)
  • Use try-catch for file read operations
  • Store model version with embeddings
  • Implement retry logic for transient errors
  • Monitor API usage and costs
  • Test with sample data first
  • Document similarity thresholds for your use case
  • Use appropriate embedding type (text vs multimodal) for your needs
  • Consider embedding dimensionality in storage planning
  • Implement proper error handling for file operations
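
For the file-handling practices above, a defensive read might look like this (Node.js, illustrative only):

const fs = require("fs");

// Validate the image path and read it safely before calling the API.
function readImageBase64(imagePath) {
  if (!fs.existsSync(imagePath)) {
    throw new Error(`Image file not found: ${imagePath}`);
  }
  try {
    return fs.readFileSync(imagePath).toString("base64");
  } catch (err) {
    throw new Error(`Failed to read image ${imagePath}: ${err.message}`);
  }
}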