Get Multimodal Embeddings
Generates semantic embeddings for text, images, or both using Google Vertex AI's multimodal embedding models for cross-modal search and similarity.
Common Properties
- Name - The custom name of the node.
- Color - The custom color of the node.
- Delay Before (sec) - Waits in seconds before executing the node.
- Delay After (sec) - Waits in seconds after executing the node.
- Continue On Error - Automation will continue regardless of any error. The default value is false.
info
If the ContinueOnError property is true, no error is caught when the project is executed, even if a Catch node is used.
Inputs
- Connection Id - Vertex AI client session identifier from Connect node (optional if credentials provided directly).
- Credentials - Google Cloud service account credentials (optional if using Connection ID).
- Project Id - Google Cloud Project ID (required if using direct credentials).
- Text - Text content to generate embeddings for (optional if image provided).
- Target Image Path - Local file path to the image for embedding generation (optional if text provided).
Options
Model Configuration
- Model - Multimodal embedding model to use. Default is "multimodalembedding@001".
Endpoint Configuration
- Locations - Google Cloud region for the Vertex AI endpoint. Default is "us-central1".
- Publishers - Model publisher (typically "google"). Default is "google".
Output
- Response - Full API response object containing embedding vectors.
Response structure:
```json
{
  "predictions": [
    {
      "imageEmbedding": [0.123, -0.456, 0.789, ...],
      "textEmbedding": [0.234, -0.567, 0.890, ...]
    }
  ]
}
```
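For reference, a minimal sketch of pulling the vectors out of the Response object; the sample values are placeholders and real vectors contain many more values:

```javascript
// Illustrative response shaped like the structure above; real vectors are much longer.
const response = {
  predictions: [
    { imageEmbedding: [0.123, -0.456, 0.789], textEmbedding: [0.234, -0.567, 0.890] },
  ],
};

const prediction = response.predictions[0];
const imageVector = prediction.imageEmbedding || null; // present when an image was supplied
const textVector = prediction.textEmbedding || null;   // present when text was supplied
console.log((imageVector || []).length, (textVector || []).length); // vector dimensionality
```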
How It Works
The Get Multimodal Embeddings node generates embeddings that can represent text, images, or both in a unified vector space. When executed:
- Validates connection (either via Connection ID or direct credentials)
- Retrieves authentication token and project ID
- Validates that at least one input (text or image) is provided
- If an image path is provided:
  - Reads the image file from the local filesystem
  - Encodes the image as a base64 string
- Constructs the request payload based on the inputs:
  - Text + Image: both embeddings in a unified space
  - Image only: image embedding only
  - Text only: returns an error (use the Get Text Embeddings node instead)
- Sends a POST request to the Vertex AI predict endpoint
- Processes the response and returns the complete response object with the embedding vectors
The multimodal model projects text and images into the same vector space, enabling cross-modal similarity search (e.g., find images similar to a text query).
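For context, a raw REST call equivalent to what the node performs might look like the sketch below (Node.js 18+ for the built-in fetch). The endpoint and payload shape follow the publicly documented Vertex AI prediction API for multimodalembedding@001; PROJECT_ID, LOCATION, and ACCESS_TOKEN are placeholders, and in practice the node assembles and sends this request for you.

```javascript
// Sketch of the equivalent raw predict call; placeholders must be replaced with real values.
const fs = require("fs");

async function getMultimodalEmbedding(text, imagePath) {
  const PROJECT_ID = "my-project";                 // placeholder: your Google Cloud project ID
  const LOCATION = "us-central1";                  // matches the Locations option
  const ACCESS_TOKEN = "REPLACE_WITH_ACCESS_TOKEN"; // placeholder: token from the service account

  // Build the instance from whichever inputs are present
  const instance = {};
  if (text) instance.text = text;
  if (imagePath) {
    // Read the image and base64-encode it, as described in the steps above
    instance.image = { bytesBase64Encoded: fs.readFileSync(imagePath).toString("base64") };
  }

  const url =
    `https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}` +
    `/locations/${LOCATION}/publishers/google/models/multimodalembedding@001:predict`;

  const res = await fetch(url, {
    method: "POST",
    headers: { Authorization: `Bearer ${ACCESS_TOKEN}`, "Content-Type": "application/json" },
    body: JSON.stringify({ instances: [instance] }),
  });
  return res.json(); // { predictions: [{ imageEmbedding, textEmbedding }] }
}
```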
Requirements
- Either:
  - Connection ID from Connect node, OR
  - Direct credentials + Project ID
- At least one of:
  - Text input, OR
  - Valid image file path
- Image requirements (if provided):
  - Supported formats: JPEG, PNG, GIF, BMP, WebP
  - File must be accessible on local filesystem
  - Recommended max size: 5MB
- Vertex AI API enabled in Google Cloud project
- IAM permissions:
  - aiplatform.endpoints.predict
Error Handling
Common errors and solutions:
| Error | Cause | Solution |
|---|---|---|
| ErrInvalidArg | Both text and image empty | Provide at least one input (text or image) |
| ErrEmptyArg | Text provided without image | Use Get Text Embeddings node for text-only |
| File not found | Invalid image path | Verify image file path exists and is accessible |
| ErrInvalidArg | Connection ID or credentials missing | Use Connect node or provide credentials |
| ErrNotFound | Connection not found | Verify Connection ID from Connect node |
| ErrStatus | API error (quota, permissions) | Check Google Cloud Console for API status |
| Invalid image format | Unsupported image type | Convert to JPEG, PNG, GIF, BMP, or WebP |
Example Use Cases
Image-to-Text Search
Scenario: Search product descriptions by uploading a photo
1. Connect to Vertex AI
2. Pre-generate text embeddings for all product descriptions
3. User uploads product photo
4. Get Multimodal Embeddings (image only)
5. Compare image embedding with stored text embeddings
6. Return products with highest similarity
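A minimal ranking sketch for steps 5 and 6, assuming the catalog's text embeddings are already stored and the query image embedding has just been generated (the short vectors shown are illustrative):

```javascript
// Rank stored product-description embeddings against a query image embedding.
const catalog = [
  { sku: "A1", textEmbedding: [0.1, 0.7, 0.2] },
  { sku: "B2", textEmbedding: [0.6, 0.1, 0.3] },
];
const queryImageEmbedding = [0.2, 0.6, 0.1];

// Dot product works as cosine similarity because the returned embeddings are normalized
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

const ranked = catalog
  .map(p => ({ sku: p.sku, score: dot(queryImageEmbedding, p.textEmbedding) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked.slice(0, 3)); // products with the highest similarity
```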
Visual Product Catalog
Use Case: Find similar products by image + description
Setup:
1. For each product:
   - Text: Product title + description
   - Image: Product photo
   - Generate multimodal embedding
   - Store in vector database
Search:
1. User provides text query or image
2. Generate embedding for query
3. Find nearest neighbors in catalog
4. Return similar products
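A compact sketch of the setup phase above, assuming a hypothetical getMultimodalEmbedding helper that wraps the node call and returns the Response object described earlier:

```javascript
// Build one index record per product and keep both returned vectors with the metadata.
// getMultimodalEmbedding is a hypothetical stand-in for invoking this node.
async function buildCatalogIndex(products, getMultimodalEmbedding) {
  const index = [];
  for (const p of products) {
    const response = await getMultimodalEmbedding(`${p.title}. ${p.description}`, p.imagePath);
    const pred = response.predictions[0];
    index.push({ sku: p.sku, textEmbedding: pred.textEmbedding, imageEmbedding: pred.imageEmbedding });
  }
  return index; // in practice, write these records to a vector database
}
```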
Content Moderation
Process: Flag inappropriate image-text pairs
1. Get Multimodal Embeddings for flagged content examples
2. For new user-generated content:
   - Generate multimodal embedding (image + caption)
   - Calculate similarity with flagged examples
   - Flag if similarity exceeds threshold
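A small thresholding sketch for the comparison step; the flagged-example vectors and the 0.85 cut-off are illustrative and should be tuned against your own data:

```javascript
// Flag content whose embedding is too close to any known-flagged example.
const flaggedExamples = [
  [0.9, 0.1, 0.0],
  [0.2, 0.8, 0.1],
];
const THRESHOLD = 0.85; // tune for your use case

const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

function shouldFlag(contentEmbedding) {
  const maxSimilarity = Math.max(...flaggedExamples.map(f => dot(contentEmbedding, f)));
  return maxSimilarity >= THRESHOLD;
}

console.log(shouldFlag([0.88, 0.15, 0.02])); // decision for one piece of new content
```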
Cross-Modal Recommendations
Scenario: "Find images like this text description"
1. Generate text embedding for description
2. Compare with pre-computed image embeddings
3. Return most similar images
Or the reverse: "Find descriptions for this image"
Duplicate Detection
Use Case: Find duplicate listings with different images/text
1. Generate multimodal embeddings for all listings
2. Calculate pairwise similarities
3. Identify high-similarity pairs (>90%)
4. Flag potential duplicates for review
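A pairwise-comparison sketch for steps 2 and 3 (illustrative vectors); for large catalogs, replace the O(n²) loop with a nearest-neighbor query against a vector database:

```javascript
// Compare every pair of listing embeddings and collect pairs above the threshold.
const listings = [
  { id: 1, embedding: [0.57, 0.57, 0.59] },
  { id: 2, embedding: [0.58, 0.56, 0.59] },
  { id: 3, embedding: [0.95, 0.20, 0.24] },
];
const dot = (a, b) => a.reduce((sum, v, i) => sum + v * b[i], 0);

const duplicates = [];
for (let i = 0; i < listings.length; i++) {
  for (let j = i + 1; j < listings.length; j++) {
    const similarity = dot(listings[i].embedding, listings[j].embedding);
    if (similarity > 0.9) duplicates.push({ pair: [listings[i].id, listings[j].id], similarity });
  }
}
console.log(duplicates); // candidate duplicates for manual review
```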
Tips
- Text + Image: Use both for best semantic representation
- Image Only: Best for visual similarity search
- Text Only: Use Get Text Embeddings node instead (better optimized)
- Preprocessing: Resize large images before embedding to reduce latency
- Batch Processing: Reuse connection for multiple requests
- Vector Database: Store embeddings in specialized vector DB (Pinecone, Weaviate)
- Similarity Metric: Use cosine similarity for comparing embeddings
- Normalization: Embeddings are normalized for efficient comparison
- Cross-Modal: Text and image embeddings share the same vector space
Common Patterns
Cross-Modal Similarity Search
```javascript
// Compare text query with image embeddings
const textEmbedding = textResponse.predictions[0].textEmbedding;
const imageEmbedding = imageResponse.predictions[0].imageEmbedding;

// Cosine similarity (embeddings are pre-normalized)
const similarity = dotProduct(textEmbedding, imageEmbedding);

function dotProduct(vec1, vec2) {
  return vec1.reduce((sum, val, i) => sum + val * vec2[i], 0);
}
```
Hybrid Search Implementation
Combine text and image search:
1. Generate embeddings for query (text + optional image)
2. Split into text and image components
3. Search text embeddings (if query has text)
4. Search image embeddings (if query has image)
5. Merge and rank results by combined similarity
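A merging sketch for step 5, assuming the text-side and image-side searches each return a map of item ID to similarity score; the 0.5 weighting is an assumption to adjust per use case:

```javascript
// Combine text and image similarity scores into a single ranked result list.
function mergeResults(textScores, imageScores, textWeight = 0.5) {
  const ids = new Set([...textScores.keys(), ...imageScores.keys()]);
  return [...ids]
    .map(id => ({
      id,
      score: textWeight * (textScores.get(id) || 0) + (1 - textWeight) * (imageScores.get(id) || 0),
    }))
    .sort((a, b) => b.score - a.score);
}

// Illustrative scores from the two searches
const byText = new Map([["sku-1", 0.82], ["sku-2", 0.40]]);
const byImage = new Map([["sku-2", 0.91], ["sku-3", 0.55]]);
console.log(mergeResults(byText, byImage));
```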
Image Specifications
Supported Formats
- JPEG/JPG: Most common, good compression
- PNG: Lossless, good for graphics
- GIF: Animated support (first frame used)
- BMP: Uncompressed
- WebP: Modern format, good compression
Recommended Settings
- Resolution: 224x224 to 1024x1024 pixels
- File Size: Under 5MB for best performance
- Aspect Ratio: Any (model handles resizing)
- Color Mode: RGB or grayscale
Preprocessing Tips
- Crop/resize oversized images
- Convert unsupported formats to JPEG
- Remove image metadata to reduce size
- Compress high-resolution images
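A preprocessing sketch using the sharp npm package (an assumption; any image library that can resize and re-encode to JPEG works). sharp drops metadata by default, which also covers the size tip above:

```javascript
// Downscale oversized images and re-encode as compressed JPEG before embedding.
const sharp = require("sharp"); // assumption: npm install sharp

async function prepareImage(inputPath, outputPath) {
  await sharp(inputPath)
    .resize(1024, 1024, { fit: "inside", withoutEnlargement: true }) // cap the longer side at 1024px
    .jpeg({ quality: 85 })                                           // convert/compress to JPEG
    .toFile(outputPath);                                             // metadata is stripped by default
  return outputPath; // use this path as Target Image Path
}
```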
Performance Optimization
- Image Optimization: Compress and resize before API call
- Caching: Cache embeddings for frequently accessed content
- Connection Reuse: One connection for multiple requests
- Regional Endpoints: Use closest region to reduce latency
- Async Processing: Process embeddings in parallel when possible
- Batch Strategy: Process similar items together
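A small caching sketch for the second point: key the cache on a hash of the inputs so repeated content never triggers a second API call. getEmbedding is a hypothetical stand-in for the node call:

```javascript
// Cache embeddings by a hash of the inputs (text plus base64-encoded image).
const crypto = require("crypto");
const cache = new Map();

async function cachedEmbedding(text, imageBase64, getEmbedding) {
  const key = crypto
    .createHash("sha256")
    .update(text || "")
    .update(imageBase64 || "")
    .digest("hex");
  if (!cache.has(key)) cache.set(key, await getEmbedding(text, imageBase64));
  return cache.get(key);
}
```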
Best Practices
- Validate image file exists before calling API
- Use try-catch for file read operations
- Store model version with embeddings
- Implement retry logic for transient errors
- Monitor API usage and costs
- Test with sample data first
- Document similarity thresholds for your use case
- Use appropriate embedding type (text vs multimodal) for your needs
- Consider embedding dimensionality in storage planning
- Implement proper error handling for file operations
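A combined sketch of two of these practices (validating the image path up front and retrying transient errors with exponential backoff); callNode is a hypothetical stand-in for invoking Get Multimodal Embeddings:

```javascript
// Fail fast on a missing file, then retry transient API errors with backoff.
const fs = require("fs");

async function embedWithRetry(text, imagePath, callNode, maxAttempts = 3) {
  if (imagePath && !fs.existsSync(imagePath)) {
    throw new Error(`Image not found: ${imagePath}`); // avoid a doomed API call
  }
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await callNode(text, imagePath);
    } catch (err) {
      if (attempt === maxAttempts) throw err;                            // give up after the last attempt
      await new Promise(r => setTimeout(r, 1000 * 2 ** (attempt - 1))); // wait 1s, 2s, 4s, ...
    }
  }
}
```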