Visual search 101: what happens when a shopper uploads a photo.
A shopper sees someone's handbag on Instagram. They don't know the brand. They can't describe the pattern. They have a picture. This is the exact query keyword search can't answer — and it's the one Trooply exists to answer in under 200 ms.
The mental model
Forget "image search" for a second. Trooply is a similarity engine. When you index a product, we convert its image into a 768-dimensional vector (a point in a very high-dimensional space) using a vision model called CLIP ViT-L/14. Two products that look alike end up close together in that space. Two products that don't look alike end up far apart. That's the whole trick.
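To make "close together" concrete, here is a minimal sketch that scores two vectors by cosine similarity. Cosine is a common choice for CLIP embeddings, but the metric is an assumption here, and the 4-dimensional vectors are made-up stand-ins for real 768-dimensional ones.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy embeddings, invented for illustration only.
red_tote = [0.9, 0.1, 0.3, 0.0]
maroon_tote = [0.8, 0.2, 0.35, 0.05]
white_sneaker = [0.0, 0.9, 0.1, 0.6]

# The two totes point in nearly the same direction; the sneaker does not.
print(cosine_similarity(red_tote, maroon_tote) > cosine_similarity(red_tote, white_sneaker))
```

With these toy numbers the two totes score far higher against each other than either does against the sneaker, which is the property the index relies on.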
The clever part of CLIP — and the reason Trooply handles text queries and image queries on the same index — is that CLIP maps images and captions into the same space. A photo of a red leather tote and the phrase "red leather tote" land near each other, because CLIP was trained on 400 million image/caption pairs to enforce exactly that. So when a shopper types "red leather tote" we embed the text, look up the nearest products, and get back the same kind of results we'd get if they'd uploaded a photo.
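The sketch below illustrates why one index can serve both query types: once a query is a vector, the lookup no longer cares whether it came from a photo or a phrase. The product IDs, the 3-dimensional vectors, and the brute-force scan are all illustrative assumptions; a real deployment would get its vectors from CLIP and use an approximate nearest-neighbour index.

```python
import math

def normalise(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / n for x in v]

def top_k(query_vec, index, k):
    """Brute-force nearest neighbours: score every product, keep the best k."""
    q = normalise(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalise(vec))), product_id)
        for product_id, vec in index.items()
    ]
    scored.sort(reverse=True)
    return [product_id for _, product_id in scored[:k]]

# Hypothetical product embeddings (toy values, not real CLIP output).
index = {
    "SKU-TOTE": [0.9, 0.1, 0.3],
    "SKU-SNEAKER": [0.1, 0.9, 0.2],
    "SKU-SCARF": [0.3, 0.2, 0.9],
}

# Pretend these came from the image encoder and the text encoder respectively.
image_query_vec = [0.85, 0.15, 0.25]  # stand-in for a photo of a red leather tote
text_query_vec = [0.8, 0.05, 0.4]     # stand-in for the phrase "red leather tote"

print(top_k(image_query_vec, index, 2))
print(top_k(text_query_vec, index, 2))
```

Both calls go through the same `top_k` function and the same index; only the encoder that produced the query vector differs.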
What the call actually looks like
The minimum viable visual search is a single request. Send an image URL, get the top-K products back:
curl -X POST https://search.trooply.ai/v1/search/url \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"image_url":"https://cdn.shop.com/photos/bag.jpg","limit":10}'
You get back a results array sorted by similarity, plus a query_time_ms number. Each hit carries its full metadata payload — whatever you indexed the product with — so the storefront renders the card without a second round-trip.
{
"query_time_ms": 142,
"count": 10,
"results": [
{
"product_id": "SKU-4821",
"similarity_score": 0.87,
"metadata": {"name": "Marla Tote", "price": 189, "category": "Handbags"}
},
...
]
}
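For storefront code that prefers Python to curl, the same call might look like the sketch below. It uses only the standard library and the endpoint, headers, and fields shown above; the function names are hypothetical, and the sample body is the response above trimmed to a single hit.

```python
import json
import urllib.request

SEARCH_URL = "https://search.trooply.ai/v1/search/url"

def build_search_request(image_url, token, limit=10):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = json.dumps({"image_url": image_url, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def summarise_hits(response_body):
    """Reduce a response body to (product_id, similarity_score, name) tuples."""
    data = json.loads(response_body)
    return [
        (hit["product_id"], hit["similarity_score"], hit["metadata"].get("name"))
        for hit in data["results"]
    ]

def search_by_url(image_url, token, limit=10, timeout=5.0):
    """Send the request and summarise the hits (performs a network call)."""
    request = build_search_request(image_url, token, limit)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return summarise_hits(response.read().decode("utf-8"))

# The documented response, trimmed to one hit for illustration.
SAMPLE_RESPONSE = """
{"query_time_ms": 142, "count": 1, "results": [
  {"product_id": "SKU-4821", "similarity_score": 0.87,
   "metadata": {"name": "Marla Tote", "price": 189, "category": "Handbags"}}
]}
"""

print(summarise_hits(SAMPLE_RESPONSE))  # [('SKU-4821', 0.87, 'Marla Tote')]
```

Keeping request construction separate from the network call makes the payload easy to unit-test without a token.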
What Trooply does on the server side of that call
Between curl and the response, five things happen in sequence:
- Download & normalise the image. We fetch the URL, compress it, strip the background when useful, and resize to the model's 224×224 input.
- Encode with CLIP. The image becomes a 768-dimensional float vector on the GPU. ~20 ms on an RTX 5070 Ti.
- Pre-filter. If you passed filters (category, price range, in-stock, any custom field), we apply those as a structural filter to narrow the candidate set before the vector search runs.
- Vector retrieval from Qdrant. Each tenant has its own Qdrant collection. ANN search returns the top ~30 candidates.
- Re-rank. CLIP similarity alone isn't the last word. We blend it with colour-histogram overlap, aspect-ratio distance, category-match score, per-product popularity, and (if configured) merchandising rules. Per-category scoring profiles pick different weight sets for, say, apparel vs electronics.
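The re-rank step can be pictured as a weighted blend of component scores. The sketch below is one plausible shape for it, not the production scorer: the linear formula, the component names, the weights, and the profile table are all invented for illustration, and each component is assumed to be pre-scaled to 0..1 with higher meaning better (so aspect-ratio distance is passed in as a closeness score).

```python
# Hypothetical weight sets; real profiles and values are not documented here.
DEFAULT_WEIGHTS = {"clip": 0.6, "colour": 0.15, "aspect": 0.05, "category": 0.1, "popularity": 0.1}
PROFILES = {
    "apparel": {"clip": 0.5, "colour": 0.3, "aspect": 0.05, "category": 0.1, "popularity": 0.05},
}

def blended_score(components, weights):
    """Weighted average of component scores; missing components count as 0."""
    total_weight = sum(weights.values())
    return sum(w * components.get(name, 0.0) for name, w in weights.items()) / total_weight

def rerank(candidates, category=None):
    """Sort candidates by blended score, using a per-category profile if one exists."""
    weights = PROFILES.get(category, DEFAULT_WEIGHTS)
    return sorted(
        candidates,
        key=lambda c: blended_score(c["components"], weights),
        reverse=True,
    )

candidates = [
    {"product_id": "SKU-A",
     "components": {"clip": 0.90, "colour": 0.2, "aspect": 0.9, "category": 1.0, "popularity": 0.1}},
    {"product_id": "SKU-B",
     "components": {"clip": 0.85, "colour": 0.9, "aspect": 0.9, "category": 1.0, "popularity": 0.5}},
]

# SKU-A has the higher CLIP score, but SKU-B wins once colour and popularity count.
print([c["product_id"] for c in rerank(candidates)])  # ['SKU-B', 'SKU-A']
```

The example shows why CLIP similarity alone isn't the last word: a slightly weaker visual match can outrank a stronger one when the other signals agree.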
On every result we also return a _match_breakdown inside metadata with the raw component scores and a one-line human explanation like "Strong match (87%): your query looks like a handbag; dominant colours overlap (red, black)." The portal's Match Analysis modal reads this to explain why the match ranked where it did.
Beyond a single photo
Once you've used /v1/search/url, three related search modes cover slightly different shopper behaviours.
Crop: isolate an object before searching
A shopper uploads a street-style photo with a handbag, jacket, and sneakers all visible. If you care about matching just the bag, pass a bounding box alongside the image:
POST /v1/search/crop
{
"image_url": "https://cdn.shop.com/street.jpg",
"bbox": [120, 340, 480, 640]
}
We crop to the region, embed that crop, and run the full pipeline. This is the right endpoint for "search like this patch of the photo" UX.
Multi-image: average several references
Visual-inspiration shoppers pin ideas, not single images. Give us 2–5 references and we'll embed each, average the vectors, and search against the centroid. That suppresses one-off noise (a weird pose, odd lighting) and surfaces products that match the underlying theme.
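As a rough sketch of the averaging step, the function below takes the element-wise mean of 2 to 5 reference vectors. The function name and the toy 3-dimensional vectors are assumptions; whether the real pipeline normalises the vectors before or after averaging isn't stated, so this version takes a plain mean.

```python
def centroid(vectors):
    """Element-wise mean of 2-5 equal-length embedding vectors."""
    if not 2 <= len(vectors) <= 5:
        raise ValueError("expected between 2 and 5 reference vectors")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Three toy references: two agree on the theme, one has an outlier in the last slot.
references = [
    [0.8, 0.2, 0.0],
    [0.8, 0.2, 0.0],
    [0.8, 0.2, 0.9],  # e.g. odd lighting pushing one dimension up
]

# The outlier's 0.9 is diluted to 0.3, while the shared dimensions stay put.
print(centroid(references))
```

The dimensions the references agree on survive the average, while a quirk present in only one image is diluted, which is the noise-suppression effect described above.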
Fusion: image + text in one query
"Like this, but cheaper and in blue." Pass both an image URL and a query string plus an image_weight (0–1). We interpolate between the image embedding and the text embedding, search once, and the weight controls which side wins when they disagree.
POST /v1/search/fusion
{
"image_url": "https://cdn.shop.com/jacket.jpg",
"query": "blue",
"image_weight": 0.7
}
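One way to picture the interpolation is a per-dimension linear mix, sketched below. The linear formula and the function name are assumptions (the text above only says the two embeddings are interpolated), and the 2-dimensional vectors are toy values.

```python
def fuse(image_vec, text_vec, image_weight):
    """Linear interpolation: image_weight * image + (1 - image_weight) * text."""
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be between 0 and 1")
    if len(image_vec) != len(text_vec):
        raise ValueError("embeddings must have the same length")
    text_weight = 1.0 - image_weight
    return [image_weight * i + text_weight * t for i, t in zip(image_vec, text_vec)]

image_vec = [1.0, 0.0]  # toy stand-in for the jacket photo embedding
text_vec = [0.0, 1.0]   # toy stand-in for the embedding of "blue"

# image_weight=0.7 keeps the query mostly anchored to the photo.
print(fuse(image_vec, text_vec, 0.7))
# At the extremes the fused query collapses to one side entirely.
print(fuse(image_vec, text_vec, 1.0) == image_vec)
```

Under this reading, an image_weight near 1 behaves like plain image search and a weight near 0 behaves like plain text search, with the values in between trading one off against the other.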
Why this beats old-school image search
Classical image search matched on pixels — colour histograms, edge detectors, hand-crafted features. It worked for exact-match lookups (near-duplicates, reverse image search) and broke down on almost anything else: a photo of a handbag in a cluttered frame matched random cluttered frames.
CLIP understands concepts, not pixels. That means it generalises across backgrounds, lighting, and even stylised illustrations. A shopper's phone photo of a retail shelf resolves to the right product; a product's catalogue shot on a white background resolves to the same product; a hand-drawn sketch of the same item gets close. You do not need a duplicate of the exact image in your catalogue; you need a conceptually similar one. That is a much looser matching requirement than pixel-level search imposes.
Where to go next
Once your catalogue is indexed, the interesting questions become about control: how to pin a SKU during a campaign, how to boost certain brands, how to suppress out-of-stock items, and how to explain a low-confidence result. That's the job of the merchandising engine: read "Merchandising without the ticket queue" next.