A developer created a market map of every company with a Wikipedia article by running all 7.5 million English articles through an embedding model. The embeddings made it possible to cluster companies by semantic similarity and even to identify companies automatically using a calculated "company-ness" vector, a novel approach that goes beyond manual categorization.
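
A minimal sketch of how such a "company-ness" score might be computed, assuming article embeddings are already available; the seed indices, dimensions, and the mean-difference direction below are illustrative, not the developer's actual method:

```python
import numpy as np

# Placeholder inputs: rows are article embeddings, normalized to unit length.
rng = np.random.default_rng(0)
article_embeddings = rng.normal(size=(1000, 384))
article_embeddings /= np.linalg.norm(article_embeddings, axis=1, keepdims=True)

# Assume we know a few articles that are definitely companies and a few that
# are definitely not (both index lists are placeholders here).
company_idx = [1, 5, 42, 99]
non_company_idx = [7, 13, 250, 600]

# One simple "company-ness" direction: mean company vector minus mean non-company vector.
direction = (article_embeddings[company_idx].mean(axis=0)
             - article_embeddings[non_company_idx].mean(axis=0))
direction /= np.linalg.norm(direction)

# Score every article by its projection onto that direction; high scores look "company-like".
scores = article_embeddings @ direction
top_candidates = np.argsort(-scores)[:20]
print(top_candidates)
```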

Related Insights

To move beyond keyword search in their media archive, Tim McLear's system generates two vector embeddings for each asset: one from the image thumbnail and another from its AI-generated text description. Fusing these enables a powerful semantic search that understands visual similarity and conceptual relationships, not just exact text matches.
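
A rough sketch of the dual-embedding idea, assuming off-the-shelf models (clip-ViT-B-32 for thumbnails, all-MiniLM-L6-v2 for descriptions) and simple concatenation as the fusion step; the source doesn't specify the actual models or fusion strategy:

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

# Two off-the-shelf models stand in for whatever the real pipeline uses.
image_model = SentenceTransformer("clip-ViT-B-32")      # embeds the thumbnail
text_model = SentenceTransformer("all-MiniLM-L6-v2")    # embeds the AI-generated description

def embed_asset(thumbnail_path: str, description: str) -> np.ndarray:
    """Return a single fused vector for one media asset."""
    img_vec = image_model.encode(Image.open(thumbnail_path), normalize_embeddings=True)
    txt_vec = text_model.encode(description, normalize_embeddings=True)
    # Simple fusion: concatenate the two normalized vectors.
    return np.concatenate([img_vec, txt_vec])

def embed_query(query: str) -> np.ndarray:
    """Embed a text query into the same fused space (CLIP's text encoder covers the image half)."""
    return np.concatenate([
        image_model.encode(query, normalize_embeddings=True),
        text_model.encode(query, normalize_embeddings=True),
    ])

asset = embed_asset("thumb_001.jpg", "A drone shot of a coastal wind farm at sunset")
query = embed_query("offshore renewable energy footage")
print(float(asset @ query))  # sum of the two cosine similarities, used as the search score
```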

Managed vector databases are convenient, but building a search engine from scratch using a library like FAISS provides a deeper understanding of index types, latency tuning, and memory trade-offs, which is crucial for optimizing AI systems.
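
As an illustration of those trade-offs, here is a small FAISS sketch comparing an exact flat index with an approximate IVF index; the dimension, corpus size, nlist, and nprobe values are placeholders to experiment with:

```python
import numpy as np
import faiss

d = 128                      # embedding dimension
rng = np.random.default_rng(42)
corpus = rng.normal(size=(100_000, d)).astype("float32")
queries = rng.normal(size=(5, d)).astype("float32")

# Exact search: a flat index scans every vector — best recall, highest latency.
flat = faiss.IndexFlatL2(d)
flat.add(corpus)

# Approximate search: IVF partitions the corpus into clusters and probes only a few per query.
nlist = 1024                           # number of clusters
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf.train(corpus)                      # k-means on the corpus to learn the partitioning
ivf.add(corpus)
ivf.nprobe = 16                        # clusters probed per query: raise for recall, lower for speed

for name, index in [("flat", flat), ("ivf", ivf)]:
    distances, ids = index.search(queries, 5)
    print(name, ids[0])
```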

Instead of just grouping similar news stories, Kevin Rose created an AI-powered "Gravity Engine." This system scores content clusters on qualitative dimensions like "Industry Impact," "Novelty," and "Builder Relevance," providing a sophisticated editorial layer to surface what truly matters.
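
The source doesn't describe the scoring mechanism beyond naming the dimensions, but one plausible sketch is to have an LLM rate each cluster on those dimensions; the prompt, model name, and JSON handling below are assumptions, not Rose's implementation:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; the model below is a placeholder

SCORING_PROMPT = """Score the following cluster of news headlines from 1-10 on each dimension:
- industry_impact
- novelty
- builder_relevance
Return only JSON, e.g. {{"industry_impact": 7, "novelty": 4, "builder_relevance": 9}}.

Headlines:
{headlines}"""

def score_cluster(headlines: list[str]) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": SCORING_PROMPT.format(headlines="\n".join(headlines))}],
    )
    # Assumes the model returns bare JSON as instructed.
    return json.loads(response.choices[0].message.content)

scores = score_cluster([
    "OpenAI releases new reasoning model",
    "Startup funding for AI agents doubles quarter over quarter",
])
print(scores)
```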

The original Semantic Web required creators to manually add structured metadata. Now, AI models extract that meaning from unstructured content, creating a machine-readable web through brute-force interpretation rather than voluntary participation.

For decades, the goal was a 'semantic web' with structured data for machines. Modern AI models achieve the same outcome by being so effective at understanding human-centric, unstructured web pages that they can extract meaning without needing special formatting. This is a major unlock for web automation.

Vector search excels at semantic meaning but fails on precise keywords like product SKUs. Effective enterprise search requires a hybrid system that combines the strengths of lexical search (e.g., BM25) for keywords with vector search for concepts, so that both exact-match and conceptual queries return accurate results.
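
A toy sketch of one common way to combine the two signals, using rank_bm25 for the lexical side, a sentence-transformers model for the semantic side, and reciprocal rank fusion to merge the rankings; all three choices are illustrative rather than prescriptive:

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "SKU 88-1042 stainless steel hex bolt, M8",
    "How to choose fasteners for outdoor decking",
    "SKU 88-2077 galvanized wood screw, 50mm",
]
query = "SKU 88-1042"

# Lexical side: BM25 rewards exact token matches like product SKUs.
bm25 = BM25Okapi([d.lower().split() for d in docs])
lexical_scores = bm25.get_scores(query.lower().split())

# Semantic side: embeddings capture conceptual similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode(query, normalize_embeddings=True)
semantic_scores = doc_vecs @ query_vec

# Reciprocal rank fusion: combine the two rankings without worrying about score scales.
def rrf(rankings, k=60):
    scores = np.zeros(len(docs))
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return scores

fused = rrf([np.argsort(-lexical_scores), np.argsort(-semantic_scores)])
print(docs[int(np.argmax(fused))])
```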

Contrary to fears that AI will make Wikipedia obsolete, initial data shows AI-generated summaries link to Wikipedia at double the rate of traditional search (6% vs. 3%). While users click through less often for simple queries, Wikipedia's brand visibility and role as a foundational source are being amplified in the AI era.

The next frontier of data isn't just accessing existing databases, but creating new ones with AI. Companies are analyzing unstructured sources in creative ways—like using computer vision on satellite images to count cars in parking lots as a proxy for employee headcounts—to answer business questions that were previously impossible to solve.

Kevin Rose discovered an unexpected use for vector embeddings in his news aggregator. By analyzing the vector distance and publish times of articles on the same topic, he can detect when multiple outlets are part of a paid PR campaign, as the content is nearly identical.
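
A simplified sketch of the detection idea, with synthetic embeddings and timestamps standing in for real articles; the similarity threshold and time window are made-up values, not Rose's actual parameters:

```python
from datetime import datetime, timedelta
from itertools import combinations
import numpy as np

# Placeholder data: in a real system these would be embeddings of full articles
# from different outlets, plus their publish timestamps.
articles = [
    {"outlet": "Outlet A", "published": datetime(2024, 5, 1, 9, 0)},
    {"outlet": "Outlet B", "published": datetime(2024, 5, 1, 9, 40)},
    {"outlet": "Outlet C", "published": datetime(2024, 5, 3, 14, 0)},
]
rng = np.random.default_rng(1)
base = rng.normal(size=384)
embeddings = np.stack([
    base + rng.normal(scale=0.01, size=384),   # nearly identical to base
    base + rng.normal(scale=0.01, size=384),   # nearly identical to base
    rng.normal(size=384),                      # unrelated story
])
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

SIMILARITY_THRESHOLD = 0.95          # "nearly identical" wording
TIME_WINDOW = timedelta(hours=6)     # published suspiciously close together

for i, j in combinations(range(len(articles)), 2):
    similarity = float(embeddings[i] @ embeddings[j])
    time_gap = abs(articles[i]["published"] - articles[j]["published"])
    if similarity > SIMILARITY_THRESHOLD and time_gap < TIME_WINDOW:
        print(f"Possible coordinated placement: {articles[i]['outlet']} / {articles[j]['outlet']} "
              f"(similarity={similarity:.2f}, gap={time_gap})")
```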

YipitData had data on millions of companies but could only afford to process it for a few hundred public tickers due to high manual cleaning costs. AI and LLMs have now made it economically viable to tag and structure this messy, long-tail data at scale, creating massive new product opportunities.
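
A minimal sketch of what LLM-based tagging of messy long-tail records might look like; the raw rows, prompt, and model name are placeholders rather than YipitData's pipeline:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; the model below is a placeholder

RAW_ROWS = [
    "AMZN MKTP US*2K4T83 SEATTLE WA",
    "SQ *BLUE BOTTLE COF Oakland CA",
    "PAYPAL *GRUBHUBFOOD 402-935-7733",
]

def tag_row(raw: str) -> dict:
    """Ask the model to normalize one messy transaction string into structured fields."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Extract JSON with keys company_name, parent_company, category "
                f"from this raw transaction description: {raw!r}. Return only JSON."
            ),
        }],
    )
    # Assumes the model returns bare JSON as instructed.
    return json.loads(response.choices[0].message.content)

structured = [tag_row(row) for row in RAW_ROWS]
print(json.dumps(structured, indent=2))
```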