World's largest open-source multimodal dataset delivers 17x training efficiency, unlocking enterprise AI that connects documents, audio and video

AI models are only as good as the data they're trained on. That data generally needs to be labeled, curated and organized before models can learn from it in an effective way.

One of the big missing links in the AI ecosystem has been the availability of a large, high-quality open-source multimodal dataset. That changes today with the debut of the EMM-1 dataset, which comprises 1 billion data pairs and 100 million data groups across five modalities: text, image, video, audio and 3D point clouds. Multimodal datasets combine different types of data that AI systems can process together, mirroring how humans perceive the world using multiple senses simultaneously. These datasets enable AI systems to make richer inferences by understanding relationships across data types, rather than processing each modality in isolation.
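To make the pairing idea concrete, here is a hypothetical sketch of how a single multimodal data group might be represented, with each group contributing several cross-modal pairs. The field names and structure are illustrative only, not EMM-1's actual schema.

```python
# Hypothetical sketch of one multimodal "data group"; field names are
# illustrative, not EMM-1's actual schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataGroup:
    group_id: str
    text: Optional[str] = None              # caption or transcript
    image_path: Optional[str] = None        # path to an image file
    video_path: Optional[str] = None        # path to a video clip
    audio_path: Optional[str] = None        # path to an audio clip
    pointcloud_path: Optional[str] = None   # path to a 3D point cloud

    def pairs(self):
        """Enumerate the cross-modal pairs this group contributes."""
        present = {k: v for k, v in vars(self).items()
                   if k != "group_id" and v is not None}
        keys = sorted(present)
        return [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]

example = DataGroup(group_id="g-000001",
                    text="A forklift beeps while reversing in a warehouse.",
                    video_path="clips/forklift.mp4",
                    audio_path="clips/forklift.wav")
print(example.pairs())  # three cross-modal pairs from one group
```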

EMM-1 was developed by data labeling platform vendor Encord. The company's platform enables teams to curate, label and manage training data at scale using both automated and human-in-the-loop workflows. Alongside the dataset, Encord developed the EBind training methodology, which prioritizes data quality over raw computational scale. The approach enabled a compact 1.8 billion parameter model to match the performance of models up to 17 times larger while slashing training time from days to hours on a single GPU rather than GPU clusters.

"The big trick for us was to really focus on the data and to make the data very, very high quality," Encord Co-Founder and CEO Eric Landau told VentureBeat in an exclusive interview. "We were able to get to the same level of performance as models 20 times larger, not because we were super clever on the architecture, but because we trained it with really good data overall."

The data quality advantage

Encord's dataset is 100 times larger than the next comparable multimodal dataset, according to Landau. It operates at petabyte scale with terabytes of raw data and over 1 million human annotations.

But scale alone doesn't explain the performance gains. The technical innovation centers on addressing what Landau calls an "under-appreciated" problem in AI training: data leakage between training and evaluation sets.

"The leakage problem was one which we spent a lot of time on," Landau explained. "In a lot of data sets, there is a kind of leakage between different subsets of the data. Leakage actually boosts your results. It makes your evaluations look better. But it's one thing that we were quite diligent about."

Data leakage occurs when information from test data inadvertently appears in training data, artificially inflating model performance metrics. Many benchmark datasets suffer from this contamination. Encord deployed hierarchical clustering techniques to ensure clean separation while maintaining representative distribution across data types. The company also used clustering to address bias and ensure diverse representation.
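As a rough illustration of the idea, the sketch below shows a leakage-aware split that clusters similar items first and then assigns whole clusters to either train or test, so near-duplicates never straddle the boundary. It uses scikit-learn's hierarchical clustering as a stand-in; Encord's actual pipeline, features and thresholds are not described in detail here.

```python
# A minimal sketch of leakage-aware splitting, assuming items are already
# embedded as vectors: cluster similar items, then assign whole clusters
# to train or test so near-duplicates never end up on both sides.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_aware_split(embeddings: np.ndarray, test_frac: float = 0.1, seed: int = 0):
    # Hierarchical clustering with a distance cutoff instead of a fixed k
    clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0)
    labels = clustering.fit_predict(embeddings)

    rng = np.random.default_rng(seed)
    cluster_ids = rng.permutation(np.unique(labels))

    # Greedily move whole clusters into the test set until it is large enough
    test_clusters, n_test = set(), 0
    for cid in cluster_ids:
        if n_test >= test_frac * len(embeddings):
            break
        test_clusters.add(cid)
        n_test += int((labels == cid).sum())

    test_mask = np.isin(labels, list(test_clusters))
    return np.where(~test_mask)[0], np.where(test_mask)[0]

# Toy usage: 200 random 32-dimensional embeddings
train_idx, test_idx = cluster_aware_split(np.random.rand(200, 32))
print(len(train_idx), len(test_idx))
```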

How EBind boosts efficiency

The data quality improvements work in tandem with an architectural approach designed for efficiency.

Encord's EBind extends the CLIP (Contrastive Language-Image Pre-training) approach, originally developed by OpenAI, from two modalities to five. Where CLIP learns to associate images and text in a shared representation space, enabling tasks like searching for images with text descriptions, EBind does the same across images, text, audio, 3D point clouds and video.

The architectural choice prioritizes parameter efficiency. Rather than deploying separate specialized models for each modality pair, EBind uses a single base model with one encoder per modality.

"Other methodologies, what they do is they use a bunch of different models, and they route to the best model for embedding these pairs, so they tend to explode in the number of parameters," Landau said. "We found we could use a single base model and just train one encoder per modality, so keeping it very simple and very parameter efficient, if we fed that overall architecture really, really good data."

The resulting model rivals OmniBind, a much larger competitor in the multimodal space, but requires dramatically fewer computational resources for both training and inference. This makes EBind deployable in resource-constrained environments including edge devices for robotics and autonomous systems.

The enterprise value of a multimodal dataset

Multimodal models enable enterprise use cases that span different data types.

Most organizations store different data types in separate systems: documents in content management platforms, audio recordings in communication tools, training videos in learning management systems and structured data in databases. Multimodal models can search and retrieve across all of these simultaneously.

"Enterprises have all different types of data. They don't just have documents. They have audio recordings, and they have training videos, and they have CSV files," Landau said. "Let's say you're a lawyer and you have a case file that has video evidence and also documents and recordings, and it's all scattered across a lot of silos of data. You can use EBind to pick all of the relevant data and bundle together to search and surface the right data much quicker than you would have before."

The same principle applies across verticals. Healthcare providers can link patient imaging data to clinical notes and diagnostic audio. Financial services firms can connect transaction records to compliance call recordings and customer communications. Manufacturing operations can tie equipment sensor data to maintenance video logs and inspection reports.

Beyond office environments, physical AI represents another frontier. Landau highlighted autonomous vehicles that benefit from both visual perception and audio cues like emergency sirens. In manufacturing and warehousing, robots that combine visual recognition with audio feedback and spatial awareness can operate more safely and effectively than vision-only systems.

Enterprise use case: Extending computer vision with multimodal context

Captur AI, an Encord customer, illustrates how companies are planning to use the dataset for specific business applications. The startup provides on-device image verification for mobile apps, validating photos in real-time for authenticity, compliance and quality before upload. The company works with shared mobility providers like Lime and delivery companies capturing billions of package photos.

Captur AI processes over 100 million images on-device and specializes in distilling models to 6-10 megabytes so they can run on smartphones without cloud connectivity. But CEO Charlotte Bax sees multimodal capabilities as critical for expanding into higher-value use cases.

"The market for us is massive. You submit photos for returns and retails. You submit photos to insurance companies for claims. You submit photos when you're listing something on eBay," Bax told VentureBeat in an exclusive interview. "Some of those use cases are very high risk or high value if something goes wrong, like insurance, the image only captures part of the context and audio can be an important signal."

Bax cited digital vehicle inspections as a prime example. When customers photograph vehicle damage for insurance claims, they often describe what happened verbally while capturing images. Audio context can significantly improve claim accuracy and reduce fraud.

"As you're doing that, oftentimes the customer is actually describing what's happened," Bax said. "A few of our potential prospects in InsurTech have asked us if we can actually do audio as well, because then that adds this additional bit of context for the user who's submitting the claim."

The challenge lies in maintaining Captur AI's core advantage: running models efficiently on-device rather than requiring cloud processing. The company plans to use Encord's dataset to train compact multimodal models that preserve real-time, offline capabilities while adding audio and sequential image context.

"The most important thing you can do is try and get as much context as possible," Bax said. "Can you get LLMs to be small enough to run on a device within the next three years, or can you run multimodal models on the device? Solving data quality before image upload is the interesting frontier."

What this means for enterprises

Encord's results challenge fundamental assumptions about AI development and suggest that the next competitive battleground may be data operations rather than infrastructure scale.

Multimodal datasets unlock new capabilities. The ability to train models that understand relationships across data types opens use cases that single-modality systems cannot address.

Data operations deserve equal investment with compute infrastructure. The 17x parameter efficiency gain from better data curation represents orders of magnitude in cost savings. Organizations pouring resources into GPU clusters while treating data quality as an afterthought may be optimizing the wrong variable.

For enterprises building multimodal AI systems, Landau's assessment captures the strategic shift.

 "We were able to get to the same level of performance as models much  larger, not because we were super clever on the architecture, but because we trained it with really good data overall," he said.


