Open-Source LLMs for Search

We are looking for open-source, non-proprietary Large Language Models (LLMs) that are designed for search tasks. Note that while many LLMs can be used for search (e.g., via retrieval-augmented generation), we focus on models that are explicitly developed for search or are commonly used in that context.

Important: The models listed should be open-source (with licenses allowing use, modification, and distribution) and not proprietary.

Here’s a list of popular open-source (non-proprietary) LLMs optimized for search tasks like retrieval, ranking, and semantic understanding. These models are typically used as embedding models or retrieval/re-ranking engines in search pipelines:

Top Open-Source Embedding Models (for Semantic Search)

  1. BGE (BAAI General Embedding) Series
  • Developed by: Beijing Academy of Artificial Intelligence (BAAI)
  • Models: BAAI/bge-large-en-v1.5, bge-base-en-v1.5, bge-small-en-v1.5; BAAI/bge-m3 adds multilingual and dense + sparse + multi-vector retrieval
  • Features: Consistently near the top of the MTEB leaderboard; supports instruction-prefixed queries.
  • License: MIT
  2. GTE (General Text Embeddings)
  • Developed by: Alibaba DAMO Academy
  • Models: thenlper/gte-large, gte-base, gte-small
  • Features: Strong cross-domain performance; multilingual variants available.
  • License: Apache 2.0
  3. E5 (EmbEddings from bidirEctional Encoder rEpresentations)
  • Developed by: Microsoft
  • Models: intfloat/e5-large-v2, e5-base-v2, e5-small-v2; multilingual-e5 variants
  • Features: Trained with weakly supervised contrastive learning; strong for asymmetric (short query vs. long passage) search.
  • License: MIT
  4. Instructor
  • Developed by: HKU NLP (University of Hong Kong) and collaborators
  • Models: hkunlp/instructor-large, hkunlp/instructor-xl
  • Features: Embeddings conditioned on a task instruction (e.g., "Represent the question for retrieving supporting documents:").
  • License: Apache 2.0
  5. GTR / Sentence-T5
  • Developed by: Google
  • Models: sentence-transformers/gtr-t5-large, sentence-transformers/sentence-t5-xxl
  • Features: T5-based dual encoders; GTR is tuned specifically for retrieval.
  • License: Apache 2.0
  6. Jina Embeddings v2
  • Developed by: Jina AI
  • Models: jinaai/jina-embeddings-v2-base-en, jina-embeddings-v2-small-en
  • Features: 8K-token context length, fine-tuned for search.
  • License: Apache 2.0
  7. Nomic Embed Text v1.5
  • Developed by: Nomic AI
  • Model: nomic-ai/nomic-embed-text-v1.5
  • Features: 8192-token context; Matryoshka training allows truncating embeddings to smaller dimensions.
  • License: Apache 2.0
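
Most of these bi-encoders load directly with the sentence-transformers library (some, like the Jina and Nomic models, additionally require trust_remote_code=True). Below is a minimal usage sketch; the model ID and query prefix follow the BGE v1.5 model card, and the toy documents are placeholders:

```python
# Minimal semantic-search sketch (pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

docs = [
    "ColBERT performs late interaction over token-level embeddings.",
    "SPLADE produces sparse lexical-plus-semantic representations.",
    "BM25 is a classic lexical ranking function.",
]
# BGE v1.5 recommends prefixing queries (not documents) with this instruction.
query = "Represent this sentence for searching relevant passages: what is late interaction?"

doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# With normalized vectors, dot product equals cosine similarity.
for hit in util.semantic_search(query_emb, doc_emb, top_k=2)[0]:
    print(f'{hit["score"]:.3f}  {docs[hit["corpus_id"]]}')
```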

Specialized Retrieval & Reranking Models

  1. Contriever
  • Developed by: Meta
  • Models: facebook/contriever, facebook/contriever-msmarco
  • Use Case: Unsupervised dense retrieval via contrastive pre-training; the -msmarco variant is fine-tuned for supervised retrieval.
  • License: MIT
  2. DPR (Dense Passage Retrieval)
  • Developed by: Meta
  • Models: facebook/dpr-question_encoder-single-nq-base (plus the matching context encoder)
  • Use Case: Foundational dual-encoder for dense passage retrieval; older, but a standard baseline.
  • License: CC BY-NC 4.0 (non-commercial)
  3. ANCE (Approximate Nearest Neighbor Negative Contrastive Learning)
  • Developed by: Microsoft
  • Models: checkpoints republished on Hugging Face under the castorini organization (e.g., castorini/ance-msmarco-passage)
  • Use Case: Dense retrieval trained with hard negatives mined from an ANN index.
  • License: MIT
  4. ColBERTv2 / ColBERT-X
  • Developed by: Stanford (ColBERT-X is a multilingual extension by other groups)
  • Models: colbert-ir/colbertv2.0
  • Use Case: Late-interaction retrieval with token-level embeddings (see the MaxSim sketch after this list).
  • License: MIT
  5. SPLADE
  • Developed by: Naver Labs Europe
  • Models: naver/splade-cocondenser-ensembledistil, naver/splade_v2_max
  • Use Case: Learned sparse retrieval that combines lexical and semantic signals (sketched below); compatible with inverted indexes.
  • License: CC BY-NC-SA 4.0 (non-commercial)
  6. Cross-Encoders for Re-Ranking
  • Models:
    • cross-encoder/ms-marco-MiniLM-L-12-v2 (Sentence Transformers)
    • BAAI/bge-reranker-large
  • Use Case: Rerank first-stage search results with high precision (example below).
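
To make the "late interaction" idea concrete, here is a toy MaxSim scorer in plain PyTorch. This is an illustrative sketch with random stand-in embeddings, not the official colbert-ir API (which also handles tokenization, query/document markers, and indexing):

```python
# Toy ColBERT-style MaxSim: each query token takes the similarity of its
# best-matching document token, and the per-token maxima are summed.
import torch
import torch.nn.functional as F

def maxsim_score(query_tok: torch.Tensor, doc_tok: torch.Tensor) -> torch.Tensor:
    """query_tok: (q_len, dim), doc_tok: (d_len, dim); rows L2-normalized."""
    sim = query_tok @ doc_tok.T            # (q_len, d_len) cosine similarities
    return sim.max(dim=1).values.sum()     # max over doc tokens, sum over query tokens

q = F.normalize(torch.randn(8, 128), dim=-1)     # stand-in token embeddings
d = F.normalize(torch.randn(100, 128), dim=-1)
print(maxsim_score(q, d).item())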
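
SPLADE's sparse vectors can likewise be sketched directly from a masked-language-model head: each vocabulary term's weight is log(1 + ReLU(logit)), max-pooled over the sequence. This follows the published formulation but is a sketch, not an official inference script:

```python
# SPLADE-style sparse term weights from MLM logits (pip install transformers torch).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "naver/splade-cocondenser-ensembledistil"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def splade_vector(text: str) -> torch.Tensor:
    batch = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits                       # (1, seq_len, vocab)
    # Log-saturated ReLU, masked to ignore padding, max-pooled over tokens.
    w = torch.log1p(torch.relu(logits)) * batch["attention_mask"].unsqueeze(-1)
    return w.max(dim=1).values.squeeze(0)                    # (vocab_size,)

vec = splade_vector("open-source models for search")
top = vec.topk(8)
print([(tok.convert_ids_to_tokens(int(i)), round(float(v), 2))
       for v, i in zip(top.values, top.indices)])
```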
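
Finally, a reranking sketch with a cross-encoder, which jointly encodes each (query, passage) pair; the candidate passages here are placeholders standing in for first-stage retrieval output:

```python
# Retrieve-then-rerank: score (query, passage) pairs with a cross-encoder.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

query = "how do cross-encoders differ from bi-encoders?"
candidates = [  # in practice: top-k passages from a bi-encoder or BM25
    "Cross-encoders jointly encode the query and passage to produce a relevance score.",
    "Bi-encoders embed query and passage independently and compare vectors.",
    "BM25 relies on exact term matching and term statistics.",
]
scores = reranker.predict([(query, passage) for passage in candidates])
for passage, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {passage}")
```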

Frameworks/Libraries for Search

  • Sentence Transformers:
    Library for training and evaluating embedding models (https://github.com/UKPLab/sentence-transformers). Includes many search-optimized models such as all-mpnet-base-v2, multi-qa-mpnet-base-dot-v1, and msmarco-distilbert-base-v4.
  • HyDE (Hypothetical Document Embeddings):
    Technique that uses a generative model to write a hypothetical answer to the query, then embeds that answer for retrieval instead of the raw query (an implementation ships with LangChain; sketched below).
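
A hedged sketch of the HyDE pattern follows. The generate_hypothetical_doc function is a placeholder for any instruction-tuned LLM call; LangChain's HypotheticalDocumentEmbedder wraps the same idea:

```python
# HyDE sketch: embed a hypothetical *answer* instead of the raw query.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

def generate_hypothetical_doc(query: str) -> str:
    # Placeholder: in practice, prompt an LLM with something like
    # "Write a short passage that answers: {query}".
    return "Late interaction scores each query token against document tokens ..."

query = "what is late interaction in neural retrieval?"
hyde_vec = embedder.encode(generate_hypothetical_doc(query), normalize_embeddings=True)
# Use hyde_vec in place of the plain query embedding for nearest-neighbor search.
```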

Key Considerations

  • Licensing: Most listed models use permissive licenses (MIT, Apache 2.0), but verify each model card; SPLADE and DPR, for example, carry non-commercial Creative Commons licenses. Proprietary embedding APIs such as OpenAI's text-embedding-ada-002 are excluded by design.
  • Deployment: Most models run on CPU or GPU through Hugging Face transformers or sentence-transformers; many also ship ONNX exports for faster inference.
  • Benchmarks: Check the MTEB (Massive Text Embedding Benchmark) leaderboard for up-to-date comparisons (https://huggingface.co/spaces/mteb/leaderboard); note that it mixes open and proprietary models.

For most search applications (RAG, semantic search), BGE, GTE, or E5 are recommended starting points due to their balance of performance, speed, and scalability. Pair with a cross-encoder reranker (like bge-reranker) for high-accuracy results.