Retrieval Pipeline

Rosetta retrieval has two stages: ingest and search. Before an agent generates a response, Rosetta retrieves supporting passages from your library and the literature and sends them to the agent along with the question.

Ingest

1. Add A Source

Sources come from uploads, pasted text, account sync, or PubMed search.

2. Normalize The Payload

Rosetta normalizes the source into a RagSource record.
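As an illustration of what normalization might look like, here is a minimal sketch. The field names on `RagSource`, the `RawPayload` shape, and the `normalizePayload` helper are all assumptions made for this example; the actual Rosetta schema is not documented here.

```typescript
// Hypothetical shapes: these field names are illustrative, not Rosetta's real schema.
type SourceOrigin = "upload" | "paste" | "account-sync" | "pubmed";

interface RagSource {
  id: string;
  title: string;
  origin: SourceOrigin;
  text: string;
}

interface RawPayload {
  origin: SourceOrigin;
  title?: string;
  body?: string;
}

// Turn a loosely structured payload into a uniform record.
function normalizePayload(raw: RawPayload, id: string): RagSource {
  // Collapse runs of whitespace so later steps see uniform text.
  const text = (raw.body ?? "").replace(/\s+/g, " ").trim();
  // Fall back to a placeholder title when none was supplied.
  const title = raw.title?.trim() || "Untitled source";
  return { id, title, origin: raw.origin, text };
}
```

Whatever the origin, downstream steps then only need to handle one record shape.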

3. Extract Text When Needed

PDF uploads are parsed in the browser with pdfjs-dist. If Rosetta cannot extract usable text, the source is still stored, but it is marked so it does not enter normal text ingest.
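The "stored but excluded" behavior could be modeled roughly as follows. This sketch leaves out the pdfjs-dist extraction call itself and starts from the extracted string. The `hasUsableText` flag, the `MIN_USABLE_CHARS` threshold, and both function names are assumptions for illustration only.

```typescript
// Hypothetical record: the flag name is an assumption, not Rosetta's actual field.
interface StoredSource {
  id: string;
  text: string;
  hasUsableText: boolean;
}

// Assumed threshold: treat very short extractions (e.g. scanned PDFs) as unusable.
const MIN_USABLE_CHARS = 20;

// Always store the source, but record whether its text is usable.
function storeExtractedText(id: string, extracted: string): StoredSource {
  const text = extracted.replace(/\s+/g, " ").trim();
  return { id, text, hasUsableText: text.length >= MIN_USABLE_CHARS };
}

// Only sources flagged as usable continue into text ingest.
function eligibleForTextIngest(sources: StoredSource[]): StoredSource[] {
  return sources.filter((s) => s.hasUsableText);
}
```

Keeping the source while flagging it means the user still sees the upload in the library even though retrieval skips it.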

4. Preserve Scope

Each source is treated as either:

  • local
  • account

Account sources can be merged into the current folder when the user is signed in.
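One way to picture the merge is sketched below. The `ScopedSource` shape, the `visibleSources` name, and the rule that a local source wins over an account source with the same id are assumptions, since the source does not describe how duplicates are resolved.

```typescript
type Scope = "local" | "account";

// Hypothetical minimal shape for a scoped source.
interface ScopedSource {
  id: string;
  scope: Scope;
}

// Return the sources visible in the current folder.
// Account sources are merged in only when the user is signed in.
function visibleSources(
  local: ScopedSource[],
  account: ScopedSource[],
  signedIn: boolean
): ScopedSource[] {
  if (!signedIn) return [...local];
  // Assumption: on an id collision, keep the local copy.
  const localIds = new Set(local.map((s) => s.id));
  return [...local, ...account.filter((s) => !localIds.has(s.id))];
}
```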

Search

Search runs in this order:

  1. collect eligible text
  2. embed sources
  3. embed the query
  4. score matches
  5. return top sources
  6. build the cited prompt
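Steps 1 through 5 can be sketched as a single ranking function. The embedder is passed in as a parameter so the example runs without any external service; a real embedding call would almost certainly be asynchronous. The names `searchSources`, `Embedder`, and the default `topK` of 3 are assumptions for illustration.

```typescript
// Assumption: a synchronous embedder, injected so the sketch has no network dependency.
type Embedder = (text: string) => number[];

interface TextSource {
  id: string;
  text: string;
}

interface ScoredSource {
  source: TextSource;
  score: number;
}

// Cosine similarity: dot(a, b) / (|a| * |b|), defined as 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector lengths differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function searchSources(
  query: string,
  sources: TextSource[],
  embed: Embedder,
  topK = 3
): ScoredSource[] {
  // 1. Collect eligible text: skip sources with no usable text.
  const eligible = sources.filter((s) => s.text.trim().length > 0);
  // 2. Embed sources.
  const embedded = eligible.map((source) => ({ source, vector: embed(source.text) }));
  // 3. Embed the query.
  const queryVector = embed(query);
  // 4. Score matches.
  const scored = embedded.map(({ source, vector }) => ({
    source,
    score: cosineSimilarity(queryVector, vector),
  }));
  // 5. Return top sources, highest score first.
  return scored.sort((a, b) => b.score - a.score).slice(0, topK);
}
```

Step 6 then takes the returned sources and formats them into the prompt.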

Retrieval Defaults

The semantic search flow:

  • chunks source documents into passages
  • embeds text content with text-embedding-004
  • uses cosine similarity for matching
  • returns a small result set
  • builds the prompt with [Source N] labels
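To make the chunking and labeling steps concrete, here is a minimal sketch. The character-based chunk size, the exact prompt layout, and the names `chunkText` and `buildCitedPrompt` are assumptions; the source only states that documents are chunked and that passages carry [Source N] labels.

```typescript
// Split text into passages of at most maxChars, breaking on whitespace.
// A single word longer than maxChars becomes its own oversized chunk.
function chunkText(text: string, maxChars = 500): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  let current = "";
  for (const word of words) {
    const candidate = current ? `${current} ${word}` : word;
    if (candidate.length > maxChars && current) {
      chunks.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

interface Passage {
  title: string;
  text: string;
}

// Label each passage as [Source N] (1-based) and append the question.
function buildCitedPrompt(question: string, passages: Passage[]): string {
  const context = passages
    .map((p, i) => `[Source ${i + 1}] ${p.title}\n${p.text}`)
    .join("\n\n");
  return `${context}\n\nQuestion: ${question}`;
}
```

Numbered labels give the model a stable handle for citing a passage in its answer.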

Retrieval Controls

  • Source filters: limit retrieval to a specific library, such as local, account, or public sources.
  • Date windows: limit to a time range, for example guidelines from the last five years.
  • Relevance thresholds: drop chunks below a minimum similarity score so weak matches do not dilute the context.
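The three controls can be thought of as filters applied to scored candidates. In the sketch below, the `Candidate` and `RetrievalControls` shapes and the `applyControls` function are hypothetical; they show one plausible way to combine the filters, not Rosetta's actual configuration surface.

```typescript
type Library = "local" | "account" | "public";

// Hypothetical scored chunk with the metadata the filters need.
interface Candidate {
  id: string;
  library: Library;
  publishedAt: Date;
  score: number;
}

// Every control is optional; an omitted control filters nothing.
interface RetrievalControls {
  libraries?: Library[];
  publishedAfter?: Date;
  minScore?: number;
}

function applyControls(candidates: Candidate[], controls: RetrievalControls): Candidate[] {
  return candidates.filter((c) => {
    // Source filter: keep only the selected libraries.
    if (controls.libraries && !controls.libraries.includes(c.library)) return false;
    // Date window: drop anything published before the cutoff.
    if (controls.publishedAfter && c.publishedAt.getTime() < controls.publishedAfter.getTime()) {
      return false;
    }
    // Relevance threshold: drop weak matches.
    if (controls.minScore !== undefined && c.score < controls.minScore) return false;
    return true;
  });
}
```

For example, passing `{ libraries: ["public"], minScore: 0.5 }` would keep only public chunks that score at least 0.5.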

PubMed Path

PubMed follows a parallel path:

  1. Rosetta searches PubMed
  2. loads summaries and abstracts
  3. converts articles into RAG sources
  4. lets the user add the selected articles to the active context
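Steps 3 and 4 might look something like the sketch below, which starts from articles that have already been fetched. The `PubMedArticle` shape, the `pubmed:` id prefix, and both function names are assumptions for illustration; the PubMed request itself is out of scope here.

```typescript
// Hypothetical shape of an already-fetched article.
interface PubMedArticle {
  pmid: string;
  title: string;
  abstract?: string;
}

// Hypothetical RAG source record for a PubMed article.
interface PubMedRagSource {
  id: string;
  title: string;
  origin: "pubmed";
  text: string;
}

// Step 3: convert an article into a RAG source, using the abstract as its text.
function articleToSource(article: PubMedArticle): PubMedRagSource {
  return {
    id: `pubmed:${article.pmid}`,
    title: article.title,
    origin: "pubmed",
    text: (article.abstract ?? "").trim(),
  };
}

// Step 4: add only the articles the user selected, skipping ones already in context.
function addSelectedToContext(
  context: PubMedRagSource[],
  articles: PubMedArticle[],
  selectedPmids: Set<string>
): PubMedRagSource[] {
  const existingIds = new Set(context.map((s) => s.id));
  const additions = articles
    .filter((a) => selectedPmids.has(a.pmid))
    .map(articleToSource)
    .filter((s) => !existingIds.has(s.id));
  return [...context, ...additions];
}
```

An article with no abstract ends up with empty text, so it would not contribute to retrieval until usable text is available.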

Practical Implication

If you want the retriever to use a source, make sure Rosetta has usable text for it and that the source is included in ingest.
