Retrieval Pipeline
Rosetta retrieval has two stages: ingest and search. Before an agent generates a response, Rosetta pulls supporting passages from your library and the literature, then sends those passages to the agent along with the question.
Ingest
1. Add A Source
Sources come from uploads, pasted text, account sync, or PubMed search.
2. Normalize The Payload
Rosetta normalizes the source into a RagSource record.
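As a rough illustration of what normalization could look like, the sketch below maps a raw payload onto a single record shape. The field names, the `SourceOrigin` values, and the `normalizeSource` helper are assumptions made for this example; the actual RagSource schema is not documented on this page.

```typescript
// Hypothetical shapes: the real RagSource fields may differ.
type SourceOrigin = "upload" | "paste" | "account-sync" | "pubmed";

interface RawSourceInput {
  origin: SourceOrigin;
  title?: string;
  text?: string;
}

interface RagSource {
  id: string;
  origin: SourceOrigin;
  title: string;
  text: string;
  hasUsableText: boolean;
}

// Collapse whitespace runs so equivalent inputs normalize to the same text.
function cleanText(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

// Turn any raw payload into one consistent record, filling in defaults.
function normalizeSource(input: RawSourceInput, id: string): RagSource {
  const text = cleanText(input.text ?? "");
  const title = cleanText(input.title ?? "") || "Untitled source";
  return {
    id,
    origin: input.origin,
    title,
    text,
    hasUsableText: text.length > 0,
  };
}
```

Whatever the real schema is, the point of this step is that later stages only ever see one record shape, regardless of how the source arrived.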
3. Extract Text When Needed
PDF uploads are parsed in the browser with pdfjs-dist. If Rosetta cannot extract usable text, the source is still stored, but it is marked so it does not enter normal text ingest.
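The "stored but not ingested" behavior can be sketched as a simple check on the extracted text. In pdfjs-dist, per-page text is typically collected with `getDocument`, `getPage`, and `getTextContent`; the sketch below starts after that step and takes the page strings as input. The `ingestable` flag, the `MIN_USABLE_CHARS` threshold, and the `storeExtractedPdf` helper are assumptions for illustration, not Rosetta's actual names or values.

```typescript
// Hypothetical threshold, chosen only for this example.
const MIN_USABLE_CHARS = 40;

interface StoredSource {
  id: string;
  text: string;
  ingestable: boolean; // false = stored, but skipped by normal text ingest
}

// Join per-page text and decide whether enough real text came out
// of the PDF to be worth sending through text ingest.
function storeExtractedPdf(id: string, pageTexts: string[]): StoredSource {
  const text = pageTexts
    .map((page) => page.replace(/\s+/g, " ").trim())
    .filter((page) => page.length > 0)
    .join("\n");
  // The source is always returned (stored); only the flag changes.
  return { id, text, ingestable: text.length >= MIN_USABLE_CHARS };
}
```

A scanned PDF with no text layer would come back with `ingestable: false` under this sketch, which matches the behavior described above: the source exists in the library but contributes nothing to retrieval.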
4. Preserve Scope
Each source is treated as either:
- local
- account
Account sources can be merged into the current folder when the user is signed in.
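One way to picture scope handling is a merge that only includes account sources for a signed-in user. This is a minimal sketch: the `ScopedSource` shape, the `visibleSources` helper, and the rule that a local copy wins on an id collision are all assumptions for illustration.

```typescript
type Scope = "local" | "account";

interface ScopedSource {
  id: string;
  scope: Scope;
  title: string;
}

// Return the sources visible in the current folder.
// Account sources are merged in only when the user is signed in.
// On an id collision the local copy wins (an arbitrary choice here).
function visibleSources(
  local: ScopedSource[],
  account: ScopedSource[],
  signedIn: boolean
): ScopedSource[] {
  if (!signedIn) return local.slice();
  const localIds = local.map((s) => s.id);
  const extra = account.filter((s) => localIds.indexOf(s.id) === -1);
  return local.concat(extra);
}
```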
Search
Search runs in this order:
- collect eligible text
- embed sources
- embed the query
- score matches
- return top sources
- build the cited prompt
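The order above can be sketched as one function with a step per list item. The embedding and similarity functions are passed in as parameters so the sketch stays synchronous and independent of any provider; a real embedding call would be asynchronous. All names here (`runSearch`, `SearchSource`, `ingestable`, the `Question:` suffix) are assumptions for illustration.

```typescript
interface SearchSource {
  id: string;
  text: string;
  ingestable: boolean;
}

interface SearchMatch {
  source: SearchSource;
  score: number;
}

type EmbedFn = (text: string) => number[];
type SimilarityFn = (a: number[], b: number[]) => number;

function runSearch(
  query: string,
  sources: SearchSource[],
  embed: EmbedFn,
  similarity: SimilarityFn,
  topK: number = 3
): { top: SearchMatch[]; prompt: string } {
  // 1. Collect eligible text.
  const eligible = sources.filter((s) => s.ingestable && s.text.trim().length > 0);
  // 2. Embed sources.
  const sourceVectors = eligible.map((s) => embed(s.text));
  // 3. Embed the query.
  const queryVector = embed(query);
  // 4. Score matches.
  const scored: SearchMatch[] = eligible.map((source, i) => ({
    source,
    score: similarity(queryVector, sourceVectors[i] ?? []),
  }));
  // 5. Return top sources (highest score first).
  const top = scored.slice().sort((a, b) => b.score - a.score).slice(0, topK);
  // 6. Build the cited prompt.
  const context = top.map((m, i) => `[Source ${i + 1}] ${m.source.text}`).join("\n\n");
  return { top, prompt: `${context}\n\nQuestion: ${query}` };
}
```

Because eligibility is checked first, a source without usable text never reaches the embedding step, so it cannot appear in the prompt.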
Retrieval Defaults
The semantic search flow:
- chunks source documents into passages
- embeds text content with text-embedding-004
- uses cosine similarity for matching
- returns a small result set
- builds the prompt with [Source N] labels
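Two of these defaults, chunking and cosine similarity, are easy to show in isolation. The sketch below is illustrative only: the word-based chunking strategy, the `maxWords` and `overlap` values, and the function names are assumptions, since this page does not specify chunk sizes. The embedding call itself is left out because it depends on the provider.

```typescript
// Split text into passages of at most `maxWords` words, where
// neighboring passages share `overlap` words of context.
function chunkText(text: string, maxWords: number = 120, overlap: number = 20): string[] {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return [];
  const step = Math.max(1, maxWords - overlap);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxWords).join(" "));
    // Stop once this chunk has reached the end of the text.
    if (start + maxWords >= words.length) break;
  }
  return chunks;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Returns 0 when either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  const n = Math.min(a.length, b.length);
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < n; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

Cosine similarity compares direction rather than magnitude, so a long passage and a short query can still match closely if their embeddings point the same way.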
Retrieval Controls
- Source filters: limit retrieval to a specific library, such as local, account, or public sources.
- Date windows: limit to a time range, for example guidelines from the last five years.
- Relevance thresholds: drop chunks below a minimum similarity score so weak matches do not dilute the context.
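The three controls compose naturally as a single filter over scored chunks. This is a sketch under assumed names: `ScoredChunk`, `RetrievalControls`, and `applyControls` are hypothetical, and dates are assumed to be ISO `YYYY-MM-DD` strings so they can be compared as text.

```typescript
type Library = "local" | "account" | "public";

interface ScoredChunk {
  text: string;
  library: Library;
  publishedAt: string; // ISO date, e.g. "2021-06-01"
  score: number; // similarity score from the search step
}

interface RetrievalControls {
  libraries?: Library[]; // source filter
  from?: string; // inclusive start of the date window
  to?: string; // inclusive end of the date window
  minScore?: number; // relevance threshold
}

// Keep only chunks that pass every control that is set.
// An unset control places no restriction on the results.
function applyControls(chunks: ScoredChunk[], controls: RetrievalControls): ScoredChunk[] {
  return chunks.filter((chunk) => {
    if (controls.libraries && controls.libraries.indexOf(chunk.library) === -1) return false;
    if (controls.from !== undefined && chunk.publishedAt < controls.from) return false;
    if (controls.to !== undefined && chunk.publishedAt > controls.to) return false;
    if (controls.minScore !== undefined && chunk.score < controls.minScore) return false;
    return true;
  });
}
```

For example, passing `{ libraries: ["local", "account"], from: "2020-01-01", minScore: 0.5 }` would drop public sources, anything dated before 2020, and any chunk scoring below 0.5.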
PubMed Path
PubMed follows a parallel path:
- Rosetta searches PubMed
- loads summaries and abstracts
- converts articles into RAG sources
- lets the user add the selected articles to the active context
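The last two steps, conversion and selection, might look like the sketch below. It assumes the search and summary loading have already produced article objects; the `PubMedArticle` and `PubMedRagSource` shapes, the `pubmed:` id prefix, and both helper names are assumptions for illustration rather than Rosetta's actual types.

```typescript
interface PubMedArticle {
  pmid: string;
  title: string;
  abstract?: string;
  journal?: string;
  year?: number;
}

interface PubMedRagSource {
  id: string;
  origin: "pubmed";
  title: string;
  text: string;
  citation: string;
  hasUsableText: boolean;
}

// Convert one article into a RAG source; the abstract becomes the text.
function articleToSource(article: PubMedArticle): PubMedRagSource {
  const abstract = (article.abstract ?? "").trim();
  const year = article.year !== undefined ? String(article.year) : undefined;
  const parts = [article.journal, year, `PMID ${article.pmid}`].filter(
    (p): p is string => typeof p === "string" && p.length > 0
  );
  return {
    id: `pubmed:${article.pmid}`,
    origin: "pubmed",
    title: article.title.trim(),
    text: abstract,
    citation: parts.join(", "),
    hasUsableText: abstract.length > 0,
  };
}

// Add only the articles the user selected, skipping any already in context.
function addSelected(
  context: PubMedRagSource[],
  articles: PubMedArticle[],
  selectedPmids: string[]
): PubMedRagSource[] {
  const existingIds = context.map((s) => s.id);
  const added = articles
    .filter((a) => selectedPmids.indexOf(a.pmid) !== -1)
    .map((a) => articleToSource(a))
    .filter((s) => existingIds.indexOf(s.id) === -1);
  return context.concat(added);
}
```

An article without an abstract still converts under this sketch, but with `hasUsableText: false`, so it would follow the same rule as any other source that lacks usable text.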
Practical Implication
If you want the retriever to use a source, make sure Rosetta has usable text for it and that the source is included in ingest.