Local for reading, cloud for thinking
Korely uses two kinds of AI under the hood. Small local models handle the work that runs on every save and query (embeddings, entity extraction, transcription). One cloud model handles the work that runs once per user action (chat, meeting recaps, query rewriting). Free is fully local. Pro adds the cloud reasoning model. Today that slot is filled by Gemini Flash, one of a wider family compared further down.
The split: reading vs thinking
Think of staffing a small hotel. The receptionist is fast and handles every guest who walks through the door, dozens of small interactions a day. The concierge is slower but knows the city like the back of their hand, and handles a few tricky requests a day. Korely's AI stack is wired the same way. Local models are the receptionist. Cloud Gemini is the concierge. Two helpers, same hotel, different jobs.
Concretely, the work splits into two buckets.
- Reading-heavy work runs hundreds of times a day on every save and query: producing an embedding for the note, extracting entities, ranking search results. Small specialised models do this on the CPU in tens of milliseconds.
- Reasoning-heavy work runs once per user action: writing a meeting recap, answering a question over the vault, rewriting a vague query into a precise one. Large general models do this in the cloud in a second or two.
The two halves of the stack do not overlap. The same note you embed locally is the input to the recap that runs in the cloud, but each model is doing the task it's best at.
We thought hard about running the reasoning model locally too. The verdict: today's local LLMs are good enough to answer simple questions, but they still trail a cloud model like Gemini Flash by a wide margin on summarisation and multi-step reasoning. Shipping a reasoning model inside the desktop bundle also carries real costs (download size, RAM at idle, latency on older laptops, battery on the move), and those costs buy you less than one cloud call per user action does. So reading-heavy work stays local, and reasoning-heavy work goes through one cloud call when Pro is on.
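The split can be pictured as a tiny routing rule. The sketch below is illustrative only: the task names, the `route` function, and the "unavailable" result for reasoning tasks on Free are assumptions made for the example, not Korely's actual code.

```python
from enum import Enum


class Task(Enum):
    # Hypothetical task names, chosen to mirror the two buckets above.
    EMBED_NOTE = "embed_note"
    EXTRACT_ENTITIES = "extract_entities"
    RANK_RESULTS = "rank_results"
    MEETING_RECAP = "meeting_recap"
    VAULT_CHAT = "vault_chat"
    QUERY_REWRITE = "query_rewrite"


# Reading-heavy work: runs on every save and query, always on the local machine.
READING_TASKS = frozenset(
    {Task.EMBED_NOTE, Task.EXTRACT_ENTITIES, Task.RANK_RESULTS}
)


def route(task: Task, pro_enabled: bool) -> str:
    """Return where a task would run: 'local', 'cloud', or 'unavailable'."""
    if task in READING_TASKS:
        return "local"  # same on Free and Pro
    if pro_enabled:
        return "cloud"  # one cloud call per user action
    # Assumption: without Pro the reasoning task simply does not run,
    # since Free makes no outbound AI calls.
    return "unavailable"


print(route(Task.EMBED_NOTE, pro_enabled=False))    # local
print(route(Task.MEETING_RECAP, pro_enabled=True))  # cloud
```

The point of the sketch is that the plan only changes the answer for reasoning-heavy tasks. Reading-heavy tasks take the same path on every plan.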
What the cloud model does in Korely today
On Pro, Korely calls Google's Gemini Flash for the reasoning tasks:
- Meeting recap. After transcription, the transcript is sent to Gemini Flash, which returns a sectioned summary: decisions, action items, open questions, notable quotes.
- Chat over your vault. When you ask Korely a question through the built-in chat, Gemini Flash reads the retrieved notes and writes the answer.
- Query rewrite. Vague queries such as "what about Q3" get expanded into more retrievable forms before hitting the search index.
- Multimodal transcription. Dropped audio and video files are processed by Gemini Flash's multimodal mode, which returns transcript, diarization, and summary in a single call.
For live meetings, Korely also calls Llama 3.3 70B, hosted on Groq, for one narrow task: inferring speaker names from self-introductions such as "Hi, I'm John". Groq runs that job with sub-second latency, so the names come back in real time while the recording is still going.
Speech-to-text is its own subsystem with its own tradeoffs (local on Free, cloud on Pro). It is documented in detail on the transcription page.
Local stack
What runs locally on every plan
The local stack is the same on Free and on Pro:
- Note embeddings via Nomic embed v1.5. Powers hybrid search.
- Entity extraction via GLiNER. Powers the knowledge graph.
- Hybrid search via SQLite FTS5 and sqlite-vec, fused with Reciprocal Rank Fusion.
- Knowledge graph stored in SQLite next to the vault. Drives the related notes panel and GraphRAG retrieval.
- Whisper for local transcription of recorded meetings and dropped audio files. Runs on CPU.
- MCP server shipping the five read tools (search, read, related, list notes, list folders), available to Claude Desktop, Cursor, Zed, and other MCP clients on the same machine.
On Free, this is the whole AI stack. Korely makes no outbound calls for AI work. Search, related notes, graph view, MCP, all of it works offline.
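To make the hybrid search step concrete, here is a minimal sketch of Reciprocal Rank Fusion merging a keyword ranking (as FTS5 might return) with a vector ranking (as sqlite-vec might return). The function name, the note ids, and the constant `k = 60` are assumptions for illustration. `k = 60` is a commonly used default for RRF, and the value Korely uses is not stated here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores 1 / (k + rank) in every list it appears in (rank starts
    at 1), and the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query.
keyword_hits = ["note_a", "note_b", "note_c"]  # full-text ranking
vector_hits = ["note_b", "note_d", "note_a"]   # embedding ranking

print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['note_b', 'note_a', 'note_d', 'note_c']
```

Notes that rank well in both lists rise to the top, which is why `note_b` beats `note_a` here even though `note_a` won the keyword ranking. RRF only needs ranks, not raw scores, so the keyword and vector results never have to be put on a common scale.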
Other cloud LLMs in the same family
Picture a row of professional concierges sitting in different hotels around the world. They all answer the same kind of tricky question, each with a slightly different style and a slightly different price list. Gemini Flash is the concierge Korely currently dials. Here is the rest of the family, in case the slot ever changes.
- Gemini Flash (Google, commercial API). The concierge Korely dials today. Strong on speed and on multimodal input (audio and video alongside text), a good fit for meeting recaps and dropped media files.
- Anthropic Claude (Anthropic, commercial API). The thoughtful concierge, often preferred for long-form writing, careful summarisation, and code work. Picture the colleague who reads the whole brief twice before answering.
- OpenAI GPT (OpenAI, commercial API). The most established concierge in the building. Wide ecosystem of tools and client libraries, often the default in third-party integrations.
- xAI Grok (xAI, commercial API). The newer concierge, designed to feel less filtered. Strong on real-time information through its integration with X.
- Mistral (Mistral AI, commercial API plus some open weights). The European concierge. Good fit when data residency in the EU matters for a customer.
Korely chose Gemini Flash for two reasons. The first is cost and speed: the Flash tier is priced for high-volume work like recapping meetings, and the latency stays low enough to feel interactive in chat. The second is multimodal input: a single Gemini call can ingest an audio or video file and return a transcript plus a recap, which keeps the meeting upload path simple. The other concierges would shine under different priorities (a vault that needs longer-form writing, or one that needs an EU-resident provider).
Free vs Pro
What Pro adds, in API terms
Pro keeps the whole local stack and layers cloud calls on top for the tasks that need a large model.
- Gemini Flash for meeting recap, chat over the vault, query rewrite, and multimodal transcription.
- Groq Llama 3.3 70B for speaker name inference during live meetings.
- Deepgram for live streaming transcription with diarization.
- Cloud sync of the vault to Korely cloud, opt-in.
- Cloud MCP endpoint, OAuth-protected, so AI tools on other machines can read your vault remotely.
You pay one subscription. Korely handles the billing for all of the above. You do not bring your own Gemini, Groq, or Deepgram API key. It works like a phone plan: one fee a month, the carrier handles every call you make, and you don't get a separate bill from each cell tower.
Only the text required for the specific task ever leaves your machine. A recap sends the transcript. A chat answer sends the retrieved note snippets and your question. The vault itself is uploaded only if you turn on cloud sync, and that's a separate setting.
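The "only what the task needs" rule can be sketched as two payload builders. Everything here is hypothetical: the `Vault` class, the function names, the field names, and the 200-character snippet length are invented for the example and do not describe Korely's real request format.

```python
from dataclasses import dataclass


@dataclass
class Vault:
    # note id -> full note text; the vault as a whole is never put in a payload
    notes: dict[str, str]


def recap_payload(transcript: str) -> dict:
    """A recap request carries the transcript and nothing else."""
    return {"task": "meeting_recap", "transcript": transcript}


def chat_payload(question: str, vault: Vault, retrieved_ids: list[str],
                 snippet_chars: int = 200) -> dict:
    """A chat request carries the question plus snippets of retrieved notes."""
    # Only notes selected by local search are touched, and only a snippet
    # of each one is included (snippet length is an assumed value).
    snippets = [vault.notes[note_id][:snippet_chars] for note_id in retrieved_ids]
    return {"task": "vault_chat", "question": question, "snippets": snippets}


vault = Vault(notes={
    "q3-plan": "Q3 plan: ship sync, hire two engineers.",
    "diary": "Private diary entry.",
})
payload = chat_payload("what about Q3", vault, retrieved_ids=["q3-plan"])
print(payload["snippets"])  # ['Q3 plan: ship sync, hire two engineers.']
```

In this sketch the `diary` note never appears in the payload, because local search did not retrieve it. Retrieval happens on the machine first, and only its results travel with the question.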
Frequently asked
Which cloud AI model does Korely use?
Gemini Flash from Google for chat, meeting summaries, query rewriting, and multimodal transcription of dropped audio and video files. Pro tier only.
Does the Free tier make any cloud AI calls?
No. Free is fully local. Whisper runs on your CPU for transcription, Nomic embed v1.5 handles search, GLiNER builds the knowledge graph. No outbound AI calls.
Do I need to bring my own API key?
No. Pro covers the cloud AI calls as part of the subscription. The billing is handled by Korely, not by your Google or OpenAI account.
What about Groq, Deepgram, and the other names I see in the app?
Korely uses Llama 3.3 70B on Groq for speaker name inference on live meetings (sub-second latency), and Deepgram for live streaming transcription with diarization on Pro. Both are part of the same Pro subscription, with no separate keys.
What gets sent to the cloud when Pro is on?
Only the text needed for the specific task. A meeting recap sends the transcript. A chat answer sends the retrieved note snippets and your question. Your full vault is never uploaded for AI processing. Cloud sync (a separate Pro feature) is what copies the vault, and even that you turn on explicitly.
Two stacks, one app
Free is fully local, forever. Pro adds the cloud calls for the work that needs them. You decide which one you want.