vasa_'s comments | Hacker News

We've done this for a while with cognee: our graph completion retrieval does that, plus many other things like weighting and self-improving feedback. https://github.com/topoteretes/cognee


The usual benchmarks for language models—Exact Match, F1, and even multi-hop QA datasets—weren’t designed to measure what matters most about persistent AI memory: connecting concepts across time, documents, and contexts.

We just completed our most extensive internal evaluation of cognee to date, using HotPotQA as a baseline. While the results showed strong gains, they also reinforced a growing realization: we need better ways to evaluate how AI memory systems actually perform.

We ran cognee through 45 evaluation cycles on 24 questions from HotPotQA, using GPT-4o for the analysis. Each step of the evaluation pipeline is affected by the inherent variance in the model's output: cognification, answer generation, and answer evaluation. We noticed especially significant variance across metrics on small runs, which is why we chose the repeated, end-to-end approach.
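Summarizing repeated end-to-end runs by their mean and spread is what makes the variance visible. A minimal sketch (the per-run accuracies below are hypothetical, not our actual numbers):

```python
import statistics

def summarize_runs(scores):
    """Mean and sample standard deviation across repeated evaluation runs."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical per-run accuracy from five end-to-end cycles:
runs = [0.62, 0.71, 0.58, 0.69, 0.66]
mean, spread = summarize_runs(runs)
print(f"{mean:.3f} ± {spread:.3f}")  # → 0.652 ± 0.053
```

With spreads this large relative to the gaps between systems, a single run can easily reorder the leaderboard, which is why single-shot comparisons are misleading.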

We compared results using the same questions and setup with:

Mem0
LightRAG
Graphiti

While EM and F1 are standard in QA, they reward surface-level overlap and miss the core value proposition of AI memory systems. For example, a syntactically perfect answer can be factually wrong, and a fuzzy-but-correct response can be penalized for missing the reference phrasing.
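To make this concrete, here is a sketch of SQuAD-style EM and token-level F1 scoring (the normalization follows the common convention of lowercasing and stripping articles and punctuation; the example answers are made up):

```python
import re
from collections import Counter

def normalize(s):
    # Lowercase, drop articles and punctuation, collapse whitespace,
    # as in common SQuAD-style answer scoring.
    s = re.sub(r"\b(a|an|the)\b", " ", s.lower())
    s = re.sub(r"[^a-z0-9 ]", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    p_toks, g_toks = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p_toks), overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)

gold = "Apple Inc."
pred = "The company founded by Steve Jobs in Cupertino"
print(exact_match(pred, gold))  # 0.0
print(f1(pred, gold))           # 0.0 — no token overlap with the reference
```

The second answer is arguably correct, yet both metrics score it at zero because they only see token overlap with the reference string.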

Another issue is that LLM judges are inconsistent.

Even HotPotQA assumes all relevant information sits neatly in two paragraphs. That's not how memory works. Real-world AI memory systems need to link information across documents, conversations, and knowledge domains in ways that traditional QA benchmarks just can't capture.

Consider the difference:

Traditional QA:

“What year was the company that acquired X founded?”

Memory Challenge:

“How do the concerns raised in last month’s security review relate to the authentication changes discussed in the architecture meeting three weeks ago?”

Only one of these tests long-term knowledge, reasoning across sources, and organizational memory—care to guess which one?

We are working on a new dataset and benchmarks to measure memory, and would love feedback!


cognee - memory tool for AI apps and agents | Remote and Berlin | Full-time | 100k EUR

At cognee we are building graph- and vector-based memory on top of vector and graph stores. Our pipelines had 116k runs last month, and we project north of a million within a few months. We need help on the infrastructure side.

Check our OSS tool here: https://github.com/topoteretes/cognee

Open Roles: Platform Engineer - https://apply.workable.com/topoteretes-ug-haftungsbeschrankt...


Hi, founder of cognee here

We are building temporal resolution mechanisms, and the framework is generalizable enough to support any custom logic. We have a few ideas, and some of this will be published soon.

cognee has a notion of node sets, which work similarly to tags: https://docs.cognee.ai/core-concepts/node-sets

We also now support per-user graphs, so you get user permissions plus graph filtering.
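The combination of tag-like node sets and per-user graph filtering can be pictured roughly like this. This is a generic, hypothetical sketch, not cognee's actual API (the `Node`, `owner`, and `visible_nodes` names are made up for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    node_sets: set = field(default_factory=set)  # tag-like membership
    owner: str = ""                              # per-user graph scoping

def visible_nodes(nodes, user, tags=None):
    """Filter graph nodes by owning user, optionally also by node-set tags."""
    out = [n for n in nodes if n.owner == user]
    if tags:
        out = [n for n in out if n.node_sets & set(tags)]
    return out

graph = [
    Node("auth-review", {"security"}, owner="alice"),
    Node("arch-meeting", {"architecture"}, owner="alice"),
    Node("billing-doc", {"finance"}, owner="bob"),
]
print([n.id for n in visible_nodes(graph, "alice", tags=["security"])])
# → ['auth-review']
```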


Neat, we are doing something similar with cognee, but we let users define graph ingestion, generation, and retrieval themselves instead of baking in assumptions: https://github.com/topoteretes/cognee


I worked with the dlt folks on exactly that: using OpenAI function calling to generate a schema for the data based on the raw data structure. You can check that work here: https://github.com/topoteretes/PromethAI-Memory It's in the level 1 folder.
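The shape of the output that approach produces can be sketched without the LLM call. This is just an illustration of deriving a JSON-Schema-style description from a raw value's structure, not the PromethAI code:

```python
def infer_schema(value):
    """Map a raw Python/JSON value to a JSON-Schema-style type description."""
    if isinstance(value, dict):
        return {"type": "object",
                "properties": {k: infer_schema(v) for k, v in value.items()}}
    if isinstance(value, list):
        # Infer the item type from the first element, if any.
        return {"type": "array", "items": infer_schema(value[0]) if value else {}}
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "integer"}
    if isinstance(value, float):
        return {"type": "number"}
    return {"type": "string"}

raw = {"user": "vasa", "scores": [0.9, 0.8], "active": True}
print(infer_schema(raw))
```

The LLM version does the same kind of mapping but can also name fields sensibly and handle messy, inconsistent records.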

