Qdrant semantic search returns a broad match instead of the contextually correct document


Post by Anonymous »

🧩 Problem description
I am working on a semantic search + fuzzy matching pipeline with Qdrant and LangChain to retrieve contextually relevant Quran verses (ayahs). The following function (retrieve_node) is designed to:

Code: Select all

1.  Run a similarity search (k=50)
2.  Extract and stem keywords
3.  Apply fuzzy Levenshtein-based scoring
4.  Combine semantic score + keyword ratio
5.  Use softmax to normalize and pick the most relevant ayah
Check out my website: bab-e-ilm.com
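The combine-and-softmax part (steps 4–5) can be sketched as follows; the score values here are made-up illustrations, not output of the real pipeline:

```python
import numpy as np

# Hypothetical per-document scores: semantic similarity + keyword ratio
semantic = np.array([0.82, 0.80, 0.65])
keyword_ratio = np.array([0.50, 0.75, 0.25])
combined = semantic + keyword_ratio

# Softmax turns the raw combined scores into a probability distribution;
# subtracting the max first keeps the exponentials numerically stable
exp = np.exp(combined - combined.max())
soft = exp / exp.sum()

best = int(np.argmax(soft))  # index of the most relevant document
```

Note that softmax preserves the ranking of the combined scores, so it only helps thresholding, not which document wins.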
When I query:

Code: Select all

what quran tells about duration of fasting in ramdan
I expect the system to return Surah 2:187, which discusses the duration and the start/end rules of fasting. However, the code consistently returns Surah 2:185, which is more general. Even though the two are semantically close, Surah 2:187 is the correct contextual match for this query.

[*] Returned: Surah 2, Ayah 185
[*] ❌ Expected: Surah 2, Ayah 187

✅ Relevant code (full function)

Code: Select all

def retrieve_node(state: AgentState):
    # Step 1: Broader recall using similarity search
    docs_with_scores = retriever.vectorstore.similarity_search_with_score(
        state["query"], k=50
    )
    docs = []
    for d, score in docs_with_scores:
        d.metadata["score"] = score  # Store the initial similarity
        docs.append(d)

    # Step 2: Prepare keywords from the query (no over-filtering)
    stop_words = {
        "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "has", "he",
        "in", "is", "it", "its", "of", "on", "that", "the", "to", "was", "were", "will",
        "with", "what", "how", "when", "where", "why", "about", "tells", "or", "but"
    }
    raw_keywords = re.findall(r'\w+', state["query"].lower())
    query_keywords = [k for k in raw_keywords if k not in stop_words and len(k) > 3]
    stemmed_query_keywords = [simple_stem(k) for k in query_keywords]

    # Step 3: Fuzzy match & combined scoring
    scored_docs = []
    for d in docs:
        if not d.metadata.get("surah"):
            continue
        content_lower = d.page_content.lower()
        content_words = set(simple_stem(w) for w in re.findall(r'\w+', content_lower))

        # Fuzzy keyword match count
        keyword_matches = 0
        for k in stemmed_query_keywords:
            # Minimum edit distance to words in doc
            best_dist = min(levenshtein(k, w) for w in content_words) if content_words else 999
            if best_dist <= 2:  # '<=' and threshold were eaten by the forum; value assumed
                keyword_matches += 1

        # Step 4: Combine semantic score + keyword ratio
        # (steps 4-5 reconstructed from the step list above; the forum swallowed them)
        keyword_ratio = keyword_matches / max(len(stemmed_query_keywords), 1)
        d.metadata["combined_score"] = d.metadata["score"] + keyword_ratio
        scored_docs.append(d)

    # Step 5: Softmax-normalize combined scores
    if scored_docs:
        combined = np.array([d.metadata["combined_score"] for d in scored_docs])
        soft = np.exp(combined) / np.sum(np.exp(combined))
        for d, s in zip(scored_docs, soft):
            d.metadata["soft_score"] = float(s)

    final_docs = [
        d for d in scored_docs
        if d.metadata.get("soft_score", 0) > 0.05
    ]
    if not final_docs:
        # fallback
        final_docs = [d for d in docs if d.metadata.get("surah")][:20]

    # Step 6: Group by surah
    surah_groups = {}
    for d in final_docs:
        surah = d.metadata["surah"]
        surah_groups.setdefault(surah, []).append(d)

    if not surah_groups:
        selected_docs = docs[:1]
        merged = "\n".join([d.page_content for d in selected_docs])
        return {"context": merged}

    # Fix the surah scoring bug (use group docs correctly)
    surah_scores = {}
    for sur, sur_docs in surah_groups.items():
        sur_soft_scores = [doc.metadata.get("soft_score", 0) for doc in sur_docs]
        surah_scores[sur] = float(np.mean(sur_soft_scores)) if sur_soft_scores else 0.0

    best_surah = max(surah_scores, key=surah_scores.get)
    docs_in_best_surah = surah_groups[best_surah]

    # Step 7: Find the central ayah (max soft_score)
    best_doc = max(scored_docs, key=lambda x: x.metadata.get("combined_score", 0))

    print(best_doc)
    central_ayah = int(best_doc.metadata["ayah"])

    # Step 8 (Option 2): Return ±2 ayahs context window
    context_window = 2
    min_ayah = central_ayah
    max_ayah = central_ayah
    print(central_ayah)
    print(max_ayah)

    # Qdrant scroll to get all docs in range
    client = retriever.vectorstore.client
    collection_name = retriever.vectorstore.collection_name

    scroll_filter = models.Filter(
        must=[
            models.FieldCondition(
                key="metadata.surah",
                match=models.MatchValue(value=best_surah)
            ),
            models.FieldCondition(
                key="metadata.ayah",
                range=models.Range(gte=min_ayah, lte=max_ayah)
            )
        ]
    )

    scrolled_points, _ = client.scroll(
        collection_name=collection_name,
        scroll_filter=scroll_filter,
        limit=50,
        with_payload=True,
        with_vectors=False
    )

    selected_docs = [
        Document(
            page_content=point.payload["page_content"],
            metadata=point.payload["metadata"]
        )
        for point in scrolled_points
    ]

    if not selected_docs:
        selected_docs = [best_doc]

    merged = "\n".join(d.page_content for d in selected_docs)
    print(merged)
    return {"context": merged}
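For reference, the function calls two helpers that are not shown in the post. A minimal sketch of what `levenshtein` and `simple_stem` might look like (assumed implementations, not necessarily the poster's actual code):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def simple_stem(word: str) -> str:
    """Naive suffix stripping; a real pipeline would likely use a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

With these, `levenshtein("ramadan", "ramdan")` is 1, which is why the fuzzy threshold tolerates the misspelling in the example query.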
How can I change the scoring and selection logic so that the contextually correct ayah (Surah 2:187) is selected instead?
