fix: normalize discovery URLs to root domain

Strip paths from candidate URLs (e.g. /kontakt/, /about/) so that only the
root URL (scheme + domain) is saved. Skip candidates that normalize to a
root URL already seen, so the same root URL is not evaluated twice.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Maciej Pienczyn 2026-02-21 09:09:50 +01:00
parent ced2d0337e
commit 880f5a6715

@@ -164,6 +164,19 @@ def _extract_phones(text):
     return list(dict.fromkeys(phones))[:5]
 
 
+def _normalize_url_to_root(url):
+    """Strip path from URL, keep only scheme + domain (root page)."""
+    try:
+        parsed = urlparse(url)
+        scheme = parsed.scheme or 'https'
+        netloc = parsed.netloc
+        if not netloc:
+            return url
+        return f'{scheme}://{netloc}/'
+    except Exception:
+        return url
+
+
 def _is_directory_domain(url):
     """Check if URL belongs to a known business directory."""
     try:
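
Review note: to make the behaviour of the new helper concrete, here is a minimal standalone sketch that copies its logic and runs it on a few sample inputs. The function name `normalize_url_to_root` (without the leading underscore) and the `example.pl` URLs are illustrative assumptions, not part of the commit.

```python
from urllib.parse import urlparse


def normalize_url_to_root(url):
    """Standalone copy of the helper's logic, for illustration only."""
    try:
        parsed = urlparse(url)
        scheme = parsed.scheme or 'https'
        netloc = parsed.netloc
        if not netloc:
            # No host could be parsed (e.g. a scheme-less URL): return input unchanged.
            return url
        return f'{scheme}://{netloc}/'
    except Exception:
        return url


# Path, query string and fragment are all dropped.
print(normalize_url_to_root('https://example.pl/kontakt/'))
print(normalize_url_to_root('http://www.example.pl/about/?x=1#top'))

# A scheme-less URL has an empty netloc, so it is returned as-is.
print(normalize_url_to_root('example.pl/kontakt'))

# The host keeps its original letter case; only the scheme is lowercased by urlparse.
print(normalize_url_to_root('https://Example.PL/kontakt'))
```

Two things may be worth a follow-up: inputs without a scheme pass through unnormalized, and the returned host is not lowercased, so `Example.PL` and `example.pl` would yield different root strings.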
@@ -248,9 +261,16 @@ class WebsiteDiscoveryService:
         # Evaluate top 3 candidates, pick the best
         best_candidate = None
         best_score = -1
+        seen_urls = set()
 
         for brave_result in urls[:3]:
-            url = brave_result['url']
+            url = _normalize_url_to_root(brave_result['url'])
+
+            # Skip duplicate root URLs (e.g. /kontakt/ and /about/ on same domain)
+            if url in seen_urls:
+                continue
+            seen_urls.add(url)
+
             domain = urlparse(url).netloc.lower()
             if domain.startswith('www.'):
                 domain = domain[4:]
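
Review note: the deduplication step can be sketched in isolation as below. This is a simplified model under stated assumptions, not the service's actual code: `unique_roots`, the `limit` parameter, and the sample result dicts are hypothetical, and only the `url` key is taken from the diff.

```python
from urllib.parse import urlparse


def normalize_url_to_root(url):
    """Same normalization idea as in the commit: keep scheme + host only."""
    parsed = urlparse(url)
    if not parsed.netloc:
        return url
    return f"{parsed.scheme or 'https'}://{parsed.netloc}/"


def unique_roots(results, limit=3):
    """Return distinct root URLs among the first `limit` results, in order."""
    seen = set()
    roots = []
    # The slice is taken before deduplication, mirroring `urls[:3]` in the diff.
    for result in results[:limit]:
        root = normalize_url_to_root(result['url'])
        if root in seen:
            continue
        seen.add(root)
        roots.append(root)
    return roots


# Hypothetical search results.
sample = [
    {'url': 'https://example.pl/kontakt/'},
    {'url': 'https://example.pl/about/'},
    {'url': 'https://www.example.pl/'},
    {'url': 'https://other.pl/'},
]

print(unique_roots(sample))           # only the first three results are examined
print(unique_roots(sample, limit=4))  # a wider window also reaches other.pl
```

In this sketch, duplicates still occupy slots in the top-3 window, so `other.pl` is never examined with the default limit. The `www.` and bare variants of a host also count as distinct roots, because the comparison runs on the normalized URL before any `www.` stripping.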