fix: normalize discovery URLs to root domain
Some checks are pending
NordaBiz Tests / Unit & Integration Tests (push) Waiting to run
NordaBiz Tests / E2E Tests (Playwright) (push) Blocked by required conditions
NordaBiz Tests / Smoke Tests (Production) (push) Blocked by required conditions
NordaBiz Tests / Send Failure Notification (push) Blocked by required conditions
Strip paths from candidate URLs (e.g. /kontakt/, /about/) so that the root domain is always saved. Deduplicate results that point to the same domain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent ced2d0337e
commit 880f5a6715
@@ -164,6 +164,19 @@ def _extract_phones(text):
     return list(dict.fromkeys(phones))[:5]
 
 
+def _normalize_url_to_root(url):
+    """Strip path from URL, keep only scheme + domain (root page)."""
+    try:
+        parsed = urlparse(url)
+        scheme = parsed.scheme or 'https'
+        netloc = parsed.netloc
+        if not netloc:
+            return url
+        return f'{scheme}://{netloc}/'
+    except Exception:
+        return url
+
+
 def _is_directory_domain(url):
     """Check if URL belongs to a known business directory."""
     try:
@@ -248,9 +261,16 @@ class WebsiteDiscoveryService:
         # Evaluate top 3 candidates, pick the best
         best_candidate = None
         best_score = -1
+        seen_urls = set()
 
         for brave_result in urls[:3]:
-            url = brave_result['url']
+            url = _normalize_url_to_root(brave_result['url'])
+
+            # Skip duplicate root URLs (e.g. /kontakt/ and /about/ on same domain)
+            if url in seen_urls:
+                continue
+            seen_urls.add(url)
+
             domain = urlparse(url).netloc.lower()
             if domain.startswith('www.'):
                 domain = domain[4:]
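A minimal standalone sketch of the behaviour this commit describes, for trying the normalization and deduplication outside the service. `normalize_url_to_root` is a simplified copy of the helper in the diff (the `try`/`except` is omitted), while `unique_roots` and the `sample` data are hypothetical names introduced only for illustration; the real loop also scores each candidate, which is not shown here.

```python
from urllib.parse import urlparse


def normalize_url_to_root(url):
    """Simplified copy of the diff's helper: keep scheme + host, drop the path."""
    parsed = urlparse(url)
    if not parsed.netloc:
        # Scheme-less input such as 'example.com/kontakt' parses with an
        # empty netloc, so it is returned unchanged.
        return url
    return f"{parsed.scheme or 'https'}://{parsed.netloc}/"


def unique_roots(results):
    """Hypothetical helper: root URLs in first-seen order, without repeats."""
    seen = set()
    roots = []
    for result in results:
        root = normalize_url_to_root(result['url'])
        if root in seen:
            continue  # same root already seen, e.g. /kontakt/ and /about/
        seen.add(root)
        roots.append(root)
    return roots


# Hypothetical search results; only the 'url' key is used.
sample = [
    {'url': 'https://example.com/kontakt/'},
    {'url': 'https://example.com/about/'},
    {'url': 'https://www.example.com/'},
]
print(unique_roots(sample))
# ['https://example.com/', 'https://www.example.com/']
```

Two things this sketch makes visible, assuming the helper behaves as shown in the diff: the comparison is an exact string match on `scheme://netloc/`, so `www.example.com` and `example.com` count as different roots at the deduplication step, and a scheme-less input is passed through with its path intact. Whether either case matters depends on what the search API returns in practice.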