Overview
A web scraping system that extracts court case data (filing dates, case numbers, party names, and addresses) from county court portals. The system navigates web interfaces protected by AWS WAF and automatically solves audio CAPTCHAs using AI speech recognition.
The Challenge
Court records are public information, but accessing them at scale is intentionally difficult. The portals use CAPTCHAs, session management, pagination, and anti-bot protections. A legal services firm needed bulk access to filing data across multiple counties for lead generation, a task that would take a human researcher weeks to complete manually.
What I Built
- Selenium automation with Firefox WebDriver navigating the Odyssey court portal system used by Georgia counties
- Audio CAPTCHA solver: the system requests the audio version of the CAPTCHA, applies frequency filtering to isolate the voice from background noise, then transcribes it using the Whisper speech recognition API
- Multi-scraper architecture handling different court case types (civil, criminal, domestic) with separate extraction logic
- Pagination and deduplication ensuring complete data capture without duplicates across multi-page result sets
- CSV data pipeline outputting clean, structured data ready for import into the client's CRM
- AWS WAF evasion through realistic browser fingerprinting and request timing
Tech Stack
Python, Selenium, BeautifulSoup4, Whisper API (DeepInfra), signal processing (frequency filtering), Firefox WebDriver, CSV
