Scraping refactor:
- I made a proof-of-concept that crawled, scraped, and parsed the news site we’ve been optimizing around. Originally it was set up to scrape locally and output the data into a JSONL file, tracking state with text files.
- This afternoon I streamlined the scraper. It now scrapes the site and writes the results straight into the PostgreSQL database, with no intermediary files at all.
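A minimal sketch of that direct-to-Postgres step, assuming psycopg2 as the driver and a hypothetical `articles` table (the table name, columns, and upsert key are my guesses, not the actual schema):

```python
from dataclasses import dataclass

# Hypothetical schema -- the real table and columns may differ.
UPSERT_SQL = """
INSERT INTO articles (url, title, body, published_at)
VALUES (%s, %s, %s, %s)
ON CONFLICT (url) DO UPDATE
SET title = EXCLUDED.title, body = EXCLUDED.body;
"""


@dataclass
class Article:
    url: str
    title: str
    body: str
    published_at: str


def to_row(article: Article) -> tuple:
    """Shape one scraped article into the parameter tuple for UPSERT_SQL."""
    return (article.url, article.title.strip(), article.body.strip(), article.published_at)


def upload(articles: list[Article], dsn: str) -> None:
    """Write scraped articles straight to Postgres -- no JSONL or state files."""
    import psycopg2  # deferred import so the sketch parses without the driver installed

    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.executemany(UPSERT_SQL, [to_row(a) for a in articles])
```

Keying the upsert on `url` makes re-crawls idempotent, which is what replaces the old text-file state tracking.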
- I completely redesigned the body text scraper. Instead of trying to parse the HTML with regex or BeautifulSoup (neither seemed like a viable path), I just send the entire chunk to an LLM with an extraction prompt. This works great with Google Gemini, but terribly with open models.
- As mentioned, the open models failed hilariously in this test. I'll try the LongWriter model next and report back.
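The LLM extraction step above can be sketched roughly like this, using the `google-generativeai` client; the prompt wording and model name are illustrative, not the ones actually used:

```python
# Illustrative prompt -- the real one may be worded differently.
PROMPT_TEMPLATE = (
    "Below is the raw HTML of a news article page. "
    "Return only the article's body text, with no markup, "
    "navigation, ads, or comments.\n\nHTML:\n{html}"
)


def build_prompt(html: str) -> str:
    """Wrap the entire raw HTML chunk in the extraction prompt -- no parsing at all."""
    return PROMPT_TEMPLATE.format(html=html)


def extract_body(html: str, model_name: str = "gemini-1.5-pro") -> str:
    """Send the whole HTML chunk to Gemini and return its plain-text answer."""
    import google.generativeai as genai  # deferred; assumes an API key is configured

    model = genai.GenerativeModel(model_name)
    response = model.generate_content(build_prompt(html))
    return response.text.strip()
```

Swapping in an open model is just a matter of pointing the same prompt at a different endpoint, which is what made the quality gap so easy to see.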
Can’t wait to dive into this dataset 🥰