Scraping refactor:
- I made a proof-of-concept that crawled, scraped, and parsed the news site we’ve been optimizing around. Originally it was set up to scrape locally and output the data into a JSONL file, tracking state with text files.
- This afternoon I streamlined the scraper. It now scrapes the site and writes the results straight into the PostgreSQL database, with no intermediary files at all.
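A minimal sketch of that direct-to-Postgres step, assuming psycopg2 as the driver and a hypothetical `articles` table (the table name, columns, and upsert key are my guesses, not the actual schema):

```python
from dataclasses import dataclass

# Hypothetical schema -- the real table and columns may differ.
UPSERT_SQL = """
INSERT INTO articles (url, title, body, published_at)
VALUES (%s, %s, %s, %s)
ON CONFLICT (url) DO UPDATE
SET title = EXCLUDED.title, body = EXCLUDED.body;
"""


@dataclass
class Article:
    url: str
    title: str
    body: str
    published_at: str


def to_row(article: Article) -> tuple:
    """Shape one scraped article into the parameter tuple for UPSERT_SQL."""
    return (article.url, article.title.strip(), article.body.strip(), article.published_at)


def upload(articles: list[Article], dsn: str) -> None:
    """Write scraped articles straight to Postgres -- no JSONL or state files."""
    import psycopg2  # deferred import so the sketch parses without the driver installed

    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.executemany(UPSERT_SQL, [to_row(a) for a in articles])
```

Keying the upsert on `url` makes re-crawls idempotent, which is what replaces the old text-file state tracking.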
- I completely redesigned the body text scraper. Instead of trying to parse the HTML with regex or BeautifulSoup (neither seemed like a viable path), I just send the entire chunk to an LLM with an extraction prompt. This works great with Google Gemini, but terribly with open models.
- As mentioned, the open models failed hilariously in this test. I'll try the LongWriter model next and report back.
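The LLM extraction step above can be sketched roughly like this, using the `google-generativeai` client; the prompt wording and model name are illustrative, not the ones actually used:

```python
# Illustrative prompt -- the real one may be worded differently.
PROMPT_TEMPLATE = (
    "Below is the raw HTML of a news article page. "
    "Return only the article's body text, with no markup, "
    "navigation, ads, or comments.\n\nHTML:\n{html}"
)


def build_prompt(html: str) -> str:
    """Wrap the entire raw HTML chunk in the extraction prompt -- no parsing at all."""
    return PROMPT_TEMPLATE.format(html=html)


def extract_body(html: str, model_name: str = "gemini-1.5-pro") -> str:
    """Send the whole HTML chunk to Gemini and return its plain-text answer."""
    import google.generativeai as genai  # deferred; assumes an API key is configured

    model = genai.GenerativeModel(model_name)
    response = model.generate_content(build_prompt(html))
    return response.text.strip()
```

Swapping in an open model is just a matter of pointing the same prompt at a different endpoint, which is what made the quality gap so easy to see.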
Can’t wait to dive into this dataset 🥰