Press "Enter" to skip to content

devlog 9/17/2024

Scraping refactor:

  • I made a proof-of-concept that crawled, scraped, and parsed the news site we’ve been optimizing around. Originally it was set up to scrape locally and write the data to a JSONL file, tracking crawl state with plain text files (a sketch of that flow follows this list).
  • This afternoon I streamlined the scraper: it now scrapes the site and writes the data straight into the PostgreSQL database, with no intermediate files (second sketch below).
  • I completely redesigned the body text scraper. Regex and BS4 didn’t seem like a viable path, so instead of trying to parse the HTML I send the entire chunk to an LLM with an extraction prompt (third sketch below). This works great with Google Gemini, but terribly with open models.
  • As mentioned, open models failed hilariously in this test. I will try that LongWriter model and report back.
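A minimal sketch of what the original local flow could look like. The file names (seen_urls.txt, articles.jsonl) and the record function are my own placeholders, not the actual proof-of-concept code:

```python
import json
from pathlib import Path

SEEN_FILE = Path("seen_urls.txt")   # hypothetical plain-text state file
OUT_FILE = Path("articles.jsonl")   # hypothetical JSONL output file

def already_seen(url: str) -> bool:
    """Check the text state file for a URL we've already scraped."""
    return SEEN_FILE.exists() and url in SEEN_FILE.read_text().splitlines()

def record(url: str, article: dict) -> None:
    """Append one parsed article as a JSONL line and mark its URL as seen."""
    with OUT_FILE.open("a", encoding="utf-8") as f:
        f.write(json.dumps(article, ensure_ascii=False) + "\n")
    with SEEN_FILE.open("a", encoding="utf-8") as f:
        f.write(url + "\n")
```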
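For the streamlined version, something along these lines gets each record into Postgres with no intermediate files. The driver (psycopg2) and the articles table schema are assumptions, since the post doesn't specify either:

```python
import psycopg2  # assumed driver; the post doesn't name one

def upload_article(conn, article: dict) -> None:
    """Insert one scraped article directly into the database (hypothetical schema)."""
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO articles (url, title, body, published_at)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (url) DO NOTHING
            """,
            (article["url"], article["title"], article["body"], article["published_at"]),
        )
    conn.commit()

# usage sketch: conn = psycopg2.connect("dbname=news user=scraper")
```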
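And the LLM-based body extraction roughly works like this. The package (google-generativeai), model name, and prompt wording are all my assumptions; only the idea of handing raw HTML to Gemini comes from the post:

```python
import google.generativeai as genai  # assumed package

genai.configure(api_key="...")                      # key elided
model = genai.GenerativeModel("gemini-1.5-flash")   # model name is an assumption

PROMPT = (
    "Extract the article body text from the following HTML. "
    "Return only the plain text of the article, nothing else.\n\n{html}"
)

def extract_body(raw_html: str) -> str:
    """Send the whole HTML chunk to the LLM and take its output as the body text."""
    response = model.generate_content(PROMPT.format(html=raw_html))
    return response.text.strip()
```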

Can’t wait to dive into this dataset 🥰
