devlog 9/14/2024

Railway:

  • Backed off using NGINX Proxy Manager on Railway
  • Migrated docs to Docmost (tried some others that wouldn’t deploy; Docmost works fine)
  • DNS and custom domain working for Ghost
  • There are issues with Hover’s DNS situation that make Cloudflare necessary

AI:

  • Actually ponied up for Claude. It’s just so good at coding, it’s insane.

Research:

  • Scraping script perfected in ~2-3 hours with Claude 3.5
    • Great features like clean exit, rate limiting, resume from exit, no clobber.
    • Took around 50 turns across 7 chats
    • Avoided the token limit by asking for a readme using the following prompt:

      Write a readme for this script that includes every feature and implementation detail. A product manager should be able to recreate this script from this readme.

      In particular, explicitly lay out CSS selectors and parsing implementations for each field.

    • This worked surprisingly well. Using the output I was able to regenerate the project in a new chat with enough context room for large HTML samples to pluck CSS selectors from.
    • With that, the script can perfectly scrape the headline, author(s), publish date, URL, and body text. Newlines are preserved, thank goodness.
  • Source extraction
    • I did a brief trial using Gemini Pro Experimental (with its 2M context size) to extract every source named in scraped news stories along with their job title and company.
    • It worked great, but it needs a more explicit prompt to avoid grabbing the article’s author.
    • I have yet to test it, but I would love to process this part of the workflow on the local GPUs with a smaller model.
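
For anyone wondering what clean exit, rate limiting, resume from exit, and no clobber look like together, here is a minimal sketch. It is not the actual script: the `Scraper` class, the `done.json` state file, and the hashed file names are all made up for illustration, and the fetch function is passed in so the skeleton does not assume any particular HTTP library.

```python
import hashlib
import json
import signal
import time
from pathlib import Path
from typing import Callable, Iterable


class Scraper:
    """Hypothetical skeleton, not the real script from this post."""

    def __init__(self, fetch: Callable[[str], str], out_dir: Path,
                 min_interval: float = 1.0) -> None:
        self.fetch = fetch                    # any callable: URL -> page text
        self.out_dir = Path(out_dir)
        self.out_dir.mkdir(parents=True, exist_ok=True)
        self.state_file = self.out_dir / "done.json"  # assumed file name
        self.min_interval = min_interval      # seconds between requests
        self.stop_requested = False
        self._last_request = None
        # Resume: reload the set of URLs finished in earlier runs.
        self.done = set()
        if self.state_file.exists():
            self.done = set(json.loads(self.state_file.read_text()))

    def request_stop(self, *_args) -> None:
        """Clean exit: finish the current page, then stop."""
        self.stop_requested = True

    def install_sigint_handler(self) -> None:
        # Ctrl-C sets a flag instead of raising KeyboardInterrupt mid-write.
        signal.signal(signal.SIGINT, self.request_stop)

    def _path_for(self, url: str) -> Path:
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
        return self.out_dir / f"{name}.html"

    def _wait(self) -> None:
        # Rate limiting: keep at least min_interval between requests.
        if self._last_request is not None:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
        self._last_request = time.monotonic()

    def _save_state(self) -> None:
        # Write to a temp file first so an interrupted save leaves the
        # previous state file intact.
        tmp = self.state_file.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(self.done)))
        tmp.replace(self.state_file)

    def run(self, urls: Iterable[str]) -> int:
        """Scrape each URL once; return how many pages were fetched."""
        fetched = 0
        for url in urls:
            if self.stop_requested:
                break
            target = self._path_for(url)
            # No clobber: never refetch or overwrite an existing page.
            if url in self.done or target.exists():
                self.done.add(url)
                continue
            self._wait()
            target.write_text(self.fetch(url), encoding="utf-8")
            self.done.add(url)
            self._save_state()
            fetched += 1
        self._save_state()
        return fetched
```

To use it, pass in whatever fetch function you like, call `install_sigint_handler()` once, then `run()` with the URL list. Running it again on the same list after a Ctrl-C picks up where it left off, since finished URLs and existing files are skipped.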
