Railway:
- Backed off using NGINX Proxy Manager on Railway
- Migrated docs to Docmost (tried some others that didn’t deploy; Docmost works fine)
- DNS and custom domain working for Ghost
- There are issues with Hover’s DNS situation that make Cloudflare necessary
AI:
- Actually ponied up for Claude. It’s just so good at coding, it’s insane.
Research:
- Scraping script perfected in ~2-3 hours with Claude 3.5
- Great features like clean exit, rate limiting, resume from exit, no clobber.
- Took around 50 turns across 7 chats
- Avoided the token limit by asking for a readme using the following prompt:
Write a readme for this script that includes every feature and implementation detail. A product manager should be able to recreate this script from this readme.
In particular, explicitly lay out CSS selectors and parsing implementations for each field.
- This worked surprisingly well. Using the output I was able to regenerate the project in a new chat with enough context room for large HTML samples to pluck CSS selectors from.
- With that, the script can perfectly scrape the headline, author(s), publish date, URL, and body text. Newlines are preserved, thank goodness.
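For reference, here is a rough sketch of how the clean exit, rate limiting, resume, and no-clobber features could fit together. This is not the actual script: the names `StopFlag`, `RateLimiter`, `output_path`, and `scrape_all` are made up for illustration, `fetch` stands in for whatever downloads and parses a page, and storing one JSON file per URL is an assumption.

```python
import hashlib
import json
import signal
import time
from pathlib import Path


class StopFlag:
    """Flipped by Ctrl-C so the loop can stop between items instead of mid-write."""

    def __init__(self):
        self.stop = False

    def install(self):
        # Replace the default KeyboardInterrupt behaviour with a flag flip.
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        self.stop = True


class RateLimiter:
    """Wait until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self.last = None

    def wait(self):
        if self.last is not None:
            remaining = self.min_interval - (time.monotonic() - self.last)
            if remaining > 0:
                time.sleep(remaining)
        self.last = time.monotonic()


def output_path(out_dir, url):
    # Hypothetical naming scheme: one JSON file per URL, named by a hash of the URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return Path(out_dir) / f"{digest}.json"


def scrape_all(urls, fetch, out_dir, limiter, flag):
    """Scrape each URL once and return the URLs fetched during this run."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    scraped = []
    for url in urls:
        if flag.stop:
            break  # clean exit: nothing is half-written at this point
        target = output_path(out_dir, url)
        if target.exists():
            continue  # no clobber, which also gives resume-from-exit for free
        limiter.wait()
        record = fetch(url)
        # Write to a temp file, then rename, so an interrupted write never
        # leaves a partial file that a later run would skip.
        tmp = target.with_suffix(".tmp")
        tmp.write_text(json.dumps(record), encoding="utf-8")
        tmp.replace(target)
        scraped.append(url)
    return scraped
```

Call `flag.install()` once before `scrape_all(...)`. Running the same command again after an interruption picks up where it left off, because anything already on disk is skipped.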
- Source extraction
- I did a brief trial using Gemini Pro Experimental (with its 2M context size) to extract every source named in scraped news stories along with their job title and company.
- It worked well, but it needs a more explicit prompt to avoid grabbing the article’s author.
- I have yet to test it, but I would love to process this part of the workflow on the local GPUs with a smaller model.
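One way the more explicit prompt might look, as a sketch only. The prompt wording, the JSON reply format, and the names `PROMPT_TEMPLATE`, `build_prompt`, and `parse_sources` are all assumptions, not what I actually ran. The idea is to name the byline authors in the prompt and then filter them out again afterwards in case the model includes them anyway. The model call itself is left out, so this should work the same with a hosted model or a smaller local one.

```python
import json

# Hypothetical prompt: names the byline explicitly so the model can exclude it.
PROMPT_TEMPLATE = (
    "List every person quoted or cited as a source in the news story below.\n"
    "For each one give their name, job title, and company.\n"
    "Do NOT include the story's own author(s): {authors}.\n"
    "Reply with only a JSON array of objects with the keys "
    '"name", "title", and "company". Use null for anything not stated.\n\n'
    "Headline: {headline}\n\n{body}"
)


def build_prompt(article):
    """Fill the template from a scraped article (headline, authors, body)."""
    return PROMPT_TEMPLATE.format(
        authors=", ".join(article["authors"]),
        headline=article["headline"],
        body=article["body"],
    )


def parse_sources(reply_text, authors):
    """Parse the model's JSON reply and drop any entry matching a byline author."""
    bylines = {name.strip().lower() for name in authors}
    sources = []
    for item in json.loads(reply_text):
        name = (item.get("name") or "").strip()
        if not name or name.lower() in bylines:
            continue  # belt and braces: the prompt may not be enough on its own
        sources.append(
            {"name": name, "title": item.get("title"), "company": item.get("company")}
        )
    return sources
```

Send the string from `build_prompt(article)` to the model, then pass its reply and `article["authors"]` to `parse_sources`.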