Self-hosting & web scraping (September 14)

Railway:

Backed off using NGINX Proxy Manager on Railway
Migrated docs to Docmost, which works fine (tried some others but didn’t deploy them)
DNS and custom domain working for Ghost
Issues with Hover’s DNS make Cloudflare necessary

AI:

Actually ponied up for Claude, it’s just so good at coding, it’s insane.

Research:

Scraping script perfected in ~2-3 hours with Claude 3.5
- Great features like clean exit, rate limiting, resume from exit, no clobber.
- Took around 50 turns across 7 chats
- Avoided the token limit by asking for a readme using the following prompt:

Write a readme for this script that includes every feature and implementation detail. A product manager should be able to recreate this script from this readme.

In particular, explicitly lay out CSS selectors and parsing implementations for each field.

This worked surprisingly well. Using the output I was able to regenerate the project in a new chat with enough context room for large HTML samples to pluck CSS selectors from.
With that, the script can perfectly scrape the headline, author(s), publish date, URL, and body text. Newlines are preserved, thank goodness.
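
For reference, here is a minimal sketch of how the four features above (clean exit, rate limiting, resume from exit, no clobber) could fit together. It is not the actual script: `PoliteScraper`, the `done.json` progress file, the hashed filenames, and the `fetch` callable are all hypothetical names chosen for illustration, and it only saves raw pages without parsing any fields.

```python
import hashlib
import json
import signal
import time
from pathlib import Path
from typing import Callable, Iterable


class PoliteScraper:
    """Hypothetical scrape loop: rate limiting, resume, no clobber, clean exit."""

    def __init__(self, out_dir: Path, fetch: Callable[[str], str], delay: float = 1.0):
        self.out_dir = out_dir
        self.fetch = fetch  # assumed: takes a URL, returns the page HTML
        self.delay = delay  # seconds to wait after each request
        self.stop = False
        out_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical progress file listing URLs that are already saved.
        self.state_file = out_dir / "done.json"
        if self.state_file.exists():
            self.done = set(json.loads(self.state_file.read_text()))
        else:
            self.done = set()

    def request_stop(self, *_args) -> None:
        # Clean exit: only set a flag; the loop stops before the next URL.
        self.stop = True

    def path_for(self, url: str) -> Path:
        # Stable filename derived from the URL.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
        return self.out_dir / f"{name}.html"

    def run(self, urls: Iterable[str]) -> int:
        saved = 0
        for url in urls:
            if self.stop:
                break
            target = self.path_for(url)
            # Resume + no clobber: skip anything recorded or already on disk.
            if url in self.done or target.exists():
                continue
            target.write_text(self.fetch(url), encoding="utf-8")
            self.done.add(url)
            self.state_file.write_text(json.dumps(sorted(self.done)))
            saved += 1
            time.sleep(self.delay)  # rate limiting
        return saved


def install_sigint(scraper: PoliteScraper) -> None:
    # Ctrl-C asks the loop to finish the current page and then stop.
    signal.signal(signal.SIGINT, scraper.request_stop)
```

Writing the progress file after every page means an interrupted run loses at most the page in flight, and rerunning with the same URL list picks up where it stopped.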
Source extraction
- I did a brief trial using Gemini Pro Experimental (with its 2M context size) to extract every source named in scraped news stories along with their job title and company.
- It worked well, but it needs a more explicit prompt to avoid grabbing the article’s author. Still, worked great.
- I have yet to test it, but I would love to process this part of the workflow on the local GPUs with a smaller model.
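
One way the "more explicit prompt" could look, as a rough sketch only: name the byline in the prompt and also drop it from the reply as a safety net. The prompt wording, the JSON shape (`name`, `title`, `company`), and the helper names `build_prompt` and `parse_sources` are assumptions for illustration, not the prompt used in the trial, and no model API is called here.

```python
import json
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    title: str
    company: str


def build_prompt(article_text: str, authors: list[str]) -> str:
    # Spell out the byline so the model is told exactly whom to leave out.
    byline = ", ".join(authors) or "unknown"
    return (
        "List every person quoted or cited as a source in the news story below.\n"
        "Return a JSON array of objects with keys: name, title, company.\n"
        f"Do NOT include the article's author(s): {byline}.\n\n"
        f"STORY:\n{article_text}"
    )


def parse_sources(model_reply: str, authors: list[str]) -> list[Source]:
    # Safety net: filter out the byline even if the model ignored the prompt.
    skip = {a.strip().lower() for a in authors}
    sources = []
    for item in json.loads(model_reply):
        name = (item.get("name") or "").strip()
        if not name or name.lower() in skip:
            continue
        sources.append(
            Source(name, item.get("title") or "", item.get("company") or "")
        )
    return sources
```

The scraper already captures the author field, so passing it into both functions costs nothing and should work the same way with a smaller local model.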
  
  Update: Claude rate-limited me so often that I gave up on the Pro account.
