
devlog 9/15/2024

Railway:

  • Spun up a PostgreSQL database

Scraper:

  • No edits to core scraper during today’s session

Cleaning:

  • I wrote a series of scripts that use regex to strip junk from the body text, such as article recommendations and affiliate disclaimers
  • I started out cleaning the rows in the database itself, but for several reasons it makes more sense to clean the JSONL before it’s uploaded
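
A minimal sketch of what one of these JSONL cleaning passes might look like. The junk patterns, the `body` field name, and the file paths are all assumptions for illustration; the real patterns depend on what the scraped site actually emits.

```python
import json
import re

# Hypothetical junk patterns (assumptions, not the actual ones used).
JUNK_PATTERNS = [
    # Lines such as "Read more: ..." or "Related: ..."
    re.compile(r"^[ \t]*(?:Read more|Related|Recommended):.*$",
               re.IGNORECASE | re.MULTILINE),
    # Any line mentioning affiliate links or an affiliate commission
    re.compile(r"^.*\baffiliate (?:links?|commission)\b.*$",
               re.IGNORECASE | re.MULTILINE),
]


def clean_body(text: str) -> str:
    """Remove junk lines, then collapse the blank lines they leave behind."""
    for pattern in JUNK_PATTERNS:
        text = pattern.sub("", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def clean_jsonl(src_path: str, dst_path: str, field: str = "body") -> int:
    """Write a cleaned copy of a JSONL file; return the number of records."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines between records
            record = json.loads(line)
            if isinstance(record.get(field), str):
                record[field] = clean_body(record[field])
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Writing to a separate output file (for example `clean_jsonl("scraped.jsonl", "scraped.clean.jsonl")`, both names hypothetical) keeps the raw scrape untouched, so a bad regex can be fixed and rerun without scraping again.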

Uploading:

  • The first thing I coded this morning was the script that actually uploads the data to the database. Once I got it into the database I noticed all the junk and artifacts 😅
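
For reference, an upload script along these lines could be sketched as follows. This is not the actual script: the `articles` table, its `url`/`title`/`body` columns, and the matching JSONL keys are assumptions, and the connection is expected to be a psycopg2-style DB-API connection created elsewhere.

```python
import json

# Hypothetical table and column names (assumptions for illustration).
INSERT_SQL = "INSERT INTO articles (url, title, body) VALUES (%s, %s, %s)"


def record_to_row(record: dict) -> tuple:
    """Map one JSONL record to the column order used by INSERT_SQL."""
    return (record.get("url"), record.get("title"), record.get("body"))


def read_rows(jsonl_path: str):
    """Yield one row tuple per non-blank line of the JSONL file."""
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield record_to_row(json.loads(line))


def upload(conn, rows) -> int:
    """Insert all rows in one transaction; return how many were sent."""
    rows = list(rows)
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, rows)
    conn.commit()
    return len(rows)


# Usage sketch, assuming psycopg2 is installed and the connection string
# is available in an environment variable (the name is an assumption):
#
#   import os, psycopg2
#   conn = psycopg2.connect(os.environ["DATABASE_URL"])
#   upload(conn, read_rows("scraped.clean.jsonl"))
#   conn.close()
```

Passing the values as parameters rather than formatting them into the SQL string lets the driver handle quoting, which matters for scraped text full of apostrophes and stray symbols.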

Future improvements:

  • Improve the scraper’s parsing so that the junk data never enters the pipeline
  • Use JSONL files as local backups, but directly upload scraped data to database once the scraper is debugged
