Railway:
- Spun up a PostgreSQL database
Scraper:
- No edits to core scraper during today’s session
Cleaning:
- I created a series of scripts that use regex to strip various kinds of junk from the body text, like article recommendations and affiliate disclaimers
- I was initially doing this against the database itself, but it makes much more sense to clean the JSONL before it's uploaded
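A minimal sketch of what a cleaning pass over the JSONL could look like. The `body` field name and the junk patterns here are hypothetical stand-ins; the real scripts target site-specific text:

```python
import json
import re

# Hypothetical junk patterns; the actual scripts match site-specific text.
JUNK_PATTERNS = [
    re.compile(r"^Read more:.*$", re.MULTILINE),  # inline article recommendations
    re.compile(r"^This post may contain affiliate links\..*$", re.MULTILINE),  # disclaimers
]

def clean_body(text: str) -> str:
    """Strip known junk lines, then collapse the blank runs they leave behind."""
    for pat in JUNK_PATTERNS:
        text = pat.sub("", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

def clean_jsonl(in_path: str, out_path: str) -> None:
    """Rewrite a JSONL file with cleaned body text, one record per line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            record["body"] = clean_body(record["body"])
            dst.write(json.dumps(record) + "\n")
```

Running the cleanup on the files rather than the database also means a bad pattern can be fixed and re-run from the raw JSONL at any time.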
Uploading:
- The first thing I coded this morning was the script that actually uploads the data to the database. Only once the data was in the database did I notice all the junk and artifacts 😅
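A rough sketch of the upload step, assuming psycopg2 and a hypothetical `articles(url, title, body)` table with a unique constraint on `url` (Railway exposes the Postgres connection string as `DATABASE_URL`):

```python
import json
import os

def rows_from_jsonl(lines):
    """Turn JSONL lines into (url, title, body) tuples for a batch insert.
    The field names are assumptions about the scraper's output schema."""
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        yield (rec["url"], rec["title"], rec["body"])

def upload(jsonl_path: str) -> None:
    """Batch-insert cleaned records into Postgres in one transaction."""
    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        with open(jsonl_path) as f:
            execute_values(
                cur,
                # ON CONFLICT assumes a unique constraint on url,
                # so re-running the script won't duplicate rows.
                "INSERT INTO articles (url, title, body) VALUES %s "
                "ON CONFLICT (url) DO NOTHING",
                list(rows_from_jsonl(f)),
            )
    conn.close()
```

Batching with `execute_values` keeps the upload to a single round trip per chunk instead of one INSERT per article.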
Future improvements:
- Improve the scraper’s parsing so that the junk data never enters the pipeline
- Keep JSONL files only as local backups, and upload scraped data directly to the database once the scraper is debugged