Railway:
- Spun up a PostgreSQL database
Scraper:
- No edits to core scraper during today’s session
Cleaning:
- I created a series of scripts that use regex to strip various kinds of junk from the body text, like article recommendations and affiliate disclaimers
- I was initially doing this against the database itself, but it makes much more sense to clean the JSONL before it's uploaded
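A minimal sketch of what a cleaning pass over the JSONL could look like. The `body` field name and the junk patterns here are hypothetical stand-ins; the real scripts target site-specific text:

```python
import json
import re

# Hypothetical junk patterns; the actual scripts match site-specific text.
JUNK_PATTERNS = [
    re.compile(r"^Read more:.*$", re.MULTILINE),  # inline article recommendations
    re.compile(r"^This post may contain affiliate links\..*$", re.MULTILINE),  # disclaimers
]

def clean_body(text: str) -> str:
    """Strip known junk lines, then collapse the blank runs they leave behind."""
    for pat in JUNK_PATTERNS:
        text = pat.sub("", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

def clean_jsonl(in_path: str, out_path: str) -> None:
    """Rewrite a JSONL file with cleaned body text, one record per line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            record["body"] = clean_body(record["body"])
            dst.write(json.dumps(record) + "\n")
```

Running the cleanup on the files rather than the database also means a bad pattern can be fixed and re-run from the raw JSONL at any time.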
Uploading:
- The first thing I coded this morning was the script that actually uploads the data to the database. Only once the data was in the database did I notice all the junk and artifacts 😅
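A rough sketch of the upload step, assuming psycopg2 and a hypothetical `articles(url, title, body)` table with a unique constraint on `url` (Railway exposes the Postgres connection string as `DATABASE_URL`):

```python
import json
import os

def rows_from_jsonl(lines):
    """Turn JSONL lines into (url, title, body) tuples for a batch insert.
    The field names are assumptions about the scraper's output schema."""
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        yield (rec["url"], rec["title"], rec["body"])

def upload(jsonl_path: str) -> None:
    """Batch-insert cleaned records into Postgres in one transaction."""
    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    with conn, conn.cursor() as cur:
        with open(jsonl_path) as f:
            execute_values(
                cur,
                # ON CONFLICT assumes a unique constraint on url,
                # so re-running the script won't duplicate rows.
                "INSERT INTO articles (url, title, body) VALUES %s "
                "ON CONFLICT (url) DO NOTHING",
                list(rows_from_jsonl(f)),
            )
    conn.close()
```

Batching with `execute_values` keeps the upload to a single round trip per chunk instead of one INSERT per article.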
Future improvements:
- Improve the scraper’s parsing so that the junk data never enters the pipeline
- Keep JSONL files only as local backups, and upload scraped data directly to the database once the scraper is debugged