
Tom Hodson

Maker, Baker, Programmer, Reformed Physicist, RSE @ ECMWF


Selfhosting: Miniflux and RSSHub

Like many nerdy, computery types, I like to subscribe to blogs and other content through RSS. RSS is crazy simple: you host a URL on a website with a list of posts, with titles/URLs/content encoded in XML (I know, I know, but it only has like 5 tags and is only nested one level deep). An RSS reader just checks a big list of those URLs every now and then and presents you with the latest things to show up.
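To show just how simple the format is, here is a made-up but valid minimal feed, parsed with nothing more than the Python standard library (the feed content is invented for the example):

```python
import xml.etree.ElementTree as ET

# A minimal RSS 2.0 document: a <channel> with a couple of <item>s.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>https://example.com</link>
    <description>An example feed</description>
    <item>
      <title>Hello, world</title>
      <link>https://example.com/hello</link>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

channel = ET.fromstring(FEED).find("channel")
# Each post is just a title and a link, one level below <channel>.
posts = [(i.findtext("title"), i.findtext("link")) for i in channel.findall("item")]
print(posts[0])  # ('Hello, world', 'https://example.com/hello')
```

That's essentially all a reader has to do per feed, plus remembering which items it has already shown you.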

Incidentally, this is also how podcasts work, at least for now; Spotify is clearly trying to capture that ecosystem.

Anyway, I usually use theoldreader to read RSS feeds but lately they’ve implemented a premium version that you have to pay $3 a month for if you have more than 100 feeds (I have 99…).

Honestly, I use their service a lot so somehow $3 doesn’t seem so bad, but it spurred me to look into selfhosting.

Selfhosting seems to be all the rage these days. Probably in response to feeling locked into corporate megastructures, the aforementioned computery nerdy types have gone looking for ways to maintain their own anarchic web infrastructure. See e.g. the indieweb movement, Mastodon, and so on.

So I want to try out some self hosting. Let's start with an RSS reader: Miniflux seems well regarded. So I popped over there, grabbed a docker-compose.yml, ran docker compose up -d, and we seem to be off to the races.

OK, one nice thing about Miniflux compared to theoldreader: it seems to be much better at telling you when there's something wrong with your feeds. It told me about a few blogs it couldn't reach, notably Derek Lowe's excellent blog about chemical drug discovery.

That blog has an RSS feed, which loads perfectly fine in my browser but doesn't seem to work outside of that context, e.g. in Python:

```python
>>> import requests
>>> requests.get("https://blogs.sciencemag.org/pipeline/feed")
<Response [403]>
```

Playing around a bit more, adding in user agents, accepting cookies and following redirects, I eventually get back a page with a challenge that requires JS to run. This is the antithesis of how RSS should work!
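For reference, this is the kind of experiment I mean, sketched with just the standard library. The header values are only guesses at what a browser sends, not a known-good bypass, and the request is built but not sent:

```python
import urllib.request

# Pretend to be a normal browser. These values are illustrative guesses.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "application/rss+xml, application/xml;q=0.9, */*;q=0.8",
}
req = urllib.request.Request(
    "https://blogs.sciencemag.org/pipeline/feed", headers=headers
)

# urllib normalises header names; this is what would actually be sent.
print(req.get_header("User-agent"))
```

In my case none of this was enough: the server still wants JavaScript to run before it will hand over plain XML.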

OK, so to fix this I came upon RSSHub, which is a kind of RSS proxy: it parses sites that don't have RSS feeds and generates feeds for them. I saw that it has puppeteer support, so I'm hoping I can use it to bypass the anti-crawler tactics science.org is using.

Anyway, for now here is a docker-compose.yml for both Miniflux and RSSHub. What took me a while to figure out is that docker containers live in their own special network. So to subscribe to a selfhosted RSSHub feed from inside Miniflux you need to use a URL like "http://rsshub:1200/", where rsshub is the name of the service in the yaml file below.
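A small illustration of that substitution: only the scheme and host change, the route path stays the same. The helper name and the example route are mine, not part of either project:

```python
from urllib.parse import urlsplit, urlunsplit

def to_compose_internal(url: str, service: str = "rsshub", port: int = 1200) -> str:
    """Rewrite a public RSSHub route URL so Miniflux can reach it via
    Docker's internal DNS, where the compose service name resolves."""
    parts = urlsplit(url)
    return urlunsplit(("http", f"{service}:{port}", parts.path, parts.query, parts.fragment))

# e.g. a route copied from a public RSSHub instance:
print(to_compose_internal("https://rsshub.app/github/issue/DIYgod/RSSHub"))
# http://rsshub:1200/github/issue/DIYgod/RSSHub
```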

EDIT: I got it to work using puppeteer! For now the code is in a branch for which I’ll do a proper PR soon.

```yaml
version: '3'

services:
  miniflux:
    image: miniflux/miniflux:latest
    # build:
    #   context: .
    #   dockerfile: packaging/docker/alpine/Dockerfile
    container_name: miniflux
    restart: always
    healthcheck:
      test: ["CMD", "/usr/bin/miniflux", "-healthcheck", "auto"]
    ports:
      - "8889:8080"
    depends_on:
      - rsshub
      - db
    environment:
      - DATABASE_URL=postgres://miniflux:secret@db/miniflux?sslmode=disable
      - RUN_MIGRATIONS=1
      - CREATE_ADMIN=1
      - ADMIN_USERNAME=admin
      - ADMIN_PASSWORD=test123

  db:
    image: postgres:15
    environment:
      - POSTGRES_USER=miniflux
      - POSTGRES_PASSWORD=secret
    volumes:
      - miniflux-db:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "miniflux"]
      interval: 10s
      start_period: 30s

  rsshub:
    # two ways to enable puppeteer:
    # * comment out marked lines, then use this image instead: diygod/rsshub:chromium-bundled
    # * (consumes more disk space and memory) leave everything unchanged
    image: diygod/rsshub
    restart: always
    ports:
      - '1200:1200'
    environment:
      NODE_ENV: production
      CACHE_TYPE: redis
      REDIS_URL: 'redis://redis:6379/'
      PUPPETEER_WS_ENDPOINT: 'ws://browserless:3000'  # marked
    depends_on:
      - redis
      - browserless  # marked

  browserless:  # marked
    image: browserless/chrome  # marked
    restart: always  # marked
    ulimits:  # marked
      core:  # marked
        hard: 0  # marked
        soft: 0  # marked

  redis:
    image: redis:alpine
    restart: always
    volumes:
      - redis-data:/data

volumes:
  miniflux-db:
  redis-data:
```

## Backup RSS feed list

I put a small script in the repo to back up my feed list.

```shell
python -m venv ~/miniflux_python_env
source ~/miniflux_python_env/bin/activate
pip install pyyaml
```
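The script itself lives in the repo, but the core idea can be sketched against Miniflux's API: it exposes a `GET /v1/export` endpoint that returns your feeds as OPML, authenticated with an `X-Auth-Token` header. The host and token below are placeholders:

```python
import urllib.request

def build_export_request(base_url: str, api_token: str) -> urllib.request.Request:
    """Build (but don't send) a request for Miniflux's OPML export endpoint.
    base_url and api_token are placeholders for your own instance and key."""
    return urllib.request.Request(
        f"{base_url}/v1/export",
        headers={"X-Auth-Token": api_token},
    )

req = build_export_request("http://localhost:8889", "REPLACE_ME")
# Sending it would look like:
#   opml = urllib.request.urlopen(req).read()
#   open("feeds.opml", "wb").write(opml)
print(req.full_url)  # http://localhost:8889/v1/export
```

The resulting OPML file can be re-imported into Miniflux, or any other reader, if the database is ever lost.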

I’ve collected the code for the docker containers and config together into this repo.

## Backup everything to Google Drive

Use rclone.
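A rough sketch of what that looks like, assuming you've already run `rclone config` once to create a Google Drive remote (the remote name `gdrive` and the paths are placeholders):

```shell
# Dump the Postgres database out of the running container
# (service names match the docker-compose.yml above)...
docker compose exec -T db pg_dump -U miniflux miniflux > miniflux.sql

# ...then copy the dump up to Drive. Run this from cron for nightly backups.
rclone copy miniflux.sql gdrive:miniflux-backup/
```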