Notes on an unconventional Discourse deployment

why

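Discourse officially supports only one deployment method: building a container from the mix of shell, YAML, and Ruby code in the discourse_docker repository.
As a result, every code change costs 20 minutes of rebuilding the container, which is a real pain.

This thread records my unconventional way of deploying it.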

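Why run Discourse in a way that is not officially supported?

Discourse's only deployment method does not let you change the URL of the source code,
so to change the default behavior I can only write a plugin and spend 20 minutes rebuilding the container. I cannot fork the repository and edit the source directly, so the iteration cycle is long and development is extremely inefficient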

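Discourse is awesome software with a modern interface, but the way it is installed and deployed really is from 10 years ago:

Founder Sam wrote pups, a program made of Ruby scripts, more than ten years ago to build the container

Broadcom Bitnami maintains another deployment method, but it is not much better than the official one.

The upstream Gemfile does not even list rails, only their own rails_multisite

So when I wanted to debug in RubyMine, it told me rails was not installed? :sweat_smile: Yet running rails s from the command line worked fine, which had me thinking RubyMine was buggy or misconfigured, or rvm was buggy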

procedures

Running in development mode

The command for running the Rails app in the development environment:

ALLOW_EMBER_CLI_PROXY_BYPASS=1 DISCOURSE_DEV_LOG_LEVEL=warn DISCOURSE_ENABLE_CORS=true RAILS_DEVELOPMENT_HOSTS=xjtu.app RAILS_ENV=development HOST_URL=xjtu.app DISCOURSE_HOSTNAME=xjtu.app NUM_WEBS=8 rails s

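Rails distinguishes between development and production modes, and each uses a different configuration. Development mode, for example, has no caching and no asset minification, so the site is extremely slow to load, absurdly so: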

https://pagespeed.web.dev/analysis/https-xjtu-app/g5sfcvavle?form_factor=mobile

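Copying out of Docker

I originally wanted to compare the options in config/environments/development.rb and production.rb, but I do not really understand asset pipeline pre-compilation, so I gave up on learning it. Time for the big hammer: copy discourse out of the container, find the startup command, and run it directly outside the container

First, look at the container's entrypoint with ./launcher start-cmd webxj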

  • true run --shm-size=512m --link dataxj:dataxj -d --restart=always -e LANG=en_US.UTF-8 -e RAILS_ENV=production … --name webxj -t -v /var/discourse/shared/webxj:/shared … local_discourse/webxj /sbin/boot

Inspect /sbin/boot

root@mnz-webxj:/var/www/discourse# cat /sbin/boot
#!/bin/bash
# we use this to boot up cause runit will not handle TERM and will not exit when done

shutdown() {
  echo Shutting Down
  /etc/runit/3
  ls /etc/service | SHELL=/bin/sh parallel sv force-stop {}
  kill -HUP $RUNSVDIR
  wait $RUNSVDIR

  # give stuff a bit of time to finish
  sleep 0.1

  ORPHANS=`ps -eo pid | grep -v PID  | tr -d ' ' | grep -v '^1$'`
  SHELL=/bin/bash parallel 'timeout 5 /bin/bash -c "kill {} && wait {}" || kill -9 {}' ::: $ORPHANS 2> /dev/null
  exit
}

/etc/runit/1 || exit $?
/etc/runit/2&
RUNSVDIR=$!
echo "Started runsvdir, PID is $RUNSVDIR"
trap shutdown SIGTERM SIGHUP
wait $RUNSVDIR

shutdown

Inside the container, processes are managed with sv. Inspect it with cat /etc/service/unicorn/run

#!/bin/bash
exec 2>&1
# redis
# postgres
cd /var/www/discourse
chown -R discourse:www-data /shared/log/rails
PRECOMPILE_ON_BOOT=0
if [[ -z "$PRECOMPILE_ON_BOOT" ]]; then
  PRECOMPILE_ON_BOOT=1
fi
if [ -f /usr/local/bin/create_db ] && [ "$CREATE_DB_ON_BOOT" = "1" ]; then /usr/local/bin/create_db; fi;
if [ "$MIGRATE_ON_BOOT" = "1" ]; then su discourse -c 'bundle exec rake db:migrate'; fi
if [ "$PRECOMPILE_ON_BOOT" = "1" ]; then SKIP_EMBER_CLI_COMPILE=1 su discourse -c 'bundle exec rake assets:precompile'; fi
LD_PRELOAD=$RUBY_ALLOCATOR HOME=/home/discourse USER=discourse exec thpoff chpst -u discourse:www-data -U discourse:www-data bundle exec config/unicorn_launcher -E production -c config/unicorn.conf.rb
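
The run script above gates each boot step on an environment variable. It also assigns PRECOMPILE_ON_BOOT=0 right before the emptiness check, so in this container the precompile step never runs regardless of the environment. A minimal Ruby sketch of the same decision logic follows; `boot_steps` is a hypothetical helper (not part of Discourse), and the test for /usr/local/bin/create_db is passed in as a flag rather than read from disk.

```ruby
# Hypothetical helper mirroring the conditionals in /etc/service/unicorn/run.
# It only reports which steps would run; it executes nothing.
def boot_steps(env, create_db_present: false)
  # The script assigns PRECOMPILE_ON_BOOT=0 itself, so the environment value
  # is never consulted and the "-z" fallback to 1 never fires.
  precompile_on_boot = "0"

  steps = []
  steps << "create_db" if create_db_present && env["CREATE_DB_ON_BOOT"] == "1"
  steps << "db:migrate" if env["MIGRATE_ON_BOOT"] == "1"
  steps << "assets:precompile" if precompile_on_boot == "1"
  steps << "unicorn_launcher" # the final exec always happens
  steps
end

puts boot_steps({ "MIGRATE_ON_BOOT" => "1" }).inspect
```

This is why, when running outside the container, migrations and asset precompilation have to be triggered by hand.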

Following Install Discourse on Ubuntu or Debian for Development - Developer Guides - Discourse Meta, install ImageMagick, oxipng, jhead

Install pnpm, rvm and ruby

pnpm env use --global lts
rvm install 3.3
rvm use 3.3 --default
## check whether YJIT is supported
ruby --yjit -v
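
The same check can be made from inside a running Ruby process, which may help confirm later that a setting such as RUBY_YJIT_ENABLE=1 actually took effect. A small sketch, assuming Ruby 3.1 or newer where `RubyVM::YJIT.enabled?` is available; `yjit_status` is a hypothetical helper name.

```ruby
# Reports whether this Ruby build ships YJIT and whether it is active
# in the current process.
def yjit_status
  return :not_built_in unless defined?(RubyVM::YJIT)

  RubyVM::YJIT.enabled? ? :enabled : :available_but_off
end

puts "#{RUBY_DESCRIPTION} -> YJIT #{yjit_status}"
```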

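Make a copy of the redis and PG data, copy the environment variables from start-cmd into .zshrc, create a database user, configure config/discourse.conf, then copy the bundle line of the startup command and launch with bundle exec config/unicorn_launcher -E production -c config/unicorn.conf.rb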

# pnpm
export PNPM_HOME="/home/discourse/.local/share/pnpm"
case ":$PATH:" in
  *":$PNPM_HOME:"*) ;;
  *) export PATH="$PNPM_HOME:$PATH" ;;
esac
# pnpm end
alias npm='pnpm'
alias npx='pnpx'
export PATH="$PATH:$HOME/.rvm/bin"

export RAILS_ENV=production
export UNICORN_WORKERS=6
export UNICORN_SIDEKIQS=1
export RUBY_GC_HEAP_GROWTH_MAX_SLOTS=40000
export RUBY_GC_HEAP_INIT_SLOTS=400000
export RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR=1.5
export RUBY_YJIT_ENABLE=1
export RUBY_CONFIGURE_OPTS="--enable-yjit"
export DISCOURSE_HOSTNAME=xjtu.app
...
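
The environment variables in the listing above came from the `-e KEY=VALUE` flags printed by start-cmd. Copying them by hand is error-prone, so here is a rough sketch of automating it, assuming the command line has been saved as a string. `docker_env_exports` is a hypothetical helper and the sample string is made up for illustration.

```ruby
require "shellwords"

# Hypothetical helper: turn the `-e KEY=VALUE` flags of a `docker run`
# command line into shell `export` lines suitable for .zshrc.
def docker_env_exports(command_line)
  words = Shellwords.split(command_line)
  exports = []
  # Look at each adjacent pair of words and keep the ones shaped like
  # "-e" followed by "KEY=VALUE".
  words.each_cons(2) do |flag, value|
    next unless flag == "-e" && value.include?("=")

    key, val = value.split("=", 2)
    exports << "export #{key}=#{Shellwords.escape(val)}"
  end
  exports
end

sample = "run --shm-size=512m -e LANG=en_US.UTF-8 -e RAILS_ENV=production --name webxj"
puts docker_env_exports(sample)
```

Values are shell-escaped so that anything containing spaces or quotes survives being pasted into a shell startup file.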

https://www.postgresql.org/download/linux/ubuntu/

apt install postgresql postgresql-client-17 postgresql-common  postgresql-contrib postgresql-client-common postgresql-server-dev-17 postgresql-17 postgresql-17-pgvector libpq-dev
sudo -u postgres createuser -s discourse
sudo -u postgres createdb discourse
$ sudo -u postgres psql discourse
psql>
ALTER USER discourse WITH PASSWORD 'xxx';
CREATE EXTENSION hstore;
CREATE EXTENSION pg_trgm;
CREATE EXTENSION plpgsql;
CREATE EXTENSION unaccent;
CREATE EXTENSION vector;
$ gunzip < dump.sql.gz | psql discourse

dump.sql.gz comes from an unpacked Discourse backup. The container ran PG13, and the import into the freshly installed PG17 went surprisingly smoothly
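
One caveat about the extension list: plpgsql is normally already present in a freshly created database, so a plain CREATE EXTENSION for it tends to report that the extension already exists. A small sketch that generates rerunnable statements using PostgreSQL's `IF NOT EXISTS` clause; `extension_sql` and the `EXTENSIONS` constant are hypothetical names, and the output is meant to be piped into psql.

```ruby
# Extensions taken from the psql session above.
EXTENSIONS = %w[hstore pg_trgm plpgsql unaccent vector].freeze

# Hypothetical helper: build idempotent CREATE EXTENSION statements.
def extension_sql(names = EXTENSIONS)
  names.map { |name| "CREATE EXTENSION IF NOT EXISTS #{name};" }.join("\n")
end

puts extension_sql
```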

Then copy the nginx setup out of the container, reconfigure the cache and file directories, and it starts up successfully

cd /shared/
chown discourse:www-data backups tmp uploads

upgrade

For reference:

chown -R discourse:discourse /var/www/discourse/     
chown -R discourse:www-data /var/www/discourse/public
su - discourse
cd /var/www/discourse
git stash
git pull
git checkout tests-passed 
rm lib/tasks/custom.rake db/migrate/20241213085000_add_external_id_to_posts.rb
git apply mypatch20241225v1.patch
# LOAD_PLUGINS=0 bundle exec rake plugin:pull_compatible_all
cd plugins
for plugin in *
do
    echo $plugin
    pushd ${plugin}
    git pull
    popd
done
cd ../
# may need this when migrating to another system
# rm plugins/*/gems -r
bundle install && pnpm i && bundle exec rake db:migrate  && bundle exec rake assets:precompile
pumactl phased-restart
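
One fragile spot in the script above is the `for plugin in *` loop: it visits every entry under plugins/, and for a plain file or a directory without its own .git (for example, plugins that ship inside the main repository), `git pull` would act on the main Discourse checkout instead. A sketch of selecting only real git checkouts first; `git_plugin_dirs` is a hypothetical helper and the demo directory names are made up.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: list plugin directories that are their own git
# checkouts, skipping plain files and directories without a .git entry.
def git_plugin_dirs(plugins_root)
  Dir.children(plugins_root).sort
     .map { |name| File.join(plugins_root, name) }
     .select { |path| File.directory?(path) && File.exist?(File.join(path, ".git")) }
end

# Demo against a throwaway directory with made-up plugin names.
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, "plugin-a", ".git"))
  FileUtils.mkdir_p(File.join(root, "not-a-repo"))
  FileUtils.touch(File.join(root, "README.md"))
  puts git_plugin_dirs(root).map { |path| File.basename(path) }.inspect
end
```

`File.exist?` is used for .git rather than `File.directory?` because submodules and worktrees store .git as a file.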

Backing up directly with pg_dump

discour+ 1142358  0.0  0.0   2384  1408 pts/3    S+   13:54   0:00 sh -c PGPASSWORD='191549' pg_dump --schema=public -T public.pg_* --file='/var/www/discourse/tmp/backups/default/2024-12-14-135427/dump.sql.gz' --no-owner --no-privileges --verbose --compress=4 --host=localhost  --username=discourse discourse 2>&1
discour+ 1142359 94.1  0.1  32240 18388 pts/3    R+   13:54   0:12 /usr/lib/postgresql/17/bin/pg_dump --schema=public -T public.pg_* --file=/var/www/discourse/tmp/backups/default/2024-12-14-135427/dump.sql.gz --no-owner --no-privileges --verbose --compress=4 --host=localhost --username=discourse discourse

It is blowing up rather often :fearful:

Instability & switching from unicorn to puma

Damn, lately the site has been going unreachable at random times, which is bizarre. Checking the unicorn log:

==> ./log/unicorn.stdout.log <==
I, [2024-12-17T09:45:52.927751 #3990911]  INFO -- : worker=4 ready
I, [2024-12-17T09:45:54.856311 #3991183]  INFO -- : worker=5 ready
E, [2024-12-17T09:48:57.888043 #3989435] ERROR -- : Kill self supervisor is gone
I, [2024-12-17T09:48:57.924044 #3989435]  INFO -- : reaped #<Process::Status: pid 3990359 exit 0> worker=0
I, [2024-12-17T09:48:57.924329 #3989435]  INFO -- : reaped #<Process::Status: pid 3990445 exit 0> worker=1
I, [2024-12-17T09:48:57.924473 #3989435]  INFO -- : reaped #<Process::Status: pid 3990574 exit 0> worker=2
I, [2024-12-17T09:48:57.924616 #3989435]  INFO -- : reaped #<Process::Status: pid 3990736 exit 0> worker=3
I, [2024-12-17T09:48:57.924773 #3989435]  INFO -- : reaped #<Process::Status: pid 3990911 exit 0> worker=4
I, [2024-12-17T09:48:57.924926 #3989435]  INFO -- : reaped #<Process::Status: pid 3991183 exit 0> worker=5
I, [2024-12-17T09:48:57.925019 #3989435]  INFO -- : master complete
==> ./log/unicorn.stderr.log <==
unknown OID 556291: failed to recognize type of 'embeddings'. It will be treated as String.
unknown OID 556178: failed to recognize type of 'embedding'. It will be treated as String.
unknown OID 556291: failed to recognize type of 'embeddings'. It will be treated as String.

And yet
bundle exec config/unicorn_launcher -E production -c config/unicorn.conf.rb
is still alive
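
The "Kill self supervisor is gone" line appears to come from a watchdog thread in the master process that polls for the supervisor's PID. The puma config later in this thread carries the same check over, as `File.exist?("/proc/#{supervisor_pid}")`. A sketch of that liveness test in isolation; `supervisor_alive?` and its `proc_root` parameter are hypothetical, added so the check can be exercised against a temporary directory instead of the real /proc (which exists on Linux only).

```ruby
require "tmpdir"
require "fileutils"

# Returns true while a process with the given PID has an entry under
# proc_root. The check in the config files reads /proc directly.
def supervisor_alive?(pid, proc_root: "/proc")
  File.exist?(File.join(proc_root, pid.to_s))
end

# Demo with a fake /proc containing a single made-up PID.
Dir.mktmpdir do |fake_proc|
  FileUtils.mkdir_p(File.join(fake_proc, "4242"))
  puts supervisor_alive?(4242, proc_root: fake_proc) # prints true
  puts supervisor_alive?(4243, proc_root: fake_proc) # prints false
end
```

If this check returns false even once, the master sends itself TERM, which would match the master and all workers exiting together in the log above.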

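What is so bizarre about it?

  • If the crashes came from poor-quality code in my patch, the timing should not be this irregular: sometimes it stays up all day, sometimes it goes down several times within an hour
  • Memory and load are both normal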
free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       7.9Gi       4.5Gi       584Mi       3.9Gi       7.7Gi
Swap:          8.0Gi       6.5Gi       1.5Gi

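I decided to replace unicorn with puma, which is also the Rails web server that Heroku recommends.

Here is the problem: Discourse has no official documentation on using puma, and the two unicorn config/script files left me just as confused: config/unicorn_launcher and config/unicorn.conf.rb

I decided to brute-force it, starting from the simplest puma config file that Heroku recommends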

# frozen_string_literal: true

if ENV["RAILS_ENV"] == "production"
  # First, you need to change these below to your situation.
  APP_ROOT = ENV["APP_ROOT"] || "/var/www/discourse"
  num_workers = ENV["NUM_WEBS"].to_i > 0 ? ENV["NUM_WEBS"].to_i : 8

  # Second, you can choose how many threads that you are going to run at same time.
  workers num_workers
  threads 8, 32

  # Unless you know what you are changing, do not change them.
  # bind "unix://#{APP_ROOT}/tmp/sockets/puma.sock"

  stdout_redirect "#{APP_ROOT}/log/puma.log", "#{APP_ROOT}/log/puma.err.log"
  pidfile "#{APP_ROOT}/tmp/pids/puma.pid"
  state_path "#{APP_ROOT}/tmp/pids/puma.state"
  preload_app!

  port(ENV['PORT'] || 3000, "::")
  # Turn off keepalive support for better long tails response time with Router 2.0
  # Remove this line when https://github.com/puma/puma/issues/3487 is closed, and the fix is released
  enable_keep_alives(false) if respond_to?(:enable_keep_alives)

  rackup      DefaultRackup if defined?(DefaultRackup)
  environment ENV['RAILS_ENV'] || 'development'

  on_worker_boot do
    # Worker-specific setup for Rails 4.1 to 5.2, after 5.2 it's not needed
    # See: https://devcenter.heroku.com/articles/deploying-rails-applications-with-the-puma-web-server#on-worker-boot
    ActiveRecord::Base.establish_connection
  end
end
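
Two details of the config above are worth spelling out: how the worker count falls back to 8, and how many threads the settings allow in total. A sketch with hypothetical helper names; whether Discourse's database pool and PostgreSQL's connection limit are sized for that many threads is not something this thread establishes, so treat the number as an upper bound to keep in mind.

```ruby
# Mirrors `ENV["NUM_WEBS"].to_i > 0 ? ENV["NUM_WEBS"].to_i : 8`:
# a missing or non-numeric value converts to 0 and triggers the default.
def worker_count(env, default: 8)
  n = env["NUM_WEBS"].to_i
  n > 0 ? n : default
end

# Upper bound on request threads across all puma workers.
def max_concurrent_threads(workers:, max_threads:)
  workers * max_threads
end

workers = worker_count({}) # NUM_WEBS unset, so the default applies
puts max_concurrent_threads(workers: workers, max_threads: 32) # prints 256
```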

bundle exec puma -C config/puma.rb
or directly
puma -C config/puma.rb

Miraculously, it runs

I will keep watching its stability from here

Possible sources of instability at the moment:

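  1. Although I copied all the files, the database, and the various environment variables and configs out of Docker, I may have missed something
  2. Discourse officially supports only Postgres 13, and I jumped straight to 17
  3. YJIT is enabled in ruby
  4. The machine is arm64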

Found something fun: puma supports two kinds of seamless restart
I think future Discourse upgrades could achieve true zero downtime

Search Labs | AI Overview

The main difference between a phased restart and a hot restart in Puma is how they handle connections and when they finish:

  • Phased restart

Puma keeps processing requests with old workers while sending new requests to new workers. This results in zero downtime and no hanging requests. However, phased restarts can’t be used to upgrade gems loaded by the Puma master process.

  • Hot restart

Puma tries to finish current requests and then restart itself with new workers. This results in no lost requests, but there may be some extra latency for new requests while the process restarts.

Here are some other differences between phased and hot restarts:

  • Speed: Hot restarts often complete more quickly than phased restarts.

  • Database schema upgrades: Phased restarts require backwards-compatible database schema upgrades.

  • Mode: Hot restarts work in both single and cluster mode, while phased restarts work only in cluster mode.
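
The trade-offs quoted above can be folded into a small decision helper. This is only a sketch: `puma_restart_command` is a hypothetical name, and the inputs are judgments the operator has to supply. One extra input is included because Puma's DSL documentation notes that `preload_app!` conflicts with phased restart, and both config files in this thread call `preload_app!`.

```ruby
# Hypothetical helper: pick a pumactl subcommand from the constraints
# on phased restarts. Falls back to a hot restart when any one fails.
def puma_restart_command(cluster_mode:, preload_app:, master_gems_changed:, schema_backwards_compatible:)
  phased_ok = cluster_mode &&
              !preload_app &&
              !master_gems_changed &&
              schema_backwards_compatible
  phased_ok ? "pumactl phased-restart" : "pumactl restart"
end

puts puma_restart_command(
  cluster_mode: true,
  preload_app: true,
  master_gems_changed: false,
  schema_backwards_compatible: true
) # prints "pumactl restart"
```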

or

Today I had Claude convert Discourse's config/unicorn.conf.rb into config/puma.rb

AI is really awesome

# frozen_string_literal: true

require "fileutils"
require 'puma/acme'

discourse_path = File.expand_path(File.expand_path(File.dirname(__FILE__)) + "/../")

enable_logstash_logger = ENV["ENABLE_LOGSTASH_LOGGER"] == "1"
puma_stderr_path = "#{discourse_path}/log/puma.stderr.log"
puma_stdout_path = "#{discourse_path}/log/puma.stdout.log"

# Load logstash logger if enabled
if enable_logstash_logger
  require_relative "../lib/discourse_logstash_logger"
  FileUtils.touch(puma_stderr_path) if !File.exist?(puma_stderr_path)
  # Note: You may need to adapt the logger initialization for Puma
  log_formatter = proc do |severity, time, progname, msg|
    event = {
      "@timestamp" => Time.now.utc,
      "message" => msg,
      "severity" => severity,
      "type" => "puma"
    }
    "#{event.to_json}\n"
  end
else
  stdout_redirect puma_stdout_path, puma_stderr_path, true
end

# Number of workers (processes)
workers ENV.fetch("PUMA_WORKERS", 8).to_i

# Set the directory
directory discourse_path

# Bind to the specified address and port
bind ENV.fetch("PUMA_BIND", "tcp://#{ENV['PUMA_BIND_ALL'] ? '0.0.0.0' : '127.0.0.1'}:#{ENV.fetch('PUMA_PORT', 3000)}")

# or, use puma without reverse proxy
# require listening to privileged port
# `setcap 'cap_net_bind_service=ep' /home/discourse/.rvm/rubies/ruby-3.3.6/bin/ruby`

#bind 'tcp://0.0.0.0:80'
#plugin :acme
#acme_server_name 'xjtu.app'
#acme_tos_agreed true
#bind 'acme://0.0.0.0:443'

# PID file location
FileUtils.mkdir_p("#{discourse_path}/tmp/pids")
pidfile ENV.fetch("PUMA_PID_PATH", "#{discourse_path}/tmp/pids/puma.pid")

# State file - used by pumactl
state_path "#{discourse_path}/tmp/pids/puma.state"

# Environment-specific configuration
if ENV["RAILS_ENV"] == "production"
  # Production timeout
  worker_timeout 30
else
  # Development timeout
  worker_timeout ENV.fetch("PUMA_TIMEOUT", 60).to_i
end

# Preload application
preload_app!

# Handle worker boot and shutdown
before_fork do
  Discourse.preload_rails!
  Discourse.before_fork

  # Supervisor check
  supervisor_pid = ENV["PUMA_SUPERVISOR_PID"].to_i
  if supervisor_pid > 0
    Thread.new do
      loop do
        unless File.exist?("/proc/#{supervisor_pid}")
          puts "Kill self supervisor is gone"
          Process.kill "TERM", Process.pid
        end
        sleep 2
      end
    end
  end

  # Sidekiq workers
  sidekiqs = ENV["PUMA_SIDEKIQS"].to_i
  if sidekiqs > 0
    puts "starting #{sidekiqs} supervised sidekiqs"

    require "demon/sidekiq"
    Demon::Sidekiq.after_fork { DiscourseEvent.trigger(:sidekiq_fork_started) }
    Demon::Sidekiq.start(sidekiqs)

    if Discourse.enable_sidekiq_logging?
      Signal.trap("USR1") do
        # Delay Sidekiq log reopening
        sleep 1
        Demon::Sidekiq.kill("USR2")
      end
    end
  end

  # Email sync demon
  if ENV["DISCOURSE_ENABLE_EMAIL_SYNC_DEMON"] == "true"
    puts "starting up EmailSync demon"
    Demon::EmailSync.start(1)
  end

  # Plugin demons
  DiscoursePluginRegistry.demon_processes.each do |demon_class|
    puts "starting #{demon_class.prefix} demon"
    demon_class.start(1)
  end

  # Demon monitoring thread
  Thread.new do
    loop do
      begin
        sleep 60

        if sidekiqs > 0
          Demon::Sidekiq.ensure_running
          Demon::Sidekiq.heartbeat_check
          Demon::Sidekiq.rss_memory_check
        end

        if ENV["DISCOURSE_ENABLE_EMAIL_SYNC_DEMON"] == "true"
          Demon::EmailSync.ensure_running
          Demon::EmailSync.check_email_sync_heartbeat
        end

        DiscoursePluginRegistry.demon_processes.each(&:ensure_running)
      rescue => e
        Rails.logger.warn("Error in demon processes heartbeat check: #{e}\n#{e.backtrace.join("\n")}")
      end
    end
  end

  # Close Redis connection
  Discourse.redis.close
end

on_worker_boot do
  DiscourseEvent.trigger(:web_fork_started)
  Discourse.after_fork
end

# Worker timeout handling
worker_timeout 30

# Low-level worker options
threads 8, 32
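
The bind line in the config above packs three environment variables into a single expression. Unpacked into a helper it is easier to see which value wins. This sketch assumes PUMA_BIND_ALL is meant to select all IPv4 interfaces (0.0.0.0); `puma_bind_uri` is a hypothetical name, and the variable names are the ones the config already uses.

```ruby
# Hypothetical helper: compute the URI handed to puma's `bind`.
# Precedence: explicit PUMA_BIND, then PUMA_BIND_ALL, then loopback.
def puma_bind_uri(env)
  return env["PUMA_BIND"] if env["PUMA_BIND"]

  host = env["PUMA_BIND_ALL"] ? "0.0.0.0" : "127.0.0.1"
  port = env.fetch("PUMA_PORT", 3000)
  "tcp://#{host}:#{port}"
end

puts puma_bind_uri({})                                                  # prints tcp://127.0.0.1:3000
puts puma_bind_uri({ "PUMA_BIND_ALL" => "1", "PUMA_PORT" => "8080" })   # prints tcp://0.0.0.0:8080
```

Passing `ENV` itself instead of a Hash also works, since both respond to `[]` and `fetch`.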

After the change, the online-status display works correctly
Also, noting down a fun command:

User.all.each do |u| PresenceChannel.new(DiscourseWhosOnline::CHANNEL_NAME).present(user_id: u.id, client_id: "seen") end

To clear it:

PresenceChannel.clear_all!

I also noticed that Last IP is correct now, no longer ::ffff:127.0.0.1
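
For context, ::ffff:127.0.0.1 is the IPv4-mapped IPv6 form of 127.0.0.1, which suggests the app had been recording the local reverse proxy's address rather than the client's. Why the new config changes that is not established here. A short sketch with Ruby's standard `ipaddr` library that recognizes such addresses; `loopback_behind_mapping?` is a hypothetical helper name.

```ruby
require "ipaddr"

# True when the address is loopback, including the IPv4-mapped IPv6
# spelling that shows up when a proxy connects over a dual-stack socket.
def loopback_behind_mapping?(ip_string)
  IPAddr.new(ip_string).native.loopback?
end

addr = IPAddr.new("::ffff:127.0.0.1")
puts addr.ipv4_mapped?                                # prints true
puts addr.native.to_s                                 # prints 127.0.0.1
puts loopback_behind_mapping?("::ffff:127.0.0.1")     # prints true
```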
