关于本站公开数据提供 SQLite 下载的讨论

shared_user · May 27, 2024, 3:56pm

全站拟公开文本数据 dump，为 SQLite 格式，下载 https://sqlitebrowser.org/ 即可在本地浏览，总共有 4 个表：

导出使用的脚本：

github.com/xjtumen/discourse_public_import

download_topics.rb

main

require "json"
require "net/http"
require "uri"
require "sqlite3"
require "mini_sql"
require "cgi"

begin
  config = JSON.parse(File.read("config.json"))
rescue StandardError
  puts "Please create a file called .creds with your API KEY and USERNAME"
end

# Replace these values with your Discourse instance details
DISCOURSE_DOMAIN = config["domain"]
API_KEY = config["api_key"]
API_USERNAME = config["api_username"]
TOPIC_QUERY_ID = config["topics_query_id"]
USER_QUERY_ID = config["users_query_id"]
POST_QUERY_ID = config["posts_query_id"]

This file has been truncated. show original

~~由于有門友投票反对，此事暂时搁置。~~

本站和商业软件最大的区别之一是数据是用户完全所有的（民有、民治、民享），每个人都能在 https://xjtu.app/my/preferences/account 导出自己的全部数据（包括但不限于全部话题、回复、喜欢、书签、登陆记录）。

本站活跃度并不高，很大的原因是本来就冇多少东西好聊的，聚焦于校园生活以及本校的话题被聊完了就冇乜好聊的了。这就像情侣早期无话不说，到了后期成为夫妻以后就慢慢失去了聊天的兴趣一样。

将本站的公开数据公开有以下可能的好处：

用户可以用本地的数据库软件直接浏览原始数据，稍微懂点 Docker 的童鞋也可以在本地跑一个 Discourse 从而实现本机上的高速浏览（我在本地开发环境就曾导入过生产环境的数据），更有兴趣锻炼一下自己的技术的童鞋可以利用这个数据源开发一套别样的数据浏览乃至交互界面（如果只是浏览的话不涉及 CRUD 只需要会前端技术，哪怕不懂前端，做成静态网页放到 github.io 上也很容易，如果能支持交互的全栈软件就有点了，如果做成 P2P 的分布式软件就更了）。
有望将现有的知识凝炼到从前年开始暴热的 ChatGPT 首当其冲的大语言模型中，2023 年 9 月数据量不太多的时候有人做过尝试（referenced below），从而做成后最喜闻乐见的网络使用方式：客服。（用提问 prompt 代替搜索 search）
知识图谱（Knowledge graph）亦是一个对此有帮助的技术。
试验：训练一个解答下交学生疑问的 Chatbot
万一各种想到或想不到的原因导致网站（可以称之为門門吗）去世，用户不失去自己的东西。从而不用在此刻惧怕丢失一切东西。
有关方面可以对数据进行研究，例如《上网的西安交通大学学生的精神状态研究》《时间/季节对网络上“打交”相关话题的流行度的影响》… …

公开数据是指全部话题和回复，至少除去了

用户隐私信息（至少包括登陆记录，IP，邮箱）
聊天（DM）内容（包含群聊（channel），私聊）
私信（PM）内容。

公开公开数据的形式现在可以想到的是：

topic & posts > SQLite

Sam Saffron 大佬开源了现成的工具：GitHub - SamSaffron/discourse_public_import: A tool to import all public posts on discourse

风险：
提供便捷的全部公开数据下载的做法有风险，有人可以做个网站让全部数据一样，从而仿冒本站。考虑到交大門三个字以及 logo 也没申请 trademark，仿冒的话也没办法起诉

是否反对公开本站公开数据？

我反对

0 voters

lwe · May 27, 2024, 5:04pm

hoist · May 28, 2024, 12:39am

不反对。数据很少应该也不至于用来干坏事

sola · May 28, 2024, 3:33pm

不反对，看起来都是能通过网页浏览获取的公开数据，真要有坏人自己整个爬虫也就爬出来了。

shared_user · May 28, 2024, 3:36pm

现在纯文档 dump 完就 10M 的一个 SQLite，这下真的可以做一个离线交大門了

pangbo · May 28, 2024, 3:37pm

这个得谨慎，许多看上去无害的表存了很多隐私数据

包括且不限于 plugin_store_rows、post_search_data、push_subscriptions、topic_users、topic_views、post_timings、topic_link_clicks

全是重量级

shared_user · May 28, 2024, 3:39pm

感谢 pangbo 大佬提醒，我目前只导出了公开类别的四个表的部分列，看了一下没啥隐私信息。

查了一下，这些表里确实有挺多信息的，这些没有必要导出。有些有 IP，有些甚至有消息的推送地址。

pangbo · May 28, 2024, 3:50pm

plugin_store_rows 部分插件会把用户数据存这个表里，与插件有关
~~post_search_data~~ search_log 会记录用户的搜索关键词记录
push_subscriptions 会记录用户推送的相关信息（可能包含认证信息）
topic_users 包含用户的书签、话题阅读时间和进度
post_timings 最重量级的一个表，记录了每个用户阅读每个 post 的时间
topic_link_clicks 记录了用户点击链接的记录

posts、topics、users 这几个表用白名单过滤一下列，感觉问题不大。另外有可能会导出被删除的内容，如果介意的话可以过滤一下 deleted_at。
likes 是从 post_actions 里过滤出来的吗，水源没有这个表

shared_user · May 28, 2024, 3:54pm

感谢详细的解读，likes 是 post_actions 重名了个名

john · May 28, 2024, 4:17pm

真是民民又主主啊

shared_user · May 28, 2024, 5:03pm

我们交大门一直是全过程门友民主

戴跃进 · December 5, 2024, 5:13am

支持本站在对数据匿名处理后公开。本站用户主要是大学生群体，提供的数据和训练的模型能为把握青年思想和网络传播规律提供参考，得到的成果可能为引导大学生的正确思想，舆情管理和维护社会稳定做出贡献。

shared_user · May 17, 2025, 7:13am

用 Claude3.7Sonnt 把 SQL 查询转成了 rake task, 直接导出为 json

lib/tasks/custom.rake

# frozen_string_literal: true
desc 'add external_id for all topics and posts'
task 'custom:add-external-id', [:override_existing] => :environment do |task, args|
  # use `rake 'custom:add-external-id[1]'` to override topics and posts' existing external_id
  # ensure the first post inside topic has the same external_id with the topic
  require 'parallel'
  require 'securerandom'
  Parallel.each(Post.all, progress: "Posts") do |post|
    if args[:override_existing].present? || post.external_id.blank?
      post.update_column(:external_id, SecureRandom.alphanumeric(SiteSetting.external_id_length))
      if post.post_number == 1
        topic = Topic.find(post.topic_id)
        topic.update_column(:external_id, post.external_id)
      end
    end
  end
end

# rake custom:export-users > users.json
# rake custom:export-users > /home/discourse/public-export/users.json
desc "Export all users without sensitive data"
task "custom:export-users" => :environment do
  require 'json'

  a = []
  User.find_each(batch_size: 100_000)  do |user|
    payload = {  id: user.id, username: user.username, name: user.name,
                 admin:user.admin, moderator:user.moderator, trust_level: user.trust_level,
                 avatar_template: user.avatar_template, title: user.title,
                 groups: user.groups.map{|i| i.name}, locale: user.locale,
                 silenced_till: user.silenced_till , staged: user.staged, active: user.active,
                 created_at:user.created_at.to_i, updated_at:user.updated_at.to_i }
    a.push payload
  end
  puts a.to_json
end

# rake "custom:export-posts[0,/home/discourse/public-export/posts.json]"
desc "Export posts data from non-restricted categories"
task "custom:export-posts", [:min_id, :output_file] => :environment do |_, args|
  require 'json'

  min_id = (args[:min_id] || 0).to_i
  output_file = args[:output_file]

  puts "Exporting posts with ID > #{min_id}..."

  base_scope = Post.joins(:topic)
                   .joins("JOIN categories c ON c.id = topics.category_id")
                   .where("NOT c.read_restricted")
                   .where("topics.deleted_at IS NULL")
                   .where("posts.deleted_at IS NULL")
                   .where(post_type: 1)
                   .where(hidden: false)
                   .where("posts.id > ?", min_id)

  # Count without the selected columns
  total = base_scope.count
  puts "Found #{total} posts to export"

  posts_data = []

  # Now add the select and order for the actual data retrieval
  scope = base_scope.select("posts.id, posts.raw, posts.cooked, posts.post_number,
                  posts.topic_id, posts.user_id, posts.created_at,
                  posts.updated_at, posts.reply_to_post_number,
                  posts.reply_to_user_id, posts.reply_count,
                  topics.like_count, topics.word_count")
                    .order("posts.id ASC")

  progress = 0

  scope.find_each(batch_size: 1000) do |post|
    posts_data << {
      id: post.id,
      raw: post.raw,
      cooked: post.cooked,
      post_number: post.post_number,
      topic_id: post.topic_id,
      user_id: post.user_id,
      created_at: post.created_at,
      updated_at: post.updated_at,
      reply_to_post_number: post.reply_to_post_number,
      reply_to_user_id: post.reply_to_user_id,
      reply_count: post.reply_count,
      like_count: post.like_count,
      word_count: post.word_count
    }

    progress += 1
    if progress % 1000 == 0
      puts "Processed #{progress}/#{total} posts"
    end
  end

  result = posts_data.to_json

  if output_file
    File.write(output_file, result)
    puts "Exported data to #{output_file}"
  else
    puts result
  end

  puts "Export completed. Total posts: #{posts_data.size}"
end


# rake "custom:export-likes[0,/home/discourse/public-export/likes.json]"
desc "Export post likes data from non-restricted categories"
task "custom:export-likes", [:min_id, :output_file] => :environment do |_, args|
  require 'json'

  min_id = (args[:min_id] || 0).to_i
  output_file = args[:output_file]

  puts "Exporting post likes with ID > #{min_id}..."

  # First, get the qualifying post IDs
  qualifying_posts = Post.joins(:topic)
                         .joins("JOIN categories c ON c.id = topics.category_id")
                         .where("NOT c.read_restricted")
                         .where("topics.deleted_at IS NULL")
                         .where("posts.deleted_at IS NULL")
                         .where(post_type: 1)
                         .where(hidden: false)
                         .pluck(:id)

  puts "Found #{qualifying_posts.size} qualifying posts"

  # Now get the likes for these posts
  base_scope = PostAction
                 .where(post_action_type_id: 2) # 2 is the 'like' action type
                 .where(deleted_at: nil)
                 .where("id > ?", min_id)
                 .where(post_id: qualifying_posts)

  # Count total likes that match criteria
  total = base_scope.count
  puts "Found #{total} likes to export"

  # Exit early if nothing to export
  if total == 0
    puts "No likes to export."
    return
  end

  # Get the data with ordering
  likes_scope = base_scope
                  .select(:id, :post_id, :user_id, :created_at)
                  .order(id: :asc)

  likes_data = []
  progress = 0

  # Process in batches to avoid memory issues
  likes_scope.find_each(batch_size: 1000) do |like|
    likes_data << {
      post_id: like.post_id,
      user_id: like.user_id,
      created_at: like.created_at
    }

    progress += 1
    if progress % 1000 == 0
      puts "Processed #{progress}/#{total} likes"
    end
  end

  result = likes_data.to_json

  if output_file
    File.write(output_file, result)
    puts "Exported data to #{output_file}"
  else
    puts result
  end

  puts "Export completed. Total likes: #{likes_data.size}"
end


# rake "custom:export-topics[0,/home/discourse/public-export/topics.json]"
desc "Export topics data from non-restricted categories"
task "custom:export-topics", [:min_id, :output_file] => :environment do |_, args|
  require 'json'

  min_id = (args[:min_id] || 0).to_i
  output_file = args[:output_file]

  puts "Exporting topics with ID > #{min_id}..."

  # Create base scope for the query
  base_scope = Topic.joins(:category)
                    .where("NOT categories.read_restricted")
                    .where("topics.deleted_at IS NULL")
                    .where(archetype: 'regular')
                    .where("topics.id > ?", min_id)

  # Count records without including all the selected fields
  total = base_scope.count
  puts "Found #{total} topics to export"

  # Exit early if nothing to export
  if total == 0
    puts "No topics to export."
    return
  end

  topics_data = []
  progress = 0

  # Process in batches to avoid memory issues
  base_scope.includes(:tags)
            .order(id: :asc)
            .find_each(batch_size: 1000) do |topic|
    # Get tags as a comma-separated string
    topics_data << {
      id: topic.id,
      category_name: topic.category.name,
      category_id: topic.category_id,
      title: topic.title,
      excerpt: topic.excerpt,
      created_at: topic.created_at,
      last_posted_at: topic.last_posted_at,
      updated_at: topic.updated_at,
      views: topic.views,
      posts_count: topic.posts_count,
      like_count: topic.like_count,
      user_id: topic.user_id,
      last_post_user_id: topic.last_post_user_id,
      tags: topic.tags.map{|i| i.name},
    }

    progress += 1
    if progress % 1000 == 0
      puts "Processed #{progress}/#{total} topics"
    end
  end

  result = topics_data.to_json

  if output_file
    File.write(output_file, result)
    puts "Exported data to #{output_file}"
  else
    puts result
  end

  puts "Export completed. Total topics: #{topics_data.size}"
end

# rake "custom:export-all[0,/home/discourse/public-export]"
desc "Export all data (topics, posts, likes, users) from non-restricted categories"
task "custom:export-all", [:min_id, :output_dir] => :environment do |_, args|
  min_id = args[:min_id] || 0
  output_dir = args[:output_dir] || "/home/discourse/public-export"

  # Ensure output directory exists
  FileUtils.mkdir_p(output_dir) unless Dir.exist?(output_dir)

  # Define output file paths
  topics_file = File.join(output_dir, "topics.json")
  posts_file = File.join(output_dir, "posts.json")
  likes_file = File.join(output_dir, "likes.json")
  users_file = File.join(output_dir, "users.json")

  puts "Beginning export of all data to #{output_dir}..."

  # Export topics
  puts "\n=== Exporting Topics ==="
  Rake::Task["custom:export-topics"].invoke(min_id, topics_file)
  Rake::Task["custom:export-topics"].reenable

  # Export posts
  puts "\n=== Exporting Posts ==="
  Rake::Task["custom:export-posts"].invoke(min_id, posts_file)
  Rake::Task["custom:export-posts"].reenable

  # Export likes
  puts "\n=== Exporting Likes ==="
  Rake::Task["custom:export-likes"].invoke(min_id, likes_file)
  Rake::Task["custom:export-likes"].reenable

  # Export users
  puts "\n=== Exporting Users ==="
  # Redirect stdout to file since the users task prints to stdout
  original_stdout = $stdout
  File.open(users_file, 'w') do |f|
    $stdout = f
    Rake::Task["custom:export-users"].invoke
    Rake::Task["custom:export-users"].reenable
  end
  $stdout = original_stdout

  puts "\n=== Export Complete ==="
  puts "Topics exported to: #{topics_file}"
  puts "Posts exported to: #{posts_file}"
  puts "Likes exported to: #{likes_file}"
  puts "Users exported to: #{users_file}"
end

Topic		Replies	Views
将 Discourse 嵌入到其他网站作为评论系统网站 discourse	0	47	August 21, 2023
gpt-4o 写高考作文 2024 版 - 北京爷谈笑风生	7	55	June 9, 2024
西安交大生存指南引流&宣传通用&CS攻略 survive-cs , survive	15	702	July 9, 2024
为什么平平淡淡才是你站的常态网站	19	257	June 19, 2025
ChatGPT：揭示某些计算机本科和顶尖本科之间的显著差异深入交流	9	628	July 31, 2023
发现自己的 paper 被牛组/公司撞车是什么感觉谈笑风生	9	239	March 28, 2024
CS162 学习记录学术专区	85	604	November 4, 2024
你未来的理想生活是什么样的谈笑风生	270	231	July 6, 2025
Release 模式下加载 cow.dae 闪退计算机图形学	2	22	December 18, 2024
【提问】关于水硕问答	7	150	March 8, 2024

关于本站公开数据提供 SQLite 下载的讨论

Related topics