Discussion: providing a SQLite download of this site's public data

We plan to publish a dump of the site's public text data in SQLite format. Download https://sqlitebrowser.org/ to browse it locally. There are 4 tables in total:

The script used for the export:

Since some 門友 voted against it, this is shelved for now :a_grinning_face_with_sweat:


One of the biggest differences between this site and commercial software is that the data is fully owned by the users (of the people, by the people, for the people): everyone can export all of their own data at https://xjtu.app/my/preferences/account (including but not limited to all topics, replies, likes, bookmarks, and login records).

This site is not very active, largely because there was never much to talk about in the first place: once the topics focused on campus life and our university have been exhausted, there is nothing left to discuss. It is like a couple who talk about everything early on, then gradually lose interest in chatting once they have settled into married life.

Publishing the site's public data could have the following benefits:

  1. Users can browse the raw data directly in a local database tool. Anyone who knows a bit of Docker can run a local Discourse for high-speed browsing on their own machine (I have imported production data into my local development environment before). Those keen to exercise their skills can build an alternative browsing or even interactive interface on top of this data source: read-only browsing involves no CRUD and only needs front-end skills, and even without front-end knowledge a static site on github.io is easy; a full-stack app with interaction would be quite :ox:, and a P2P distributed app even more :bison:.

  2. The existing knowledge could be distilled into large language models, with ChatGPT (wildly popular since the year before last) as the most prominent example. Someone tried this in September 2023 when there was not yet much data (referenced below). That would enable the :egg: :egg: (post-00s) generation's favorite mode of web use, customer service: asking with a prompt instead of searching.
    Knowledge graphs are another technique that could help here.
    Experiment: training a chatbot to answer XJTU students' questions

  3. If, for any reason foreseeable or not, the site (may we call it 門門?) dies, users will not lose their own content, and so need not fear losing everything.

  4. Interested parties can run studies on the data, e.g. "The Mental State of XJTU Students Who Go Online" or "The Effect of Time/Season on the Popularity of '打交'-Related Topics Online"…
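On point 4, once a dump exists such studies are a few lines of scripting. A minimal sketch in Ruby, assuming the exported topic rows carry `title` and `created_at` fields; the keyword and sample rows below are hypothetical:

```ruby
require 'json'
require 'time'

# Bucket exported topics by month, optionally filtered by a title keyword,
# e.g. to study how a theme's popularity varies with the season.
def monthly_topic_counts(topics, keyword: nil)
  topics.each_with_object(Hash.new(0)) do |t, counts|
    next if keyword && !t['title'].to_s.include?(keyword)
    counts[Time.parse(t['created_at']).strftime('%Y-%m')] += 1
  end
end

# Made-up rows for illustration; in practice: JSON.parse(File.read('topics.json'))
sample = [
  { 'title' => 'campus food review',  'created_at' => '2023-09-01T10:00:00Z' },
  { 'title' => 'campus wifi is slow', 'created_at' => '2023-09-15T10:00:00Z' },
  { 'title' => 'winter break plans',  'created_at' => '2024-01-02T10:00:00Z' },
]
puts monthly_topic_counts(sample, keyword: 'campus')  # {"2023-09"=>2}
```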

"Public data" means all topics and replies, excluding at least:

  • users' private information (at least login records, IP addresses, and email addresses)
  • chat (DM) content, including group chats (channels) and one-on-one chats
  • private message (PM) content.

The form of publication I can think of right now:

topic & posts > SQLite
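One dependency-free way to get from exported rows to a SQLite file is to emit SQL text and feed it to the sqlite3 CLI. A hedged sketch; the table layout, column set, and file names are illustrative assumptions, not the actual export format:

```ruby
require 'json'

# Hypothetical converter: turn exported post rows into SQL statements
# that `sqlite3 dump.db < posts.sql` can ingest.
def posts_to_sql(posts)
  out = ["CREATE TABLE IF NOT EXISTS posts (id INTEGER PRIMARY KEY, topic_id INTEGER, user_id INTEGER, raw TEXT);"]
  posts.each do |p|
    raw = p['raw'].to_s.gsub("'", "''")  # double single quotes to escape them in SQL
    out << "INSERT INTO posts VALUES (#{p['id'].to_i}, #{p['topic_id'].to_i}, #{p['user_id'].to_i}, '#{raw}');"
  end
  out.join("\n")
end

# Made-up row for illustration:
puts posts_to_sql([{ 'id' => 1, 'topic_id' => 1, 'user_id' => 2, 'raw' => "it's a post" }])
```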

Sam Saffron has open-sourced a ready-made tool: GitHub - SamSaffron/discourse_public_import: A tool to import all public posts on discourse

Risk:
Offering a convenient download of all public data carries a risk: someone could stand up a site with identical data and impersonate this one. Given that neither the three characters 交大門 nor the logo is a registered trademark, there would be no way to sue an impersonator anyway :sweat_smile:


Do you oppose publishing this site's public data?

  • I oppose :raised_hand:
0 voters
2 Likes

:+1:

No objection. With so little data it probably could not be used for anything bad anyway.

1 Like

No objection. It all looks like public data that can already be obtained by browsing the site; a determined bad actor could simply scrape it with a crawler anyway.

1 Like

Right now a pure-text dump is just a ~10 MB SQLite file. With this we really could build an offline 交大門.

This calls for caution: many tables that look harmless store a lot of private data.

Including but not limited to plugin_store_rows, post_search_data, push_subscriptions, topic_users, topic_views, post_timings, and topic_link_clicks.

Every one of them is heavyweight.

1 Like

Thanks to pangbo for the reminder. So far I have only exported selected columns of the four tables from public categories, and I checked that they contain no private information.

I looked into it, and those tables really do hold a lot of information; there is no need to export them. Some contain IPs, and some even contain push endpoints for notifications.

1 Like

plugin_store_rows: some plugins store user data in this table; it depends on the plugin
post_search_data / search_log: records users' search keyword history
push_subscriptions: records users' push-subscription details (possibly including credentials)
topic_users: contains users' bookmarks, topic read times, and reading progress
post_timings: the heaviest of them all; records how long each user spent reading each post
topic_link_clicks: records which links users clicked

Filtering the posts, topics, and users tables through a column whitelist should be fine. Note that deleted content may also get exported; filter on deleted_at if that is a concern.
Is likes filtered out of post_actions? 水源 has no table by that name.

1 Like

Thanks for the detailed breakdown. Yes, likes is just post_actions renamed.

1 Like

Truly democratic through and through :shuiyuan3: :grin_hugging_face:

1 Like

Our 交大門 has always practiced whole-process 門友 democracy.

2 Likes

I support publishing after the data is anonymized. This site's users are mainly university students, so the data, and models trained on it, could serve as a reference for understanding young people's thinking and the patterns of online communication; the results might contribute to guiding students toward correct thought, managing public opinion, and maintaining social stability.

Used Claude 3.7 Sonnet to convert the SQL queries into rake tasks that export directly to JSON

lib/tasks/custom.rake

# frozen_string_literal: true
desc 'add external_id for all topics and posts'
task 'custom:add-external-id', [:override_existing] => :environment do |task, args|
  # use `rake 'custom:add-external-id[1]'` to override topics and posts' existing external_id
  # ensure the first post inside topic has the same external_id with the topic
  require 'parallel'
  require 'securerandom'
  Parallel.each(Post.all, progress: "Posts") do |post|
    if args[:override_existing].present? || post.external_id.blank?
      post.update_column(:external_id, SecureRandom.alphanumeric(SiteSetting.external_id_length))
      if post.post_number == 1
        topic = Topic.find(post.topic_id)
        topic.update_column(:external_id, post.external_id)
      end
    end
  end
end

# rake custom:export-users > users.json
# rake custom:export-users > /home/discourse/public-export/users.json
desc "Export all users without sensitive data"
task "custom:export-users" => :environment do
  require 'json'

  a = []
  User.find_each(batch_size: 100_000) do |user|
    payload = { id: user.id, username: user.username, name: user.name,
                admin: user.admin, moderator: user.moderator, trust_level: user.trust_level,
                avatar_template: user.avatar_template, title: user.title,
                groups: user.groups.map { |g| g.name }, locale: user.locale,
                silenced_till: user.silenced_till, staged: user.staged, active: user.active,
                created_at: user.created_at.to_i, updated_at: user.updated_at.to_i }
    a.push payload
  end
  puts a.to_json
end

# rake "custom:export-posts[0,/home/discourse/public-export/posts.json]"
desc "Export posts data from non-restricted categories"
task "custom:export-posts", [:min_id, :output_file] => :environment do |_, args|
  require 'json'

  min_id = (args[:min_id] || 0).to_i
  output_file = args[:output_file]

  puts "Exporting posts with ID > #{min_id}..."

  base_scope = Post.joins(:topic)
                   .joins("JOIN categories c ON c.id = topics.category_id")
                   .where("NOT c.read_restricted")
                   .where("topics.deleted_at IS NULL")
                   .where("posts.deleted_at IS NULL")
                   .where(post_type: 1)
                   .where(hidden: false)
                   .where("posts.id > ?", min_id)

  # Count without the selected columns
  total = base_scope.count
  puts "Found #{total} posts to export"

  posts_data = []

  # Now add the select and order for the actual data retrieval
  scope = base_scope.select("posts.id, posts.raw, posts.cooked, posts.post_number,
                  posts.topic_id, posts.user_id, posts.created_at,
                  posts.updated_at, posts.reply_to_post_number,
                  posts.reply_to_user_id, posts.reply_count,
                  topics.like_count, topics.word_count")
                    .order("posts.id ASC")

  progress = 0

  scope.find_each(batch_size: 1000) do |post|
    posts_data << {
      id: post.id,
      raw: post.raw,
      cooked: post.cooked,
      post_number: post.post_number,
      topic_id: post.topic_id,
      user_id: post.user_id,
      created_at: post.created_at,
      updated_at: post.updated_at,
      reply_to_post_number: post.reply_to_post_number,
      reply_to_user_id: post.reply_to_user_id,
      reply_count: post.reply_count,
      like_count: post.like_count,
      word_count: post.word_count
    }

    progress += 1
    if progress % 1000 == 0
      puts "Processed #{progress}/#{total} posts"
    end
  end

  result = posts_data.to_json

  if output_file
    File.write(output_file, result)
    puts "Exported data to #{output_file}"
  else
    puts result
  end

  puts "Export completed. Total posts: #{posts_data.size}"
end


# rake "custom:export-likes[0,/home/discourse/public-export/likes.json]"
desc "Export post likes data from non-restricted categories"
task "custom:export-likes", [:min_id, :output_file] => :environment do |_, args|
  require 'json'

  min_id = (args[:min_id] || 0).to_i
  output_file = args[:output_file]

  puts "Exporting post likes with ID > #{min_id}..."

  # First, get the qualifying post IDs
  qualifying_posts = Post.joins(:topic)
                         .joins("JOIN categories c ON c.id = topics.category_id")
                         .where("NOT c.read_restricted")
                         .where("topics.deleted_at IS NULL")
                         .where("posts.deleted_at IS NULL")
                         .where(post_type: 1)
                         .where(hidden: false)
                         .pluck(:id)

  puts "Found #{qualifying_posts.size} qualifying posts"

  # Now get the likes for these posts
  base_scope = PostAction
                 .where(post_action_type_id: 2) # 2 is the 'like' action type
                 .where(deleted_at: nil)
                 .where("id > ?", min_id)
                 .where(post_id: qualifying_posts)

  # Count total likes that match criteria
  total = base_scope.count
  puts "Found #{total} likes to export"

  # Exit early if nothing to export (`next`, not `return`:
  # `return` inside a rake task block raises LocalJumpError)
  if total == 0
    puts "No likes to export."
    next
  end

  # Get the data with ordering
  likes_scope = base_scope
                  .select(:id, :post_id, :user_id, :created_at)
                  .order(id: :asc)

  likes_data = []
  progress = 0

  # Process in batches to avoid memory issues
  likes_scope.find_each(batch_size: 1000) do |like|
    likes_data << {
      post_id: like.post_id,
      user_id: like.user_id,
      created_at: like.created_at
    }

    progress += 1
    if progress % 1000 == 0
      puts "Processed #{progress}/#{total} likes"
    end
  end

  result = likes_data.to_json

  if output_file
    File.write(output_file, result)
    puts "Exported data to #{output_file}"
  else
    puts result
  end

  puts "Export completed. Total likes: #{likes_data.size}"
end


# rake "custom:export-topics[0,/home/discourse/public-export/topics.json]"
desc "Export topics data from non-restricted categories"
task "custom:export-topics", [:min_id, :output_file] => :environment do |_, args|
  require 'json'

  min_id = (args[:min_id] || 0).to_i
  output_file = args[:output_file]

  puts "Exporting topics with ID > #{min_id}..."

  # Create base scope for the query
  base_scope = Topic.joins(:category)
                    .where("NOT categories.read_restricted")
                    .where("topics.deleted_at IS NULL")
                    .where(archetype: 'regular')
                    .where("topics.id > ?", min_id)

  # Count records without including all the selected fields
  total = base_scope.count
  puts "Found #{total} topics to export"

  # Exit early if nothing to export (`next`, not `return`:
  # `return` inside a rake task block raises LocalJumpError)
  if total == 0
    puts "No topics to export."
    next
  end

  topics_data = []
  progress = 0

  # Process in batches to avoid memory issues
  base_scope.includes(:tags)
            .order(id: :asc)
            .find_each(batch_size: 1000) do |topic|
    # Get tags as a comma-separated string
    topics_data << {
      id: topic.id,
      category_name: topic.category.name,
      category_id: topic.category_id,
      title: topic.title,
      excerpt: topic.excerpt,
      created_at: topic.created_at,
      last_posted_at: topic.last_posted_at,
      updated_at: topic.updated_at,
      views: topic.views,
      posts_count: topic.posts_count,
      like_count: topic.like_count,
      user_id: topic.user_id,
      last_post_user_id: topic.last_post_user_id,
      tags: topic.tags.map{|i| i.name},
    }

    progress += 1
    if progress % 1000 == 0
      puts "Processed #{progress}/#{total} topics"
    end
  end

  result = topics_data.to_json

  if output_file
    File.write(output_file, result)
    puts "Exported data to #{output_file}"
  else
    puts result
  end

  puts "Export completed. Total topics: #{topics_data.size}"
end

# rake "custom:export-all[0,/home/discourse/public-export]"
desc "Export all data (topics, posts, likes, users) from non-restricted categories"
task "custom:export-all", [:min_id, :output_dir] => :environment do |_, args|
  min_id = args[:min_id] || 0
  output_dir = args[:output_dir] || "/home/discourse/public-export"

  # Ensure output directory exists
  FileUtils.mkdir_p(output_dir) unless Dir.exist?(output_dir)

  # Define output file paths
  topics_file = File.join(output_dir, "topics.json")
  posts_file = File.join(output_dir, "posts.json")
  likes_file = File.join(output_dir, "likes.json")
  users_file = File.join(output_dir, "users.json")

  puts "Beginning export of all data to #{output_dir}..."

  # Export topics
  puts "\n=== Exporting Topics ==="
  Rake::Task["custom:export-topics"].invoke(min_id, topics_file)
  Rake::Task["custom:export-topics"].reenable

  # Export posts
  puts "\n=== Exporting Posts ==="
  Rake::Task["custom:export-posts"].invoke(min_id, posts_file)
  Rake::Task["custom:export-posts"].reenable

  # Export likes
  puts "\n=== Exporting Likes ==="
  Rake::Task["custom:export-likes"].invoke(min_id, likes_file)
  Rake::Task["custom:export-likes"].reenable

  # Export users
  puts "\n=== Exporting Users ==="
  # Redirect stdout to file since the users task prints to stdout
  original_stdout = $stdout
  File.open(users_file, 'w') do |f|
    $stdout = f
    Rake::Task["custom:export-users"].invoke
    Rake::Task["custom:export-users"].reenable
  end
  $stdout = original_stdout

  puts "\n=== Export Complete ==="
  puts "Topics exported to: #{topics_file}"
  puts "Posts exported to: #{posts_file}"
  puts "Likes exported to: #{likes_file}"
  puts "Users exported to: #{users_file}"
end