Adjust real estate story creation logic

2026-02-23 14:49:10 +08:00
parent 6008a7ff4b
commit c1e5db64c1
5 changed files with 173 additions and 6 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
 __pycache__
+logs
--- a/llm/prompts/a.txt
+++ b/llm/prompts/a.txt
@@ -0,0 +1,24 @@
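+# Role
+You are a **real estate data analyst** who excels at deconstructing stories with data. You like to take real home-buying and home-selling stories and distill from them the **data truths** and **hidden cost models** that ordinary people cannot see.
+
+# Task
+Based on the **real home-buying story material** I provide, write a story-style news post for Weitoutiao.
+
+# Content structure requirements
+1. **Title (emotional hook)**: Use a clash of numbers or a painful scene to attract clicks, e.g. "XX0,000 yuan for the home, XX0,000 yuan for the renovation: after doing the math, I fell silent".
+2. **Story introduction (first or third person)**:
+   - Summarize what happened, highlighting the key numbers (home price, unpaid fees, renovation cost).
+   - Add details that make the story feel real (e.g. how many years the fees went unpaid, specific renovation items).
+3. **Data analyst's perspective (core)**:
+   - Break the story down into data dimensions: total cost of ownership, share of hidden costs, opportunity-cost comparison.
+   - Present it as a calculation: true cost = purchase price + unpaid fees + hard renovation costs + cost of tied-up capital.
+   - Use lists and comparisons to show the data truth behind "looked like a bargain, turned out to be a trap".
+4. **Model-style output**: Distill a formula or pitfall-avoidance rule (e.g. a "holding-cost model for old, run-down, small apartments") that readers can apply themselves.
+
+# Style requirements
+- Length: 400-600 characters.
+- Plain but tense language; use plenty of short sentences and concrete numbers.
+- The ending should be insightful, leaving readers feeling they "learned something".
+
+Please write a Weitoutiao post based on the material I provide.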
+
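Review note: the cost formula in the prompt above (true cost = purchase price + unpaid fees + hard renovation costs + cost of tied-up capital) can be illustrated with a minimal Python sketch. The function name `true_cost`, the simple-interest treatment of tied-up capital, and the sample figures are all assumptions made for illustration; none of them come from this commit.

```python
def true_cost(price: float, unpaid_fees: float, renovation: float,
              annual_rate: float, years: float) -> dict:
    """Hypothetical helper: total cost of a purchase including hidden costs."""
    # Cash actually paid out: purchase price plus the two visible extras.
    outlay = price + unpaid_fees + renovation
    # Assumption: tied-up capital is priced as simple interest on the cash outlay.
    capital_cost = outlay * annual_rate * years
    total = outlay + capital_cost
    # Everything beyond the sticker price counts as hidden cost here.
    hidden = total - price
    return {"total": total, "hidden": hidden, "hidden_share": hidden / total}


# Made-up sample figures, in yuan.
result = true_cost(price=1_000_000, unpaid_fees=20_000, renovation=150_000,
                   annual_rate=0.03, years=2)
print(f"total={result['total']:.0f} hidden_share={result['hidden_share']:.1%}")
```

A compounding or mortgage-based model of capital cost would give different numbers; the simple-interest version is only the shortest way to make the four terms of the formula concrete.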
--- a/llm/prompts/real_estate_story_selection_system_prompt.txt
+++ b/llm/prompts/real_estate_story_selection_system_prompt.txt
@@ -0,0 +1 @@
+Below are the contents of several posts. Select the 2 posts best suited for writing a story, and reply with their numbers, for example: 1,3
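Review note: this prompt asks the model to reply with numbers such as `1,3`, but models often wrap the numbers in extra wording or use a full-width comma. A more tolerant parser might look like the sketch below. The name `parse_selection` and the fallback to the first posts are assumptions for illustration, not part of this commit.

```python
import re


def parse_selection(reply: str, total: int, limit: int = 2) -> list:
    """Return up to `limit` distinct zero-based indices named in `reply`."""
    picked = []
    # \d+ tolerates extra wording, full-width commas, or line breaks around the numbers.
    for token in re.findall(r"\d+", reply):
        idx = int(token) - 1  # the prompt numbers posts starting at 1
        if 0 <= idx < total and idx not in picked:
            picked.append(idx)
        if len(picked) == limit:
            break
    # Assumed fallback: if nothing usable was parsed, keep the first `limit` posts
    # instead of ending up with an empty selection.
    if not picked:
        picked = list(range(min(limit, total)))
    return picked
```

For example, `parse_selection("I pick 2 and 4.", total=5)` yields `[1, 3]`, and a reply with no digits falls back to the first two posts. Whether to fall back or to skip the run entirely is a design choice for the author.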
--- a/llm/prompts/wechat_official_account_system_prompt.txt
+++ b/llm/prompts/wechat_official_account_system_prompt.txt
@@ -0,0 +1,46 @@
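+# Role setting
+You are a lead writer for a WeChat official account who specializes in "life stories of urban young people". Your style is delicate, authentic, and immersive; you excel at writing in the first person and giving ordinary people's daily lives a cinematic feel. Your readers are young people aged 25-35 in first- and second-tier cities: anxious, pragmatic, and longing to be understood.
+
+# Core requirements
+Write a WeChat official account article based on the material I provide. The article must meet the following standards:
+
+## 1. Title strategy (choose one of three, according to the mood of the material)
+- Suspense: use concrete numbers plus conflicting emotions to spark curiosity (e.g. "I'm XX years old, living in Guangzhou, and I spent X.XX million on a crazy decision")
+- Resonance: speak directly to the target audience's shared anxiety (e.g. "Only after giving up on a resale home did I realize this is scarier than housing prices")
+- Story: as natural as chatting with a friend (e.g. "After a year of back-and-forth, I finally bought a home in XX")
+
+## 2. Opening technique
+The opening must start from "scene + emotion". Begin with a specific moment (for example, the instant of signing the contract, the heavy rain on the day of the viewing, the morning a payment-deduction text arrived), and use details (the sound of rain, trembling hands, a phone screen, a particular look on a parent's face) to pull the reader straight into the story. The first paragraph should ideally end with a "hook": a question or a sense of unease that makes the reader want to know "what happened next".
+
+## 3. Body structure
+Advance with the four-part rhythm of "introduction, development, turn, conclusion":
+- Introduction: set out the background (who I am, why I am doing this, my circumstances and bottom line)
+- Development: unfold the process (which developments I viewed, what dilemmas I ran into, what real-life details there were)
+- Turn: the turning point (why I gave up the earlier options, what the final choice was, what compromises I made)
+- Conclusion: the present state (how it really feels now that the dust has settled; not sentimental, but sincere)
+
+## 4. Core techniques
+- **Details instead of adjectives**: don't say "the home was old", say "the paint in the stairwell was peeling a little and the light tubes had yellowed"; don't say "I was anxious", say "I stared at the deduction text on my phone, my hand shaking slightly"
+- **Conversational feel**: add some dialogue where appropriate (something the agent said, a piece of advice from Mom and Dad) to bring the article to life
+- **Emotional arc**: the opening can carry confusion/anxiety → the middle struggle/indecision → the ending calm/acceptance, even a touch of self-mockery
+- **Quotable closing lines**: the last sentence of each paragraph should, where possible, make people want to screenshot it and post it to WeChat Moments
+
+## 5. Ending strategy
+- Don't force a moral; don't say "so we must be brave"
+- Return to reality: I don't know whether it was right, I don't know whether I'll regret it, but this is the path I chose
+- Add a "hook": if you have had a similar experience or doubt, feel free to reach out and chat (leading into a private channel or community)
+
+## 6. Style vocabulary bank (mix freely)
+- Concrete words: rain, trembling hands, phone screen, parents' hands, the lights of the sales office, tower cranes on the construction site, navigation screenshots, the weekend house-hunting trips
+- Emotional words: hollow inside, unable to breathe, a knot of dread, gritted my teeth, didn't dare gamble, so be it, actually it's fine
+- Action words: gripping my hand, signing my name, swiping away the deduction notice, pushing open the sales office door, standing on the balcony looking down
+
+# Output requirements
+- Full text of 1500-2500 characters
+- Clear paragraphs, 3-5 lines each
+- Use short sentences where appropriate to create breathing room
+- Output in JSON format with the fields: title (the title) and body (the body text).
+
+---
+
+Now, based on the material below, write a WeChat official account article that meets the requirements above.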
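Review note: the prompt above asks for JSON with `title` and `body` fields, yet model replies sometimes carry extra text around the JSON. A minimal sketch of a tolerant extractor follows. The name `extract_article` and the choice to return `None` on failure are assumptions for illustration; the commit handles this inline in `story_edit_task`.

```python
import json
from typing import Optional


def extract_article(reply: str) -> Optional[dict]:
    """Pull the outermost {...} span out of an LLM reply and parse it."""
    start = reply.find("{")
    end = reply.rfind("}")
    # find()/rfind() return -1 on a miss; also reject a "}" that precedes the "{".
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Missing fields become empty strings; the caller decides how to handle them.
    return {"title": str(data.get("title", "")), "body": str(data.get("body", ""))}
```

Keeping the slice and the `json.loads` call inside one function means both failure modes (no braces, invalid JSON) reach the caller the same way, so the loop can simply `continue` when it gets `None`.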
--- a/task/hot_topic/real_estate_story.py
+++ b/task/hot_topic/real_estate_story.py
@@ -1,4 +1,4 @@
-from datetime import datetime, timedelta
+from datetime import datetime
 import json
 from task.manager_task import execute_task
 from config.database import SessionLocal
@@ -9,17 +9,112 @@ from llm import LLMThinkingEngine

 def story_edit_task():
    with SessionLocal() as db:
-        # Fetch today's latest posts (capped at 10)
+        # Fetch today's posts (capped at 50)
        today_contents = db.query(SourceContent).filter(
            SourceContent.create_time >= (datetime.today().replace(hour=0, minute=0, second=0, microsecond=0))
-        ).order_by(SourceContent.create_time.desc()).limit(10).all()
+        ).order_by(SourceContent.create_time.desc()).limit(50).all()
        if len(today_contents) == 0:
            logger.info("story_edit_task finish, content size 0")
            return
        logger.info(f"story_edit_task get {len(today_contents)} contents")

+        # Sort posts by body length
+        # Helper: parse the JSON and return the length of its content field
+        def get_content_length(item):
+            try:
+                if not item.content:
+                    return 0
+                data = json.loads(item.content)
+                # Read the content field safely, guarding against None
+                body = data.get('content') or ''
+                return len(body)
+            except (json.JSONDecodeError, TypeError, AttributeError):
+                return 0
+        today_contents.sort(key=get_content_length, reverse=True)
+        
+        # Drop posts whose body is shorter than 200 characters
+        to_processed_contents = [content for content in today_contents if get_content_length(content) >= 200]
+        logger.info(f"story_edit_task after filter content size {len(to_processed_contents)}")
+
+        # If no post qualifies, fall back to the longest post (even though it is shorter than 200 characters)
+        if len(to_processed_contents) == 0 and len(today_contents) > 0:
+            to_processed_contents = [today_contents[0]]
+
+        # The LLM is called below to select posts, so cap the combined body length at 10000 characters (cost safety)
+        total_length = sum(get_content_length(content) for content in to_processed_contents)
+        if total_length > 10000:
+            # The list is sorted longest first, so pop() drops the shortest remaining post; repeat until the total is at most 10000
+            while total_length > 10000 and to_processed_contents:
+                removed_content = to_processed_contents.pop()
+                total_length -= get_content_length(removed_content)
+
+        # If to_processed_contents holds more than 2 posts, let the LLM pick the 2 best suited for writing a story
+        # Helper: parse the JSON and return its content field
+        def get_content(item):
+            try:
+                if not item.content:
+                    return ""
+                data = json.loads(item.content)
+                # Read the content field safely, guarding against None
+                body = data.get('content') or ''
+                return body
+            except (json.JSONDecodeError, TypeError, AttributeError):
+                return ""
+        if len(to_processed_contents) > 2:
+            llm_engine = LLMThinkingEngine(system_prompt_file="real_estate_story_selection_system_prompt.txt")
+            content_list_str = "\n".join([f"{idx+1}. {get_content(content)}" for idx, content in enumerate(to_processed_contents)])
+            logger.info(f"story_edit_task LLM selection content list: {content_list_str}")
+            selection_result = llm_engine.think(content_list_str)
+            logger.info(f"story_edit_task LLM selection result: {selection_result}")
+            # Parse the LLM's reply and extract the numeric indices (duplicates are ignored)
+            selected_indices = []
+            for part in selection_result.split(","):
+                part = part.strip()
+                if part.isdigit():
+                    idx = int(part) - 1
+                    if 0 <= idx < len(to_processed_contents) and idx not in selected_indices:
+                        selected_indices.append(idx)
+                if len(selected_indices) >= 2:
+                    break
+            to_processed_contents = [to_processed_contents[idx] for idx in selected_indices]
+            logger.info(f"story_edit_task after LLM selection content size {len(to_processed_contents)}")
+        
+        # Write a story for each post that survived filtering (the break at the end of the loop is commented out, so all of them are processed)
+        llm_engine = LLMThinkingEngine(system_prompt_file="wechat_official_account_system_prompt.txt")
+        for content in to_processed_contents:
+            logger.info(f"story_edit_task content id: {content.id}, title: {content.link}, platform: {content.platform}")
+            story = llm_engine.think(f"[Source material]\n{content.content}")
+            logger.info(f"story_edit_task content id: {content.id} story: {story}")
+            # The LLM reply is sometimes not pure JSON: extra text may appear before or after it,
+            # so extract the JSON part before parsing
+            json_start = story.find("{")
+            json_end = story.rfind("}") + 1
+            # rfind() returns -1 on a miss, so json_end is 0 (never -1) when no "}" is found
+            if json_start == -1 or json_end <= json_start:
+                logger.warning(f"story_edit_task content id: {content.id} LLM output contains no JSON object, cannot extract the story")
+                continue
+            try:
+                json_story = json.loads(story[json_start:json_end])
+            except json.JSONDecodeError:
+                logger.warning(f"story_edit_task content id: {content.id} LLM output is not valid JSON, cannot parse the story")
+                continue
+            # Write the generated story to the Article table
+            title = json_story.get("title", "Untitled")
+            article_content = json_story.get("body", "No content")
+            # article_content may contain runs of consecutive newlines; collapse them into single newlines
+            # article_content = "\n".join([line.strip() for line in article_content.splitlines() if line.strip()])
+            article = Article(
+                title=title,
+                keywords=None,
+                content=article_content,
+                used=False
+            )
+            db.add(article)
+            db.commit()
+            # break  # Process only one item for now; switch to batch processing later
+
        llm_engine = LLMThinkingEngine(system_prompt_file="real_estate_story_system_prompt.txt")
-        for content in today_contents:
+        for content in to_processed_contents:
            logger.info(f"story_edit_task content id: {content.id}, title: {content.link}, platform: {content.platform}")
            story = llm_engine.think(f"Story material: {content.content}")
            logger.info(f"story_edit_task content id: {content.id} story: {story}")
@@ -51,7 +146,7 @@ def story_edit_task():
            # break  # Process only one item for now; switch to batch processing later
        
        llm_engine = LLMThinkingEngine(system_prompt_file="real_estate_story_short_system_prompt.txt")
-        for content in today_contents:
+        for content in to_processed_contents:
            logger.info(f"story_edit_task content id: {content.id}, title: {content.link}, platform: {content.platform}")
            story = llm_engine.think(f"Story material: {content.content}")
            logger.info(f"story_edit_task content id: {content.id} story: {story}")