| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131 |
- <?php
- namespace App\Services;
- use Illuminate\Support\Facades\Log;
- class AsyncMarkdownSplitter
- {
- /**
- * 将 Markdown 切分为题目数组
- *
- * @param string $markdown 原始 Markdown 文本
- * @return array 题目数组,每个元素包含 index 和 raw_markdown
- */
- public function split(string $markdown): array
- {
- // 使用正则表达式识别题号作为切分点(只接受“数字 + 明确分隔符”)
- // 注意:不要用 “数字 + 空白” 作为切分点,会误切正文中的列表/步骤/年份等。
- $pattern = '/^\s*(\d{1,4})(?:[\\..、\\))\\]】])\\s*/m';
- // 找到所有匹配的位置
- preg_match_all($pattern, $markdown, $matches, PREG_OFFSET_CAPTURE);
- $candidates = [];
- if (empty($matches[0])) {
- // 没有找到题号,整个作为一块
- return [
- [
- 'index' => 1,
- 'raw_markdown' => trim($markdown)
- ]
- ];
- }
- // 构建分块
- $positions = [];
- foreach ($matches[0] as $match) {
- $positions[] = $match[1];
- }
- for ($i = 0; $i < count($positions); $i++) {
- $start = $positions[$i];
- $end = $i + 1 < count($positions) ? $positions[$i + 1] : strlen($markdown);
- $block = substr($markdown, $start, $end - $start);
- $block = trim($block);
- if (!empty($block)) {
- // 提取题号作为 index
- preg_match('/^\s*(\d+)/', $block, $indexMatch);
- $index = $indexMatch[1] ?? ($i + 1);
- $candidates[] = [
- // sequence:文件内顺序,保证唯一,不会因为 index 重复而覆盖
- 'sequence' => $i + 1,
- 'index' => (int)$index,
- 'raw_markdown' => $block
- ];
- }
- }
- return $candidates;
- }
- /**
- * 验证切分结果
- *
- * @param array $candidates 切分结果
- * @return bool
- */
- public function validate(array $candidates): bool
- {
- // 题号重复在“多套试卷/多章节合并”场景是正常现象,不应判定为失败。
- // 仅做轻量日志,避免输出超长 indexes 列表刷屏。
- $indexes = array_map(fn($item) => $item['index'], $candidates);
- $uniqueCount = count(array_unique($indexes));
- $total = count($indexes);
- if ($total > 0 && $uniqueCount !== $total) {
- Log::warning('Duplicate question indexes detected', [
- 'total' => $total,
- 'unique' => $uniqueCount,
- ]);
- }
- // 检查每个候选是否有内容
- foreach ($candidates as $candidate) {
- if (empty($candidate['raw_markdown'])) {
- Log::warning('Empty markdown content detected', [
- 'index' => $candidate['index']
- ]);
- return false;
- }
- }
- return true;
- }
- /**
- * 获取切分统计信息
- *
- * @param array $candidates 切分结果
- * @return array
- */
- public function getStatistics(array $candidates): array
- {
- $total = count($candidates);
- $avgLength = 0;
- $maxLength = 0;
- $minLength = PHP_INT_MAX;
- foreach ($candidates as $candidate) {
- $length = strlen($candidate['raw_markdown']);
- $avgLength += $length;
- $maxLength = max($maxLength, $length);
- $minLength = min($minLength, $length);
- }
- if ($total > 0) {
- $avgLength = round($avgLength / $total, 2);
- }
- return [
- 'total_candidates' => $total,
- 'avg_length' => $avgLength,
- 'max_length' => $maxLength,
- 'min_length' => $minLength === PHP_INT_MAX ? 0 : $minLength
- ];
- }
- }
|