Deduplicating news articles that cover the same topic #448
Reference: HswOAuth/llm_course#448
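A question for the instructors: I have news articles stored in several CSV files. Different authors wrote about the same event, so the wording differs considerably between articles. How can I deduplicate them?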
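You can try vectorizing the texts with TF-IDF or word embeddings, then computing the cosine similarity between every pair of articles. Find the pairs whose similarity exceeds a threshold and mark one article in each pair as a duplicate.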
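The TF-IDF variant of this approach can be sketched in plain Python as below. This is only a minimal illustration under several assumptions that are not from the thread: the function names (`char_ngrams`, `tfidf_vectors`, `cosine`, `find_duplicates`) are hypothetical, character bigrams stand in for proper word segmentation so that Chinese text works without an extra tokenizer, and the threshold of 0.5 is an arbitrary starting point that would need tuning on real data. In practice a library such as scikit-learn (`TfidfVectorizer` plus `cosine_similarity`) would likely be more convenient, and word embeddings may cope better with heavily reworded articles.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (assumption: no word segmenter)."""
    text = "".join(text.split())  # drop all whitespace
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def tfidf_vectors(docs, n=2):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    counts_per_doc = [Counter(char_ngrams(d, n)) for d in docs]
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())  # each term counted once per document
    total = len(docs)
    vectors = []
    for counts in counts_per_doc:
        # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
        vectors.append({
            term: tf * (math.log((1 + total) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_duplicates(docs, threshold=0.5):
    """Compare every pair; flag the later article of each similar pair as a duplicate."""
    vectors = tfidf_vectors(docs)
    duplicates = set()
    pairs = []
    for i in range(len(docs)):
        if i in duplicates:
            continue  # already marked as a duplicate of an earlier article
        for j in range(i + 1, len(docs)):
            if j in duplicates:
                continue
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
                duplicates.add(j)
    return pairs, sorted(duplicates)


if __name__ == "__main__":
    # Illustrative texts only; real input would be read from the CSV files.
    articles = [
        "the city council approved the new budget",
        "the city council approved the new budget today",
        "heavy rain flooded the northern highway",
    ]
    pairs, dup_indices = find_duplicates(articles, threshold=0.5)
    for i, j, sim in pairs:
        print(f"article {j} looks like a duplicate of article {i} (similarity {sim:.2f})")
    print("indices to drop:", dup_indices)
```

Note that the pairwise comparison is quadratic in the number of articles, so for a large collection some form of blocking (for example, only comparing articles published on nearby dates) would probably be needed.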