github.com/houbb/sensitive-word @main sqlite

764 symbols · 2,983 edges · 192 files · 385 documented (50%) · 2 cross-repo links

README

sensitive-word

sensitive-word is a high-performance sensitive-word tool based on the DFA algorithm.
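For intuition about what a DFA-style dictionary match involves, here is a minimal trie-based sketch. It is not the library's implementation: `TrieMatcherSketch`, `addWord`, and `findAll` here are hypothetical names for illustration, and the longest-match rule is simply a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal trie-based matcher (hypothetical sketch, not the library's code). */
final class TrieMatcherSketch {

    /** One automaton state: outgoing transitions plus an "accepting" flag. */
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean end;
    }

    private final Node root = new Node();

    /** Adds a word by walking (or creating) one state per character. */
    void addWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.end = true;
    }

    /** Returns every dictionary word found in the text, scanning left to right. */
    List<String> findAll(String text) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            Node node = root;
            int matchEnd = -1;
            // Follow transitions from position i and remember the longest accepted prefix.
            for (int j = i; j < text.length(); j++) {
                node = node.next.get(text.charAt(j));
                if (node == null) {
                    break;
                }
                if (node.end) {
                    matchEnd = j + 1;
                }
            }
            if (matchEnd > 0) {
                result.add(text.substring(i, matchEnd));
                i = matchEnd; // continue after the matched word
            } else {
                i++;
            }
        }
        return result;
    }

    /** True when at least one dictionary word occurs in the text. */
    boolean contains(String text) {
        return !findAll(text).isEmpty();
    }
}
```

With the words `bad`, `badge`, and `evil` added, `findAll("a badge is not evil")` returns `[badge, evil]`: each character is consumed with a single map lookup, which is what makes this family of matchers fast on large dictionaries.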

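If you run into hard-to-solve problems, you can join the technical discussion group.

sensitive-word-admin is the companion console application. It is still in early development, but an MVP version is usable.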

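Motivation

Hello everyone, I'm Lao Ma.

I had long wanted a simple, easy-to-use sensitive-word tool, so I built this one and open-sourced it.

It is based on the DFA algorithm. The dictionary currently contains more than 60,000 words (the source files held more than 180,000 before one round of pruning).

The dictionary will continue to be refined and extended, and the performance of the algorithm will be improved further.

Since v0.24.0 there is built-in support for fine-grained classification of sensitive words. The workload is substantial, so omissions are inevitable.

PRs are welcome. You can also file requests on GitHub or join the technical discussion group to chat.

Features

Interchange of full-width and half-width characters, English upper and lower case, common digit forms, Traditional and Simplified Chinese, and common English letter forms, plus ignoring repeated characters, among others.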

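Recommended projects

Some related libraries for logging, encryption/decryption, and data masking:

Project	Description
sensitive-word	High-performance sensitive-word core library
sensitive-word-admin	Sensitive-word console with separate front end and back end
sensitive	High-performance log masking component
auto-log	Unified logging aspect component with full-link traceId support
encryption-local	Offline encryption machine component
encryption	Encryption machine standard API plus local client
encryption-server	Encryption machine service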

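Change log

CHANGE_LOG.md

Supporting open source

Open source is not easy. If this project helps you, you can buy Lao Ma a cup of milk tea.

Quick start

Prerequisites

JDK 1.8+
Maven 3.x+

Maven dependency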

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>sensitive-word</artifactId>
    <version>0.29.5</version>
</dependency>

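Core methods

SensitiveWordHelper is the utility class for sensitive words. Its core methods are listed below.

Note: SensitiveWordHelper only exposes the default configuration. For flexible custom configuration, see the feature configuration of the bootstrap class.

Method	Parameters	Return value	Description
contains(String)	String to check	boolean	Checks whether the string contains a sensitive word
replace(String, IWordReplace)	String to process and a replacement strategy	String	Returns the masked string, using the given strategy
replace(String, char)	String to process and a replacement char	String	Returns the masked string, using the given char
replace(String)	String to process	String	Returns the masked string, using `*` as the replacement
findAll(String)	String to check	List of strings	Returns all sensitive words in the string
findFirst(String)	String to check	String	Returns the first sensitive word in the string
findAll(String, IWordResultHandler)	String to check and an IWordResultHandler result handler	List of handler results	Returns all sensitive words in the string, in the form produced by the handler
findFirst(String, IWordResultHandler)	String to check and an IWordResultHandler result handler	Handler result	Returns the first sensitive word in the string, in the form produced by the handler
tags(String)	Sensitive word string	Tag collection	Returns the tags of the sensitive word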

Check whether text contains a sensitive word

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";

Assert.assertTrue(SensitiveWordHelper.contains(text));

Return the first sensitive word

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("五星红旗", word);

SensitiveWordHelper.findFirst(text) is equivalent to:

String word = SensitiveWordHelper.findFirst(text, WordResultHandlers.word());

Return all sensitive words

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星红旗, 毛主席, 天安门]", wordList.toString());

Returning all sensitive words works much like SensitiveWordHelper.findFirst() and likewise accepts a result handler.

SensitiveWordHelper.findAll(text) is equivalent to:

List<String> wordList = SensitiveWordHelper.findAll(text, WordResultHandlers.word());

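WordResultHandlers.raw() keeps the index information of each match, and WordResultHandlers.wordTags() keeps the tag (category) information. The example below uses wordTags():

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";

// By default, the tags of a sensitive word are empty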
List<WordTagsDto> wordList1 = SensitiveWordHelper.findAll(text, WordResultHandlers.wordTags());
Assert.assertEquals("[WordTagsDto{word='五星红旗', tags=[]}, WordTagsDto{word='毛主席', tags=[]}, WordTagsDto{word='天安门', tags=[]}]", wordList1.toString());

Default replacement strategy

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";
String result = SensitiveWordHelper.replace(text);
Assert.assertEquals("****迎风飘扬，***的画像屹立在***前。", result);

Specify the replacement character

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";
String result = SensitiveWordHelper.replace(text, '0');
Assert.assertEquals("0000迎风飘扬，000的画像屹立在000前。", result);

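Custom replacement strategy

Supported since v0.2.0.

Scenario: sometimes different sensitive words should be replaced with different text. For example, 【游戏】 (game) is replaced with 【电子竞技】 (e-sports), and 【失业】 (unemployment) with 【灵活就业】 (flexible employment).

Admittedly, running regex string replacements beforehand also works, but its performance is mediocre.

Example: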

/**
 * Custom replacement strategy
 * @since 0.2.0
 */
@Test
public void defineReplaceTest() {
    final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";

    IWordReplace replace = new MyWordReplace();
    String result = SensitiveWordHelper.replace(text, replace);

    Assert.assertEquals("国家旗帜迎风飘扬，教员的画像屹立在***前。", result);
}

Here MyWordReplace is our custom replacement strategy, implemented as follows:

public class MyWordReplace implements IWordReplace {

    @Override
    public void replace(StringBuilder stringBuilder, final char[] rawChars, IWordResult wordResult, IWordContext wordContext) {
        String sensitiveWord = InnerWordCharUtils.getString(rawChars, wordResult);
        // Custom replacement per sensitive word; the mapping could be loaded from a database or elsewhere
        if("五星红旗".equals(sensitiveWord)) {
            stringBuilder.append("国家旗帜");
        } else if("毛主席".equals(sensitiveWord)) {
            stringBuilder.append("教员");
        } else {
            // Everything else is replaced with * by default
            int wordLength = wordResult.endIndex() - wordResult.startIndex();
            for(int i = 0; i < wordLength; i++) {
                stringBuilder.append('*');
            }
        }
    }

}

We map some of the words to fixed replacements and convert all others to * by default.
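The same fixed-mapping idea can be sketched without the library, assuming the matches are already known as half-open `[start, end)` index pairs. `MappingReplaceSketch` and its `replace` method are hypothetical names for illustration; the lookup table stands in for whatever source (a database, a config file) holds the mappings.

```java
import java.util.Map;

/** Map-driven replacement over known match spans (hypothetical sketch). */
final class MappingReplaceSketch {

    /**
     * Replaces each [start, end) span using the mapping and masks unmapped
     * words with '*'. Spans must be sorted and must not overlap.
     */
    static String replace(String text, int[][] spans, Map<String, String> mapping) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (int[] span : spans) {
            // Copy the untouched text between the previous match and this one.
            out.append(text, cursor, span[0]);
            String word = text.substring(span[0], span[1]);
            String target = mapping.get(word);
            if (target != null) {
                out.append(target);
            } else {
                // No fixed mapping: mask one '*' per original character.
                for (int i = span[0]; i < span[1]; i++) {
                    out.append('*');
                }
            }
            cursor = span[1];
        }
        // Copy whatever follows the last match.
        out.append(text, cursor, text.length());
        return out.toString();
    }
}
```

Keeping the mapping in a `Map` means new word-to-replacement pairs can be added without touching the replacement logic, which is harder with a growing `if`/`else` chain.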

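IWordResultHandler result handlers

IWordResultHandler processes the results of sensitive-word matching and can be customized by users.

Built-in implementations are available in the WordResultHandlers utility class:

WordResultHandlers.word()

Keeps only the sensitive word itself.

WordResultHandlers.raw()

Keeps the match information, including the start and end indexes of the sensitive word.

WordResultHandlers.wordTags()

Keeps both the word and its tag information.

Usage examples

For all test cases, see SensitiveWordHelperTest.

1) Basic example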

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星红旗, 毛主席, 天安门]", wordList.toString());
List<String> wordList2 = SensitiveWordHelper.findAll(text, WordResultHandlers.word());
Assert.assertEquals("[五星红旗, 毛主席, 天安门]", wordList2.toString());

List<IWordResult> wordList3 = SensitiveWordHelper.findAll(text, WordResultHandlers.raw());
Assert.assertEquals("[WordResult{startIndex=0, endIndex=4}, WordResult{startIndex=9, endIndex=12}, WordResult{startIndex=18, endIndex=21}]", wordList3.toString());
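The indexes in the expected output above read as half-open ranges: `endIndex - startIndex` equals the word length, so `text.substring(startIndex, endIndex)` recovers the matched word (for the sample text, `0..4`, `9..12`, and `18..21` give the three words). The helper below is a small sketch of that relationship; `RawIndexSketch` is a hypothetical name, and spans are passed as plain `int` pairs rather than the library's result type.

```java
import java.util.ArrayList;
import java.util.List;

/** Recovers matched words from [startIndex, endIndex) pairs (hypothetical sketch). */
final class RawIndexSketch {

    /** Each span is {startIndex, endIndex}, with endIndex exclusive. */
    static List<String> wordsOf(String text, int[][] spans) {
        List<String> words = new ArrayList<>();
        for (int[] span : spans) {
            words.add(text.substring(span[0], span[1]));
        }
        return words;
    }
}
```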

2) wordTags example

The tags for each word are specified in the dict_tag_test.txt file.

final String text = "五星红旗迎风飘扬，毛主席的画像屹立在天安门前。";

// By default, the tags of a sensitive word are empty
List<WordTagsDto> wordList1 = SensitiveWordHelper.findAll(text, WordResultHandlers.wordTags());
Assert.assertEquals("[WordTagsDto{word='五星红旗', tags=[]}, WordTagsDto{word='毛主席', tags=[]}, WordTagsDto{word='天安门', tags=[]}]", wordList1.toString());

List<WordTagsDto> wordList2 = SensitiveWordBs.newInstance()
        .wordTag(WordTags.file("dict_tag_test.txt"))
        .init()
        .findAll(text, WordResultHandlers.wordTags());
Assert.assertEquals("[WordTagsDto{word='五星红旗', tags=[政治, 国家]}, WordTagsDto{word='毛主席', tags=[政治, 伟人, 国家]}, WordTagsDto{word='天安门', tags=[政治, 国家, 地址]}]", wordList2.toString());

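More features

The features below mainly handle a variety of special cases, to raise the sensitive-word hit rate as much as possible.

This is a long battle of attack and defense.

Style handling

Ignore case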

final String text = "fuCK the bad words.";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("fuCK", word);

Ignore full-width and half-width

final String text = "ｆｕｃｋ the bad words.";

String word = SensitiveWordHelper.findFirst(text);
Assert.assertEquals("ｆｕｃｋ", word);
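One way to picture the case and width handling shown above is as a per-character normalization applied before the dictionary lookup. The sketch below is an assumption about the general technique, not the library's code, and `CharNormalizeSketch` is a hypothetical name. It relies on the fact that the full-width ASCII variants U+FF01 to U+FF5E sit exactly 0xFEE0 above their half-width counterparts.

```java
/** Per-character width and case normalization (hypothetical sketch). */
final class CharNormalizeSketch {

    /** Maps full-width forms to half-width, then lower-cases the result. */
    static char normalize(char c) {
        if (c == '\u3000') {
            // Ideographic (full-width) space becomes a plain space.
            c = ' ';
        } else if (c >= '\uFF01' && c <= '\uFF5E') {
            // Full-width ASCII variants are offset by 0xFEE0.
            c = (char) (c - 0xFEE0);
        }
        return Character.toLowerCase(c);
    }

    /** Normalizes every character; the output has the same length as the input. */
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            out.append(normalize(text.charAt(i)));
        }
        return out.toString();
    }
}
```

Because the mapping is one character to one character, match indexes found on the normalized text still point at the right positions in the original text, which is consistent with the examples above returning the word exactly as it was written.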

Ignore digit forms

Common digit forms are converted to a single form here.

final String text = "这个是我的微信：9⓿二肆⁹₈③⑸⒋➃㈤㊄";

List<String> wordList = SensitiveWordBs.newInstance().enableNumCheck(true).init().findAll(text);
Assert.assertEquals("[9⓿二肆⁹₈③⑸⒋➃㈤㊄]", wordList.toString());

Ignore Traditional and Simplified Chinese

final String text = "我爱我的祖国和五星紅旗。";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[五星紅旗]", wordList.toString());

Ignore English letter styles

final String text = "Ⓕⓤc⒦ the bad words";

List<String> wordList = SensitiveWordHelper.findAll(text);
Assert.assertEquals("[Ⓕⓤc⒦]", wordList.toString());

Ignore repeated characters

final String text = "ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦ the bad words";

List<String> wordList = SensitiveWordBs.newInstance()
        .ignoreRepeat(true)
        .init()
        .findAll(text);
Assert.assertEquals("[ⒻⒻⒻfⓤuⓤ⒰cⓒ⒦]", wordList.toString());
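The core of ignoring repeats can be sketched as collapsing each run of identical characters into one before matching. This is only an illustration under that assumption: `IgnoreRepeatSketch` is a hypothetical name, and the library evidently also folds different letter styles together first (as in the example above), which this sketch does not attempt.

```java
/** Collapses runs of identical characters (hypothetical sketch). */
final class IgnoreRepeatSketch {

    /** Keeps a character only when it differs from the one right before it. */
    static String collapseRepeats(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (i == 0 || c != text.charAt(i - 1)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

The trade-off is that legitimate double letters collapse as well (`good` becomes `god`), which may be one reason the feature is enabled explicitly through `ignoreRepeat(true)` in the example above.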

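No.	Method	Description	Default
16	wordCheckNum	Number check strategy (supported since v0.25.0)	`WordChecks.num()`
17	wordCheckEmail	Email check strategy (supported since v0.25.0)	`WordChecks.email()`
18	wordCheckUrl	URL check strategy (supported since v0.25.0); `urlNoPrefix()` is also built in	`WordChecks.url()`
19	wordCheckIpv4	IPv4 check strategy (supported since v0.25.0)	`WordChecks.ipv4()`
20	wordCheckWord	Sensitive-word check strategy (supported since v0.25.0)	`WordChecks.word()`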
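As a rough illustration of what a number check strategy might do, the sketch below finds runs of consecutive ASCII digits of a minimum length. The IWordCheck description on this page mentions runs of 8 or more digits, so 8 is used in the usage note; everything else (`DigitRunSketch`, `findDigitRuns`, the ASCII-only restriction) is an assumption made for the example, not the behavior of `WordChecks.num()`.

```java
import java.util.ArrayList;
import java.util.List;

/** Finds long runs of consecutive digits (hypothetical sketch). */
final class DigitRunSketch {

    /** Returns every run of at least minLen consecutive ASCII digits. */
    static List<String> findDigitRuns(String text, int minLen) {
        List<String> runs = new ArrayList<>();
        int start = -1; // start index of the current run, or -1 when outside a run
        // Iterate one position past the end so a trailing run is closed too.
        for (int i = 0; i <= text.length(); i++) {
            boolean digit = i < text.length()
                    && text.charAt(i) >= '0' && text.charAt(i) <= '9';
            if (digit) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minLen) {
                    runs.add(text.substring(start, i));
                }
                start = -1;
            }
        }
        return runs;
    }
}
```

For example, `findDigitRuns("call 12345678 or 1234", 8)` returns `[12345678]`; the shorter run is ignored because it is below the threshold.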

Extension points exported contracts — how you extend this code

IWordDeny (Interface)

Deny data: the returned content is treated as sensitive words. @author binbin.hou @since 0.0.13 [22 implementers]

src/main/java/com/github/houbb/sensitive/word/api/IWordDeny.java

IWordReplace (Interface)

Sensitive-word replacement strategy. @author binbin.hou @since 0.2.0 [7 implementers]

src/main/java/com/github/houbb/sensitive/word/api/IWordReplace.java

IWordCheck (Interface)

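Sensitive-information check interface: (1) sensitive words, (2) numbers (8 or more consecutive digits), (3) email addresses, (4) URLs. Checks can be invoked in a loop using the chain-of-responsibility pattern. @author binbin.hou @since 0.0.5 [9 implementers]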

src/main/java/com/github/houbb/sensitive/word/api/IWordCheck.java

IWordFormatText (Interface)

Whole-word formatting. @author binbin.hou @since 0.28.0 [12 implementers]

src/main/java/com/github/houbb/sensitive/word/api/IWordFormatText.java

IWordFormat (Interface)

Word formatting: (1) ignore case, (2) ignore full-width/half-width, (3) ignore stop words, (4) ignore digit conversion. @author binbin.hou @since 0.0.5 [20 implementers]

src/main/java/com/github/houbb/sensitive/word/api/IWordFormat.java

Core symbols most depended-on inside this repo

init

called by 116

src/main/java/com/github/houbb/sensitive/word/support/tag/WordTags.java

newInstance

called by 115

src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java

contains

called by 72

src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java

findAll

called by 67

src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java

wordDeny

called by 60

src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java

wordAllow

called by 40

src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java

addWord

called by 39

src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java

findAll

called by 30

src/main/java/com/github/houbb/sensitive/word/api/ISensitiveWord.java

Shape

Method 585

Class 156

Interface 19

Enum 4

Languages

Java 100%

Modules by API surface

src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordBs.java · 49 symbols

src/main/java/com/github/houbb/sensitive/word/bs/SensitiveWordContext.java · 30 symbols

src/main/java/com/github/houbb/sensitive/word/api/IWordContext.java · 28 symbols

src/test/java/com/github/houbb/sensitive/word/core/SensitiveWordHelperTest.java · 13 symbols

src/test/java/com/github/houbb/sensitive/word/bugs/b20260323/AddWordBugTest.java · 13 symbols

src/test/java/com/github/houbb/sensitive/word/bs/SensitiveWordBsResultConditionTest.java · 12 symbols

src/test/java/com/github/houbb/sensitive/word/bs/SensitiveWordBsTest.java · 11 symbols

src/main/java/com/github/houbb/sensitive/word/support/check/WordChecks.java · 11 symbols

src/main/java/com/github/houbb/sensitive/word/support/format/WordFormats.java · 10 symbols

src/main/java/com/github/houbb/sensitive/word/support/data/WordDataTree.java · 10 symbols

src/test/java/com/github/houbb/sensitive/word/support/tag/WordTagTest.java · 9 symbols

src/test/java/com/github/houbb/sensitive/word/benchmark/BenchmarkBasicTest.java · 9 symbols

Used by 2 indexed graphs manifest dependencies, hub-wide

github.com/YunaiV/ruoyi-vue-pro

github.com/YunaiV/yudao-cloud

Dependencies from manifests, versioned

com.github.houbb:heaven · 1×

com.github.houbb:opencc4j · 1×

com.github.houbb:sensitive-word-data · 1×

junit:junit · 1×

org.apache.lucene:lucene-core 4.0.0 · 1×

For agents

$ claude mcp add sensitive-word \
  -- python -m otcore.mcp_server <graph>
