加入收藏 | 设为首页 | 会员中心 | 我要投稿 PHP编程网 - 黄冈站长网 (http://www.0713zz.com/)- 数据应用、建站、人体识别、智能机器人、语音技术!
当前位置: 首页 > 综合聚焦 > 编程要点 > 语言 > 正文

敏感词过滤算法达成

发布时间:2021-12-07 15:09:05 所属栏目:语言 来源:互联网
导读:敏感词、文字过滤是一个网站必不可少的功能,如何设计一个好的、高效的过滤算法是非常有必要的。 在实现文字过滤的算法中,DFA是唯一比较好的实现算法。DFA即Deterministic Finite Automaton,也就是确定有穷自动机,它是是通过event和当前的state得到下一
敏感词、文字过滤是一个网站必不可少的功能,如何设计一个好的、高效的过滤算法是非常有必要的。
 
在实现文字过滤的算法中,DFA是唯一比较好的实现算法。DFA即Deterministic Finite Automaton,也就是确定有穷自动机,它是是通过event和当前的state得到下一个state,即event+state=nextstate。在实现敏感词过滤的算法中,我们必须要减少运算,而DFA在DFA算法中几乎没有什么计算,有的只是状态的转换。
 
下面看下在c#方法下实现方式
 
1、构建敏感词库类
private bool LoadDictionary()
       {
           var wordList = new List<string>();
           if (_memoryLexicon == null)
           {
               _memoryLexicon = new WordGroup[char.MaxValue];
               var words = new SensitiveWordBll().GetAllWords();
               if (words == null)
                   return false;
               foreach (string word in words)
               {
                   wordList.Add(word);
                   var chineseWord = Microsoft.VisualBasic.Strings.StrConv(word,
                       Microsoft.VisualBasic.VbStrConv.TraditionalChinese, 0);
                   if (word != chineseWord)
                       wordList.Add(chineseWord);
               }
               foreach (var word in wordList)
               {
                   if (word.Length > 0)
                   {
                       var group = _memoryLexicon[word[0]];
                       if (group == null)
                       {
                           group = new WordGroup();
                           _memoryLexicon[word[0]] = group;
                       }
                       group.Add(word.Substring(1));
                   }
               }
           }
           return true;
       }
2、构建敏感词检测类
private bool Check(string blackWord)
     {
         _wordlenght = 0;
         //检测源下一位游标
         _nextCursor = _cursor + 1;
         var found = false;
         var continueCheck = 0;
         //遍历词的每一位做匹配
         for (var i = 0; i < blackWord.Length; i++)
         {
             //特殊字符偏移游标
             var offset = 0;
             if (_nextCursor >= _sourceText.Length)
             {
                 if (i - 1 < blackWord.Length - 1)
                     found = false;
                 break;
             }
             else
             {
                 //检测下位字符如果不是汉字 数字 字符 偏移量加1
                 for (var y = _nextCursor; y < _sourceText.Length; y++)
                 {
                     if (!IsChs(_sourceText[y]) && !IsNum(_sourceText[y]) && !IsAlphabet(_sourceText[y]))
                     {
                         offset++;
                         //避让特殊字符,下位游标如果>=字符串长度 跳出
                         if (_nextCursor + offset >= _sourceText.Length)
                             break;
                         _wordlenght++;
                     }
                     else break;
                 }
                 if (_nextCursor + offset >= _sourceText.Length)
                 {
                     found = false;
                     break;
                 }
                 if (blackWord[i] == _sourceText[_nextCursor + offset])
                 {
                     found = true;
                     continueCheck = 0;
                 }
                 else
                 {
                     // 匹配不到时尝试继续匹配4个字符
                     if (continueCheck < 4 && _nextCursor < _sourceText.Length - 1)
                     {
                         continueCheck++;
                         i--;
                     }
                     else
                     {
                         found = false;
                         break;
                     }
                 }
             }
             _nextCursor = _nextCursor + 1 + offset;
             _wordlenght++;
         }
         return found;
     }
 } 

(编辑:PHP编程网 - 黄冈站长网)

【声明】本站内容均来自网络,其相关言论仅代表作者个人观点,不代表本站立场。若无意侵犯到您的权利,请及时与联系站长删除相关内容!

    热点阅读