博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
Python字符搜算stsearch include "stringlib/fastsearc
阅读量:5894 次
发布时间:2019-06-19

本文共 5931 字,大约阅读时间需要 19 分钟。

hot3.png

最近其他语言需要实现一个快速搜索的函数,, 好吧,看过python 源码了,可以求助与 万能的python 了,

字符串匹配算法的网站 :

#![feature(slicing_syntax)] #![feature(phase)]#[phase(plugin, link)] extern crate log;pub mod search {use std::collections::HashMap;fn skip(t: &Vec
, hash: & mut HashMap
) { //v.reverse(); //let mut hash = HashMap::new(); let mut pos :u8 = 0; for i in t.iter(){ hash.insert(*i as u8, t.len() - pos as uint -1); pos +=1; } for i in hash.iter(){ debug!("{}",i); }}pub fn byyel(s :&Vec
, t: &Vec
, start:uint) -> Vec
{ let n = s.len(); let m = t.len(); let mut r = n -m ; let mut i = 0u + start; let mut k :uint; let mut hash :HashMap
= HashMap::new(); skip(t, & mut hash); let mut element_position : Vec
= Vec::new(); while ( i < n-m ){ debug!("i = {}",i); debug!("at while loop..") if (s[i+m-1] == t[m-1]){ if (s[i..i+m-1] == t[0..m-1]) { //return Some(i); debug!("s is {}", s[i..i+m-1]); debug!("t us {}", t[0..m-1]); element_position.push(i); i +=1; } else if ( hash.get( &s[i+m] ) == None) { debug!("at nop 1"); debug!("move {}", m-1) i = i+m -1; } else { debug!("at else one "); let mut element = s[i+m]; //t.get(s[i+m] as uint).unwrap(); let skip= *hash.get( &element ).unwrap() as uint; i = i+ skip + 1; debug!("move {}",skip); } } else{ debug!("...at else..and s[i+m] = {}", s[i+m]); debug!("t get {} {}", s[i+m] , t.get(s[i+m] as uint) ); if ( hash.get(& s[i+m]) == None ){ debug!("at nop 2"); i = i +m +1; debug!("move {}", m+1) } else { i = i+1; } } debug!("source {}",s[i..n]); debug!("dest {}",t) debug!("i ={} and n-m = {}", i, n-m); } return element_position;}}fn main(){ let mut s = vec![1u8,2,3,4,5,6,7,13,10,13,10,13,4,0,8,8,0,9,5,6,7,8,9,4,8,8,4,4,4,8,8,8,4,0,4]; let mut t = vec![10u8,13,10,13]; let z = search::byyel(&s,&t, 0); debug!("{}",z);}

在官网找到一段文字,记下来吧:

The stringlib library is an experimental collection of alternative string operations for Python.

The fast search algorithm described below was added to Python 2.5 during the Need For Speed sprint in Reykjavik. It’s currently used by the 8-bit string and Unicode implementations.

Other components and concepts may appear in future Python releases. The Fast Search Algorithm (aka “BMHBNFS” ;-) #

The new find implementation is typically 2-10 times faster than the one used in Python 2.3 (using a reasonably representative set of test strings, that is). For the find portions of the stringbench test suite, the new algorithm is up to 26 times faster.

The same algorithm can be used for index, split, replace, contains, and the SRE prefix scanner.

When designing the new algorithm, I used the following constraints:

should be faster than the current brute-force algorithm for all test cases (based on real-life code), including Jim Hugunin’s worst-case test (dead link)small setup overhead; no dynamic allocation in the fast path (O(m) for speed, O(1) for storage)sublinear search behaviour in good cases (O(n/m))no worse than the current algorithm in worst case (O(nm))should work well for both 8-bit strings and 16-bit or 32-bit Unicode strings (no O(σ) dependencies)many real-life searches should be good, very few should be worst casereasonably simple implementation

This rules out most standard algorithms (Knuth-Morris-Pratt is not sublinear, Boyer-Moore needs tables that depend on both the alphabet size and the pattern size, most Boyer-Moore variants need tables that depend on the pattern size, etc.).

After some tweaking, I came up with a simplication of Boyer-Moore, incorporating ideas from Horspool and Sunday. Here’s an outline:

def find(s, p): # find first occurrence of p in s n = len(s) m = len(p) skip = delta1(p)[p[m-1]] i = 0 while i <= n-m: if s[i+m-1] == p[m-1]: # (boyer-moore) # potential match if s[i:i+m-1] == p[:m-1]: return i if s[i+m] not in p: i = i + m + 1 # (sunday) else: i = i + skip # (horspool) else: # skip if s[i+m] not in p: i = i + m + 1 # (sunday) else: i = i + 1 return -1 # not found

The delta1(p)[p[m-1]] value is simply the Boyer-Moore delta1 (or bad-character skip) value for the last character in the pattern.

For the s[i+m] not in p test, I use a 32-bit bitmask, using the 5 least significant bits of the character as the key. This could be described as a simple Bloom filter.

Note that the above Python code may access s[n], which would result in an IndexError exception. For the CPython implementation, this is not really a problem, since CPython adds trailing NULL entries to both 8-bit and Unicode strings. Benchmarks #

As reported from the NeedForSpeed sprint (May 2006):

These are highly preliminary figures, but after two days of full attention, the Unicode string type is beginning to get pretty fast. The following is total run times for the stringbench benchmark:

str(ms) uni(ms) %% build %%----------------------------- 2271.31 3608.32 62.9 2.5a2 2261.85 1187.84 190.4 trunk, wednesday morning 2227.48 1002.61 222.2 trunk, wednesday 15:11 UTC 2247.84 875.13 256.9 trunk, wednesday 16.35 UTC

The above is simply the sum of all benchmarks subtests, so the 4x factor is a bit arbitrary. However, you can tell two things from this: the unicode string type has gotten a lot faster, and on this set of tests, it’s actually twice as fast as the 8-bit string type. Not bad.

Looking at certain subsets are also quite interesting. Here’s the result for the count-related tests (which includes counting 4-letter combinations in a 175k DNA sequence):

str(ms) uni(ms) %% build %%----------------------------- 332.44 294.07 113.0 2.5a2 329.90 120.63 273.5 trunk

and here’s the result for the find/index related tests:

str(ms) uni(ms) %% build %%----------------------------- 761.40 1857.61 41.0 2.5a2 758.84 70.93 1069.8 trunk

(yes, it’s 26 times faster!)

Time to turn our attention to the 8-bit string type. More later.

转载于:https://my.oschina.net/innovation/blog/353331

你可能感兴趣的文章
Pytext简介:facebook的基于PyTorch的NLP框架
查看>>
[SQL server 2005]之一配置自动管理数据库和日志大小
查看>>
css 溢出部分隐藏
查看>>
show ip bgp neighbors 信息解读
查看>>
强大的PDF转换控件ActivePDF WebGrabber
查看>>
数字设计
查看>>
RHEL7 挂载U盘
查看>>
Linux安装FTP
查看>>
ios开发入门- WebView使用
查看>>
mysql系列之多实例2----基于多配置文件
查看>>
Blender的插件开发-Operator操作器(算子)
查看>>
Qt学习笔记三:QT项目的代码结构
查看>>
在 RHEL 5.5 下配置 Cacti 0.8.7g
查看>>
十大 Wordpress Hack 技巧
查看>>
让 SCOM 2007 R2 使用 SQL Server 2008 R2 数据库
查看>>
我的友情链接
查看>>
无盘启动esxi物理主机及加载配置
查看>>
Java InetSocketAddress 类说明
查看>>
linux用户密码策略测试
查看>>
kali 更新源 sources.list以及apt-get 相关命令说明
查看>>