Skip to content Skip to sidebar Skip to footer

Python Efficient Way To Check If Very Large String Contains A Substring

Python is not my best language, and so I'm not all that good at finding the most efficient solutions to some of my problems. I have a very large string (coming from a 30 MB file) a

Solution 1:

Is it really slow? You're talking about 30MB string; let's try it with even bigger string:

In [12]: string="agu82934u"*50*1024*1024+"string to be found"

In [13]: len(string)
Out[13]: 471859218

In [14]: %timeit "string to be found" in string
1 loops, best of 3: 335 ms per loop

In [15]: %timeit "string not to be found" in string
1 loops, best of 3: 200 ms per loop

I wouldn't say that 335 ms is much time looking for substring in 450MB string.


Solution 2:

How slow is too slow? I just did an a in b test for a unique string at the end of a 170 MB string. It finished before my finger left the Enter key.


Solution 3:

You can use one of these algorithms:

Although I believe KMP is more efficient, it's more complicated to implement.The first link includes some pseudo-code that should make it very easy to implement in python.

you can look for alternatives here: http://en.wikipedia.org/wiki/String_searching_algorithm


Solution 4:

I do not see how to make it more optimal on the comparison, to be honest. But you can use less memory and lose less time with I/O if you read the file line by line:

has_small_string = False
for line in large_file:
    if small_string in line:
        has_small_string = True
        break
if has_small_string:
   # ... Your stuff here ...

This is no revolutionary improvement, and can be even less useful if you really need the large string in the memory, but it may be helpful, so I am posting here :)


Solution 5:

If you just want to check if that substring exists,

for line in open("file"):
    if substring in line:
         print "exists"
         sys.exit() # or break to do some other stuff

Post a Comment for "Python Efficient Way To Check If Very Large String Contains A Substring"