How To Efficiently Search For A List Of Strings In Another List Of Strings Using Python?
Solution 1:
I'm not sure if that is what you are looking for:
executives = ['Brian Olsavsky', 'Some Guy', 'Some Lady']
text = ['Justin Post - Bank of America',
"Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
'Brian Olsavsky - Amazon.com',
"Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
"I'll just remind you that the units those do not count",
"In-stock is very strong, especially as we head into the holiday period.",
'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores."]
result= [s for s in text if any(ex in s for ex in executives)]
print(result)
output: ['Brian Olsavsky - Amazon.com']
Solution 2:
str = ['Justin Post - Bank of America',
"Great. Thank you for taking my question. I guess the big one is the deceleration in unit growth or online stores.",
"I know it's a tough 3Q comp, but could you comment a little bit about that?",
'Brian Olsavsky - Amazon.com',
"Thank you, Justin. Yeah, let me just remind you a couple of things from last year.",
"We had two reactions on our Super Saver Shipping threshold in the first half." ,
"I'll just remind you that the units those do not count",
"In-stock is very strong, especially as we head into the holiday period.",
'Dave Fildes - Amazon.com',
"And, Justin, this is Dave. Just to add on to that. You mentioned the online stores"]
executives = ['Brian Olsavsky', 'Justin', 'Some Guy', 'Some Lady']
As an addition, if you need the exact location, you could use this:
print([[i, str.index(q), q.index(i)] for i in executives for q in str if i in q ])
this outputs
[['Brian Olsavsky', 3, 0], ['Justin', 0, 0], ['Justin', 4, 11], ['Justin', 9, 5]]
Solution 3:
TLDR
This answer is focusing on efficiency. Use the other answers if it is not a key issue. If it is, make a dict
from the corpus you are searching in, then use this dict to find what you are looking for.
#import stuff we need later
importstringimport random
import numpy as np
import time
import matplotlib.pyplotas plt
Create example corpus
First we create a list of strings which we will search in.
Create random words, by which I mean random sequence of characters, with a length drawn from a Poisson distribution, with this function:
defpoissonlength_words(lam_word): #generating words, length chosen from a Poisson distribreturn''.join([random.choice(string.ascii_lowercase) for _ inrange(np.random.poisson(lam_word))])
(lam_word
being the parameter of the Poisson distribution.)
Let's create number_of_sentences
variable length sentences from these words (by sentence I mean list of the randomly generated words separated by spaces).
The length of the sentences can also be drawn from a Poisson distribution.
lam_word=5lam_sentence=1000number_of_sentences = 10000sentences = [' '.join([poissonlength_words(lam_word) for _ in range(np.random.poisson(lam_sentence))])
for x in range(number_of_sentences)]
sentences[0]
now will start like this:
tptt lxnwf iem fedg wbfdq qaa aqrys szwx zkmukc...
Let's also create names, which we will search for. Let these names be bigrams. The first name (ie first element of bigram) will be n
characters, last name (second bigram element) will be m
character long, and it will consist of random characters:
def bigramgen(n,m):
return ''.join([random.choice(string.ascii_lowercase) for _ in range(n)])+' '+\
''.join([random.choice(string.ascii_lowercase) for _ in range(m)])
The task
Let's say we want to find sentences where bigrams such as ab c
appears. We don't want to find dab c
or ab cd
, only where ab c
stands alone.
To test how fast a method is, let's find an ever-increasing number of bigrams, and measure the elapsed time. The number of bigrams we search for can be, for example:
number_of_bigrams_we_search_for= [10,30,50,100,300,500,1000,3000,5000,10000]
- Brute force method
Just loop through each bigram, loop through each sentence, use in
to find matches. Meanwhile, measure elapsed time with time.time()
.
bruteforcetime=[]
for number_of_bigrams in number_of_bigrams_we_search_for:
bigrams = [bigramgen(2,1) for _ in range(number_of_bigrams)]
start = time.time()
for bigram in bigrams:
#the core of the brute force method starts here
reslist=[]
for sentencei, sentence in enumerate(sentences):
if' '+bigram+' 'in sentence:
reslist.append([bigram,sentencei])
#and ends here
end = time.time()
bruteforcetime.append(end-start)
bruteforcetime
will hold the number of seconds necessary to find 10, 30, 50 ... bigrams.
Warning: this might take a long time for high number of bigrams.
- The sort your stuff to make it quicker method
Let's create an empty set for every word appearing in any of the sentences (using dict comprehension):
worddict={word:set() for sentence in sentences for word in sentence.split(' ')}
To each of these sets, add the index
of each word in which it appears:
for sentencei, sentence in enumerate(sentences):
for wordi, word in enumerate(sentence.split(' ')):
worddict[word].add(sentencei)
Note that we only do this once, no matter how many bigrams we search later.
Using this dictionary, we can seach for the sentences where the each part of the bigram appears. This is very fast, since calling a dict value is very fast. We then take the intersection of these sets. When we are searching for ab c
, we will have a set of sentence indexes where ab
and c
both appears.
for bigram in bigrams:
reslist=[]
setlist = [worddict[gram] for gram in target.split(' ')]
intersection = set.intersection(*setlist)
for candidate in intersection:
if bigram in sentences[candidate]:
reslist.append([bigram, candidate])
Let's put the whole thing together, and measure the time elapsed:
logtime=[]
fornumber_of_bigramsin number_of_bigrams_we_search_for:
bigrams = [bigramgen(2,1) for_inrange(number_of_bigrams)]
start_time=time.time()
worddict={word:set() forsentencein sentences forwordin sentence.split(' ')}
forsentencei, sentence inenumerate(sentences):
forwordi, word inenumerate(sentence.split(' ')):
worddict[word].add(sentencei)
forbigramin bigrams:
reslist=[]
setlist = [worddict[gram] forgramin bigram.split(' ')]
intersection = set.intersection(*setlist)
forcandidatein intersection:
if bigram in sentences[candidate]:
reslist.append([bigram, candidate])
end_time=time.time()
logtime.append(end_time-start_time)
Warning: this might take a long time for high number of bigrams, but less than the brute force method.
Results
We can plot how much time each method took.
plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')
Or, plotting the y axis
on a log scale:
plt.plot(number_of_bigrams_we_search_for, bruteforcetime,label='linear')
plt.plot(number_of_bigrams_we_search_for, logtime,label='log')
plt.yscale('log')
plt.legend()
plt.xlabel('Number of bigrams searched')
plt.ylabel('Time elapsed (sec)')
Giving us the plots:
Making the worddict
dictionary takes a lot of time, and is a disadvantage when searching for a small number of names. There is a point however, that the corpus is big enough and the number of names we are searching for is high enough that this time is compensated by the speed of searching in it, compared to the brute force method. So, if these conditions are satisfied, I recommend using this method.
(Notebook available here.)
Post a Comment for "How To Efficiently Search For A List Of Strings In Another List Of Strings Using Python?"