Python Threading Or Multiprocessing For Web-crawler?
Solution 1:
The rule of thumb when deciding whether to use threads in Python is to ask whether the work the threads will be doing is CPU-intensive or I/O-intensive. If it is I/O-intensive, you can go with threads.
Because of the GIL, the CPython interpreter executes Python bytecode in only one thread at a time. When a thread blocks on I/O (waiting for data from a network connection or the disk, for example), it releases the GIL, and in the meantime the interpreter switches to another thread. A thread doing a CPU-intensive task, on the other hand, holds the GIL most of the time, so the other threads have to wait until the interpreter decides to run them, and adding more threads does not speed up the computation.
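To make the difference concrete, here is a minimal sketch that times a few threads running a blocking task. It assumes `time.sleep` is an acceptable stand-in for a blocking network read (both release the GIL while waiting); `io_task`, `cpu_task`, and `timed_threads` are hypothetical names used only for this illustration.

```python
import threading
import time


def io_task(delay=0.2):
    # Stand-in for a blocking network read: the GIL is released while sleeping.
    time.sleep(delay)


def cpu_task(n=200_000):
    # Pure-Python arithmetic: the thread holds the GIL while it computes.
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed_threads(target, count=4):
    # Start `count` threads running `target` and return the wall-clock time
    # until all of them have finished.
    threads = [threading.Thread(target=target) for _ in range(count)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start
```

Calling `timed_threads(io_task)` should take roughly as long as one sleep rather than four, because the waits overlap. Calling `timed_threads(cpu_task)` typically takes about as long as running the four computations one after another.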
Web crawling is mostly an I/O-bound task: you need to open an HTTP connection, send a request, and wait for the response. Yes, after you get the response you need to spend some CPU time parsing it, but apart from that it is mostly I/O work. So I believe threads are a suitable choice in this case.
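One way to sketch a threaded fetcher is with the standard library's `concurrent.futures.ThreadPoolExecutor`. This is only an illustration under some assumptions: `fetch` and `crawl_threaded` are hypothetical helper names, the pool size of 8 is arbitrary, and the fetch function is passed in as a parameter so it can be swapped for another HTTP client.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch(url, timeout=10):
    # Blocking download; the thread waits on the network here,
    # which lets other threads run in the meantime.
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl_threaded(urls, fetch_fn=fetch, max_workers=8):
    # Download all URLs using a bounded pool of worker threads.
    # Returns a dict mapping each URL to its body, or to the exception raised.
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_fn, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep going if one page fails; record the error instead.
                results[url] = exc
    return results
```

A call such as `crawl_threaded(["https://example.com/"])` would return the page bytes keyed by URL. Keeping `max_workers` small also limits how many requests are in flight at once.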
(And of course, respect robots.txt, and don't flood the servers with too many requests :-)
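For the robots.txt part, the standard library ships `urllib.robotparser`. The sketch below parses rules from a string so it runs without network access. In a real crawler you would more likely call `set_url(...)` and `read()` on the parser to load the site's actual file. The user agent name `example-crawler` and the helper names are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    # Build a parser from the text of a robots.txt file.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="example-crawler"):
    # Keep only the URLs that the rules allow this user agent to fetch.
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical rules, used only for this example.
rules = "User-agent: *\nDisallow: /private/\n"
parser = build_parser(rules)
candidates = [
    "https://example.com/index.html",
    "https://example.com/private/data.html",
]
permitted = allowed_urls(parser, candidates)
```

Filtering the URL list this way before any request is sent keeps the politeness check separate from the fetching code.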
Solution 2:
Another alternative is asynchronous I/O, which is often a better fit for this kind of I/O-bound task (unless processing a page is really expensive). You can try it with either asyncio or Tornado, using Tornado's httpclient module.
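A rough sketch of the asyncio approach follows. Since asyncio itself does not include an HTTP client, the example uses a simulated `fake_fetch` coroutine (an assumption for illustration only). In practice you would pass a coroutine that wraps a real asynchronous HTTP client, such as Tornado's. The names `crawl_async` and `max_concurrency` are likewise hypothetical.

```python
import asyncio


async def fake_fetch(url):
    # Simulated download: awaiting here yields control to the event loop,
    # much as waiting on a real network response would.
    await asyncio.sleep(0.1)
    return f"<html>{url}</html>"


async def crawl_async(urls, fetch_fn=fake_fetch, max_concurrency=5):
    # The semaphore caps the number of requests in flight at any moment.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return url, await fetch_fn(url)

    # Run all downloads concurrently in a single thread.
    pairs = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(pairs)
```

Running `asyncio.run(crawl_async(urls))` returns a dict of URL to page text. Everything happens in one thread, so there is no context-switching between threads, and the semaphore doubles as a simple way to avoid sending too many requests at once.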