Concurrency in Python
The next few posts are going to be about concurrency and async in Python. So, one non-tech thing about me is that I run a bookclub that meets once a month. In this bookclub, we discuss books we’ve read that month, and by the end of the meet we have a nice list of books that each person has mentioned and/or discussed. After I get back home, I typically try to get the Goodreads link for each book discussed and put together a more comprehensive list for people to refer to later on.
Initially I did this manually, but once the process got tedious, I decided to take a programmatic approach.
I wrote a small program to look up the book’s name along with the string “Goodreads” and do a best-effort retrieval of the book’s Goodreads link. Today I could probably do that using the OpenAI APIs, but that’s a post for a different day.
Coming back to the programmatic retrieval: I did this sequentially for a long time. And because I didn’t want it to look like a DoS attack was taking place, I introduced a sleep between each download.
Here’s a function that, given the name of a book, searches for the title along with “goodreads” and picks the result that is most likely to be the book’s Goodreads link -
from time import sleep

from googlesearch import search  # assuming the googlesearch-python package


def download_book(title):
    """Search for a book's Goodreads page and return a {title: link} dict."""
    title_dict = {}
    print(f"Processing {title}")
    first_result = None
    for counter, result in enumerate(search(title + " goodreads", num_results=5)):
        if counter == 0:
            first_result = result  # keep the top hit as a fallback
        if "goodreads.com/book/show" in result:
            title_dict[title] = result
            break
    if len(title_dict) == 0:
        title_dict[title] = first_result
    sleep(5)  # be polite to the search engine between lookups
    return title_dict
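For example, calling it on a single title prints the progress line and returns a one-entry dict. The link below is hypothetical; the actual value depends on what the search returns:

# Hypothetical usage; the exact link depends on live search results
links = download_book("Invisible Women")
print(links)  # {'Invisible Women': 'https://www.goodreads.com/book/show/...'}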
While I was thinking of a way to introduce concurrent processing using Python, this felt like an excellent use case to experiment with, so here we go -
- Let’s assume we have a list of 5 books for which we need the Goodreads links.
- It is possible that we do not obtain the Goodreads link for a particular book. This is okay.
- The goal is to try and obtain the Goodreads links for as many books in our list as possible in the shortest amount of time.
This is what the sequential retrieval of the book links looked like -
import time

import pandas as pd

time_start = time.time()

df = pd.read_csv('data/books_5.csv')
books_list = df["title"].tolist()
num_books = len(books_list)

# Look up each book one at a time
res = []
for book in books_list:
    res.append(download_book(book))

# Flatten the list of {title: link} dicts into two columns
book_list = {"Title": [], "Link": []}
for i in res:
    for k, v in i.items():
        book_list["Title"].append(k)
        book_list["Link"].append(v)

df = pd.DataFrame(book_list)
df.to_csv("data/book_links_5_seq.csv", index=False)

time_end = time.time()
time_taken = time_end - time_start

print(f"Downloaded {num_books} books in {time_taken:.2f} seconds")
This did the job but it took quite some time -
Processing The Talented Mr Ripley
Processing Ripley's game
Processing Invisible Women
Processing Beyond Interpretation
Processing Men without women

Downloaded 5 books in 28.96 seconds
Now let’s try the same program using concurrent.futures -
from concurrent import futures
import time

import pandas as pd

time_start = time.time()

df = pd.read_csv('data/books_5.csv')
books_list = df["title"].tolist()
num_books = len(books_list)

# Don't spin up more threads than there are books to look up
MAX_WORKERS = 5
workers = min(MAX_WORKERS, num_books)

# executor.map fans download_book out across the worker threads
# and returns the results in input order
with futures.ThreadPoolExecutor(workers) as executor:
    res = list(executor.map(download_book, sorted(books_list)))

book_list = {"Title": [], "Link": []}
for i in res:
    for k, v in i.items():
        book_list["Title"].append(k)
        book_list["Link"].append(v)

df = pd.DataFrame(book_list)
df.to_csv("data/book_links_5.csv", index=False)

time_end = time.time()
time_taken = time_end - time_start

print(f"Downloaded {num_books} books in {time_taken:.2f} seconds")
And unsurprisingly -
Processing Beyond Interpretation
Processing Invisible Women
Processing Men without women
Processing Ripley's game
Processing The Talented Mr Ripley

Downloaded 5 books in 6.70 seconds
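A side note: executor.map returns results in the same order as its inputs. If you’d rather handle each result as soon as it’s ready, concurrent.futures also offers submit and as_completed. Here’s a minimal sketch, reusing workers and books_list from above:

# Sketch: submit each lookup individually and collect the results
# as they complete, rather than in input order
with futures.ThreadPoolExecutor(workers) as executor:
    pending = [executor.submit(download_book, book) for book in books_list]
    res = [future.result() for future in futures.as_completed(pending)]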
So it’s clear that in this case, where a network call is involved, concurrent processing was much faster than sequential processing. This is because the network call is the bottleneck: the CPU sits mostly idle while waiting for each call to return, so the threads can take turns waiting. It also matters that the network calls were independent of each other and could therefore be overlapped.
The same reasoning and method apply to other I/O-bound tasks as well.
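To convince yourself that the speedup comes from overlapping waits rather than from extra CPU work, here’s a minimal, self-contained sketch; fake_io_task is a made-up stand-in for any I/O-bound call:

import time
from concurrent import futures

def fake_io_task(n):
    time.sleep(1)  # stand-in for a network call or a disk read
    return n

start = time.time()
with futures.ThreadPoolExecutor(5) as executor:
    results = list(executor.map(fake_io_task, range(5)))
# The five one-second waits overlap, so this finishes in roughly
# 1 second instead of the ~5 seconds a sequential loop would take
print(f"Finished in {time.time() - start:.2f}s")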
If you want to try this out, you can find the code and the data here.