
Concurrency in Python

This next set of posts is going to be about Concurrency and Async in Python. So, one non-tech thing about me is that I run a book club that meets once a month. At each meeting we discuss the books we’ve read that month, and by the end we have a nice list of books that each person has mentioned and/or discussed. After I get back home, I typically try to find the Goodreads link for each book discussed and create a more comprehensive list for people to refer to later on.

I used to do this manually at first, and once the process got tedious, I decided to take a programmatic approach.

I wrote a small program to look up the book’s name along with the string “Goodreads” and do a best-effort retrieval of the book’s link on Goodreads. Today I could probably do that using the OpenAI APIs, but that’s a post for a different day.

Coming back to the programmatic retrieval: I used to do this sequentially for a long time. And because I didn’t want it to look like a DoS attack was taking place, I introduced a sleep between each download.

Here’s a function that, given the name of a book, searches for that name along with “goodreads” and picks the result that is most likely to be the book’s Goodreads page -

from time import sleep
from googlesearch import search  # assuming the googlesearch-python package

def download_book(title):
    """Search for a book's Goodreads page and return a {title: link} dict."""
    title_dict = {}
    print(f"Processing {title}")
    first_result = None
    for counter, link in enumerate(search(title + " goodreads", num_results=5)):
        if counter == 0:
            first_result = link  # remember the top result as a fallback
        if "goodreads.com/book/show" in link:
            title_dict[title] = link  # found a direct book page
            break
    if len(title_dict) == 0:
        title_dict[title] = first_result
    sleep(5)  # throttle requests so the lookups don't resemble a DoS attack
    return title_dict
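For a single title, the function returns a one-entry dict mapping the title to its best-guess link. A quick sanity check might look like this (the link shown is a placeholder; the actual one depends on what the search returns):

print(download_book("Invisible Women"))
# e.g. {'Invisible Women': 'https://www.goodreads.com/book/show/...'}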

While I was thinking of a way to introduce concurrent processing using Python, this felt like an excellent use case to experiment with, so here we go -

This is how the sequential retrieval of the book links looked -

import pandas as pd
import time

time_start = time.time()

df = pd.read_csv('data/books_5.csv')
books_list = df["title"].tolist()
num_books = len(books_list)

# Look up each book one after the other
res = []
for book in books_list:
    res.append(download_book(book))

# Flatten the per-book dicts into two columns
book_list = {"Title": [], "Link": []}
for i in res:
    for k, v in i.items():
        book_list["Title"].append(k)
        book_list["Link"].append(v)

df = pd.DataFrame(book_list)
df.to_csv("data/book_links_5_seq.csv", index=False)

time_end = time.time()
time_taken = time_end - time_start

print(f"Downloaded {num_books} books in {time_taken:.2f} seconds")

This did the job, but it took quite some time -

Processing The Talented Mr Ripley
Processing Ripley's game
Processing Invisible Women
Processing Beyond Interpretation
Processing Men without women

Downloaded 5 books in 28.96 seconds

Now let’s try the same program using concurrent.futures -

from concurrent import futures
import pandas as pd
import time

time_start = time.time()

df = pd.read_csv('data/books_5.csv')
books_list = df["title"].tolist()
num_books = len(books_list)

MAX_WORKERS = 5
workers = min(MAX_WORKERS, num_books)  # no point spawning more threads than books

with futures.ThreadPoolExecutor(workers) as executor:
    # executor.map runs the calls concurrently but returns results
    # in the same order as the (sorted) input
    res = list(executor.map(download_book, sorted(books_list)))

book_list = {"Title": [], "Link": []}
for i in res:
    for k, v in i.items():
        book_list["Title"].append(k)
        book_list["Link"].append(v)

df = pd.DataFrame(book_list)
df.to_csv("data/book_links_5.csv", index=False)

time_end = time.time()
time_taken = time_end - time_start

print(f"Downloaded {num_books} books in {time_taken:.2f} seconds")

And unsurprisingly -

Processing Beyond Interpretation
Processing Invisible Women
Processing Men without women
Processing Ripley's game
Processing The Talented Mr Ripley

Downloaded 5 books in 6.70 seconds

So, as expected, in this case where a network call is involved, concurrent processing was much faster than sequential processing. The network call is the bottleneck here: the CPU sits mostly idle while waiting for each response, so several lookups can wait in parallel instead of one after the other. It’s also important to note that the lookups were independent of each other, which is what allowed them to run concurrently.
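If you’d rather handle each result as soon as its lookup finishes, rather than waiting on the whole batch, concurrent.futures also offers submit together with as_completed. Here’s a minimal sketch of that variant, reusing the same download_book function:

from concurrent import futures

with futures.ThreadPoolExecutor(max_workers=5) as executor:
    # submit returns a Future immediately; as_completed yields each
    # future as soon as its call finishes
    running = [executor.submit(download_book, book) for book in books_list]
    res = []
    for future in futures.as_completed(running):
        res.append(future.result())  # re-raises here if the lookup failed

Unlike executor.map, as_completed yields results in the order the calls finish, not the order they were submitted, so the final list is no longer sorted by title.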

The same reasoning and method can be applied to other I/O-bound tasks as well, not just network calls.
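For instance, reading a batch of files off disk follows the exact same pattern. A minimal sketch, with hypothetical file paths:

from concurrent import futures

def read_file(path):
    # a plain blocking read; the thread simply waits on disk I/O
    with open(path) as f:
        return f.read()

paths = ["data/a.txt", "data/b.txt", "data/c.txt"]  # hypothetical files
with futures.ThreadPoolExecutor(min(5, len(paths))) as executor:
    contents = list(executor.map(read_file, paths))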

If you want to try this out, you can find the code and the data here.