Yesterday I had to investigate why the batch system was so slow, sometimes slow enough to trigger a timeout.
I have to import a file into my system: a file with 150,000 lines. I must process each line and insert it into my database.
What does each line represent? A vehicle. Each line holds the characteristics of a second-hand vehicle: the brand, the color, the availability, and the city where it's being sold.
But before inserting, I have to check that the line is valid:
the brand is known by my system, the color is known by my system, the availability is positive (to avoid storing 0 values), and the city is in our network.
So, here is the code:
```python
for line in lines:
    brand = line["brand"]
    if not check_brand(brand):
        continue
    color = line["color"]
    if not check_color(color):
        continue
    city = line["city"]
    if not check_city(city):
        continue
    amount = line["amount"]
    if not check_amount(amount):
        continue
    item = parse_line(line)
    db.save(item)
```
This seems pretty easy. But for 150k lines, it takes more than 3 hours, so clients start using the system before it's updated.
Here is the optimization I've done to speed it up 30 times. I've applied the same approach to other systems to improve their response time.
Looking into the checks, each one performs a database access and a loop like this one:
```python
def check_color(color):
    all_colors = Color.query.all()
    for stored_color in all_colors:
        if color == stored_color.name:
            return True
    return False
```
There are two optimizations I can do.
The first one is to avoid a database access on every check. 3 checks, 150k lines: 450k database accesses. That's a lot of I/O.
The brands, colors, and cities are the same for every check, so download them once before iterating.
```python
all_brands = Brand.query.all()
all_colors = Color.query.all()
all_cities = City.query.all()

for line in lines:
    brand = line["brand"]
    if not check_brand(brand, all_brands):
        continue
    color = line["color"]
    if not check_color(color, all_colors):
        continue
    city = line["city"]
    if not check_city(city, all_cities):
        continue
    amount = line["amount"]
    if not check_amount(amount):
        continue
    item = parse_line(line)
    db.save(item)
```
This leads to only 3 database accesses. A lot less.
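Each check now just receives the pre-loaded list instead of querying the database itself. A minimal sketch of the adapted `check_color`, using `SimpleNamespace` as a hypothetical stand-in for the ORM rows:

```python
from types import SimpleNamespace

def check_color(color, all_colors):
    # Same loop as before, but over the list loaded once, outside the loop.
    for stored_color in all_colors:
        if color == stored_color.name:
            return True
    return False

# Hypothetical stand-ins for Color rows; only the .name attribute matters here.
all_colors = [SimpleNamespace(name=n) for n in ("red", "blue", "green")]
print(check_color("blue", all_colors))    # True
print(check_color("violet", all_colors))  # False
```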
Now let's move on to another optimization, one often used in advanced search systems.
For each check, I must iterate over a list to find the current brand, color, or city.
3 checks per line, 150k lines: 450k list scans.
This matters even more when the lists get bigger (thousands of elements).
Instead of lists, I use dictionaries. Instead of iterating over the list, I check the presence of a key.
Accessing a dictionary doesn't iterate over the complete dictionary: each key is hashed and indexed. This is the same idea behind HashMaps in Java.
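The difference is easy to measure. A quick sketch with `timeit` (the sizes here are made up for illustration):

```python
import timeit

# Hypothetical data: 10,000 known color names, probing one near the end,
# which is the worst case for a list scan.
colors_list = [f"color-{i}" for i in range(10_000)]
colors_dict = {name: True for name in colors_list}

list_time = timeit.timeit(lambda: "color-9999" in colors_list, number=1_000)
dict_time = timeit.timeit(lambda: "color-9999" in colors_dict, number=1_000)
print(dict_time < list_time)  # True: the hashed lookup is far faster
```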
Let’s see the result:
```python
all_brands = Brand.query.all()
all_colors = Color.query.all()
all_cities = City.query.all()

dict_brands = {}
for brand in all_brands:
    dict_brands[brand.name] = brand

dict_colors = {}
for color in all_colors:
    dict_colors[color.name] = color

dict_cities = {}
for city in all_cities:
    dict_cities[city.name] = city

for line in lines:
    brand = line["brand"]
    if not check_brand(brand, dict_brands):
        continue
    color = line["color"]
    if not check_color(color, dict_colors):
        continue
    city = line["city"]
    if not check_city(city, dict_cities):
        continue
    amount = line["amount"]
    if not check_amount(amount):
        continue
    item = parse_line(line)
    db.save(item)

# Each check becomes a simple key lookup:
def check_color(color, dict_colors):
    return color in dict_colors
```
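Since the dictionary values are never used, only the keys are checked, a set would do just as well, and each build loop collapses into a comprehension. A sketch, again with `SimpleNamespace` as a hypothetical stand-in for the ORM rows:

```python
from types import SimpleNamespace

# Hypothetical stand-ins for Color rows; only the .name attribute matters here.
all_colors = [SimpleNamespace(name=n) for n in ("red", "blue", "green")]

# A set comprehension replaces the dict-building loop; membership tests
# on a set are hashed lookups, just like dict keys.
known_colors = {color.name for color in all_colors}

def check_color(color, known_colors):
    return color in known_colors

print(check_color("blue", known_colors))    # True
print(check_color("violet", known_colors))  # False
```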
This way, processing a single line involves no loop and no database access.
As I said, this kind of optimization improved the response time by a factor of 30.
If you want to learn more about writing good-quality code, make sure to follow me on YouTube.