Yesterday I had to investigate why the batch system was so slow, sometimes slow enough to trigger a timeout.
I have to import a file into my system: a file with 150,000 lines. I must process each line and insert it into my database.
What does each line represent? A vehicle. Each line holds the characteristics of a second-hand vehicle: the brand, the color, the availability, and the city where it's being sold.
But before inserting, I have to check that the line is valid:
the brand is known by my system, the color is known by my system, the availability is positive (to avoid storing 0 values), and the city is in our network.
So, here is the code:
```python
for line in lines:
    brand = line["brand"]
    if not check_brand(brand):
        continue
    color = line["color"]
    if not check_color(color):
        continue
    city = line["city"]
    if not check_city(city):
        continue
    amount = line["amount"]
    if not check_amount(amount):
        continue
    item = parse_line(line)
    db.save(item)
```
This seems pretty easy. But for 150k lines, it takes more than 3 hours, so clients start using the system before it's updated.
Here is the optimization I've done to speed it up 30 times. I've applied the same approach to other systems to improve their response time.
Looking into the checks, each one performs a database access and a loop like this one:
```python
def check_color(color):
    all_colors = Color.query.all()
    for stored_color in all_colors:
        if color == stored_color.name:
            return True
    return False
```
There are two optimizations I can do.
The first one is to avoid a database access on every check. 3 checks, 150k lines: 450k database accesses. That's a lot of I/O.
The brands, colors, and cities are the same for every check, so download them once before iterating.
```python
all_brands = Brand.query.all()
all_colors = Color.query.all()
all_cities = City.query.all()

for line in lines:
    brand = line["brand"]
    if not check_brand(brand, all_brands):
        continue
    color = line["color"]
    if not check_color(color, all_colors):
        continue
    city = line["city"]
    if not check_city(city, all_cities):
        continue
    amount = line["amount"]
    if not check_amount(amount):
        continue
    item = parse_line(line)
    db.save(item)
```
This leads to only 3 database accesses. A lot less.
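Each check now just receives the pre-loaded list instead of querying the database itself. A minimal sketch of the adapted `check_color`, using `SimpleNamespace` as a hypothetical stand-in for the ORM rows:

```python
from types import SimpleNamespace

def check_color(color, all_colors):
    # Same loop as before, but over the list loaded once, outside the loop.
    for stored_color in all_colors:
        if color == stored_color.name:
            return True
    return False

# Hypothetical stand-ins for Color rows; only the .name attribute matters here.
all_colors = [SimpleNamespace(name=n) for n in ("red", "blue", "green")]
print(check_color("blue", all_colors))    # True
print(check_color("violet", all_colors))  # False
```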
Now let's move on to another optimization, one often used in advanced search systems.
For each check, I must iterate over a list to find the current brand, color, or city.
3 checks per line, 150k lines: 450k list scans.
This matters even more when the lists get bigger (thousands of elements).
Instead of lists, I use dictionaries. Instead of iterating over the list, I check the presence of a key.
Accessing a dictionary doesn't iterate over the complete dictionary: each key is hashed and indexed. This is the same idea behind HashMaps in Java.
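The difference is easy to measure. A quick sketch with `timeit` (the sizes here are made up for illustration):

```python
import timeit

# Hypothetical data: 10,000 known color names, probing one near the end,
# which is the worst case for a list scan.
colors_list = [f"color-{i}" for i in range(10_000)]
colors_dict = {name: True for name in colors_list}

list_time = timeit.timeit(lambda: "color-9999" in colors_list, number=1_000)
dict_time = timeit.timeit(lambda: "color-9999" in colors_dict, number=1_000)
print(dict_time < list_time)  # True: the hashed lookup is far faster
```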
Let’s see the result:
```python
all_brands = Brand.query.all()
all_colors = Color.query.all()
all_cities = City.query.all()

dict_brands = {}
for brand in all_brands:
    dict_brands[brand.name] = brand

dict_colors = {}
for color in all_colors:
    dict_colors[color.name] = color

dict_cities = {}
for city in all_cities:
    dict_cities[city.name] = city

for line in lines:
    brand = line["brand"]
    if not check_brand(brand, dict_brands):
        continue
    color = line["color"]
    if not check_color(color, dict_colors):
        continue
    city = line["city"]
    if not check_city(city, dict_cities):
        continue
    amount = line["amount"]
    if not check_amount(amount):
        continue
    item = parse_line(line)
    db.save(item)

# Each check becomes a simple key lookup:
def check_color(color, dict_colors):
    return color in dict_colors
```
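Since the dictionary values are never used, only the keys are checked, a set would do just as well, and each build loop collapses into a comprehension. A sketch, again with `SimpleNamespace` as a hypothetical stand-in for the ORM rows:

```python
from types import SimpleNamespace

# Hypothetical stand-ins for Color rows; only the .name attribute matters here.
all_colors = [SimpleNamespace(name=n) for n in ("red", "blue", "green")]

# A set comprehension replaces the dict-building loop; membership tests
# on a set are hashed lookups, just like dict keys.
known_colors = {color.name for color in all_colors}

def check_color(color, known_colors):
    return color in known_colors

print(check_color("blue", known_colors))    # True
print(check_color("violet", known_colors))  # False
```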
This way, processing a single line involves no loop and no database access.
As I said, this kind of optimization improved the response time by a factor of 30.
If you want to learn more about writing good-quality code, make sure to follow me on YouTube.