Introduction to Asynchronous Programming
Abstract
Asynchronous programming is often used for I/O-bound workloads, such as a server performing a large number of disk I/O tasks or handling a huge volume of network requests.
Asynchronous here means the program pauses the current operation and does other work while waiting for an I/O operation to complete. If you have a reverse-engineering background, this may remind you of task scheduling in an operating system, and they do share the same idea: the operating system pauses a thread and performs a context switch to do other jobs while that thread waits for a certain event.
However, context switching in the operating system introduces a great amount of overhead, and many unnecessary operations are performed in the kernel (system calls are costly!). Asynchronous programming therefore moves the idea of threading into user space, where the program manages its own tasks.
Scenario
Imagine you are a heavy user of cloud storage with 4TB of data on the cloud. One day, you discover another cloud service that offers more space at a lower cost, so you decide to move your data from your current provider to the new one. Moving such a large amount of data is still costly, though. You find a service that can migrate the data for you at $1 per GB, so moving your 4TB (4096GB) drive would cost $4096.
However, a 4TB hard drive costs at most $300, and a Samsung 4TB SSD costs about $500, so with that much cash you would be better off switching to offline storage. But your internet speed would make the process take months to complete. (Thanks to the poor infrastructure in Sydney, most users can only afford up to 20Mbps of uplink bandwidth.)
Then you discover a VPS provider that offers up to 10Gbps of network bandwidth and storage at $0.08 per GB-month, which comes to roughly $330 for storing the entire 4TB for a month. You clearly do not need to spend even that much, since you only need a few days, or a few hours, to move your data. This looks like the better choice.
Simulation code of your cloud providers
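Since the original simulation code is not reproduced here, below is a minimal sketch of what such a simulation might look like. The file names, payloads, and 0.01-second delays are all illustrative assumptions, not part of the original article.

```python
import time

# Hypothetical stand-ins for the two cloud providers: each transfer just
# sleeps to mimic network latency instead of touching a real network.
FILES = [f"file_{i}.bin" for i in range(4)]

def download(name: str) -> bytes:
    """Fetch one file from the old provider (simulated network delay)."""
    time.sleep(0.01)
    return b"data-" + name.encode()

def upload(name: str, data: bytes) -> None:
    """Push one file to the new provider (simulated network delay)."""
    time.sleep(0.01)
```

Every approach that follows can be plugged into these two functions, so the timing differences come purely from how the calls are scheduled.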
Naive Approach - Download and then Upload
This solution is very simple: download the entire cloud drive to the VPS's storage, then upload it to your new provider. The process is much like copying files, and it can be done with nothing but a browser.
Automated Approach - Download and Upload files using a script
We can use the requests library in Python to perform the downloading and uploading automatically.
However, as the screenshots show, transferring a single file takes a long time: the script waits for the download to finish, and then waits again for the upload. This approach opens only one connection for a single file at a time, which is very inefficient.
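A minimal sketch of such a sequential script, assuming both providers expose plain HTTP endpoints (the URLs below are placeholders, not real APIs):

```python
import requests

# Placeholder endpoints -- real providers would require auth, pagination, etc.
OLD = "https://old-provider.example.com/files"
NEW = "https://new-provider.example.com/files"

def migrate_file(name: str) -> None:
    # Download the whole file, blocking until it finishes...
    resp = requests.get(f"{OLD}/{name}")
    resp.raise_for_status()
    # ...then upload it, blocking again. Only one connection is active at
    # any moment, so the link sits idle in one direction half the time.
    requests.put(f"{NEW}/{name}", data=resp.content).raise_for_status()

def migrate_all(names) -> None:
    for name in names:   # strictly one file after another
        migrate_file(name)
```

Nothing overlaps here: the total time is the sum of every download plus every upload.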
Multi-threading approach
Using multiple threads lets us process multiple files at once. This creates multiple connections to the remote, making the transfer more efficient.
Yet each thread still needs to wait until its own I/O finishes. We can add more threads to relieve the issue, but that introduces overhead on the OS side.
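A sketch of the multi-threaded variant, with `time.sleep` standing in for the blocking network waits (file names, timings, and thread count are illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def migrate_file(name: str) -> str:
    time.sleep(0.05)   # blocking "download": this thread can do nothing else
    time.sleep(0.05)   # blocking "upload"
    return name

def migrate_all(names, threads: int = 4):
    # Several files are in flight at once, one per thread.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(migrate_file, names))

start = time.perf_counter()
done = migrate_all([f"file_{i}.bin" for i in range(8)], threads=4)
elapsed = time.perf_counter() - start
# 8 files of 0.1 s each, but 4 at a time: roughly 0.2 s instead of 0.8 s
```

The speedup comes from overlapping the waits, but each waiting thread still occupies an OS thread, which is exactly the overhead the next section removes.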
Asynchronous I/O Approach - Workers and Jobs
Since we already have multiple threads and we do not want them to wait, instead of blocking them on I/O operations we put them to work on other jobs, such as processing other files.
A worker is a dispatch unit (a thread) that works on jobs, and a job is a unit of the actual work the program needs to do (in this case, migrating one file).
By declaring a function async, it becomes a coroutine, which means it can be suspended and resumed. We use await to call another coroutine inside a coroutine; this keyword marks a point where the coroutine may be paused.
Hence, in this scenario, whenever the program hits await and needs to wait for an I/O operation to finish, it immediately pauses that job and works on another one that is ready.
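The same migration, sketched as coroutines: `await asyncio.sleep()` stands in for a network wait, and while one job is suspended there, the event loop runs the others. Names and timings are illustrative.

```python
import asyncio
import time

async def migrate_file(name: str) -> str:
    await asyncio.sleep(0.05)   # "download": the job is suspended here
    await asyncio.sleep(0.05)   # "upload": suspended again
    return name

async def main():
    jobs = [migrate_file(f"file_{i}.bin") for i in range(8)]
    # All 8 jobs interleave on a single thread.
    return await asyncio.gather(*jobs)

start = time.perf_counter()
done = asyncio.run(main())
elapsed = time.perf_counter() - start
# All 8 files overlap, so the total is roughly 0.1 s on one OS thread.
```

One thread now keeps eight transfers in flight, with no per-thread OS overhead.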
It makes a request precisely every 5 seconds while still downloading and uploading multiple files, which is something threading alone cannot achieve.
Multiple workers
Without threading, a program cannot run on multiple CPU cores simultaneously, so multiple workers are still required to achieve maximum performance. In this example, we have 8 workers.
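One way to sketch this: each worker is an OS thread running its own event loop, and the jobs are split across them. The worker count, file count, and round-robin split below are all illustrative assumptions.

```python
import asyncio
import threading

async def migrate_file(name: str) -> str:
    await asyncio.sleep(0.01)   # simulated transfer
    return name

def worker(names, results):
    # Each worker thread runs its own event loop over its share of jobs.
    async def run():
        return await asyncio.gather(*(migrate_file(n) for n in names))
    results.extend(asyncio.run(run()))   # list.extend is atomic under the GIL

NAMES = [f"file_{i}.bin" for i in range(16)]
results = []
threads = [
    threading.Thread(target=worker, args=(NAMES[i::8], results))  # round-robin split
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Within each worker the jobs still interleave asynchronously; the threads only exist so the workers can land on different cores.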
Throttling the traffic - Semaphore
In practice, we do not want jobs to sleep and fire a request only once per period; we want the program to be as efficient as possible. But we cannot simply delete the asyncio.sleep call, as it is the only point where the main() coroutine is paused and the worker is handed to other jobs.
So we make it sleep only for a tiny amount of time. Yet this would make the program generate a bulk of requests (thousands?) and create congestion in the network. Hence we need a way to limit the number of jobs in flight.
A semaphore is a thread-safe integer value that blocks a thread whenever no resource is available. Here we limit the maximum number of jobs to 10.
Note that asyncio has its own semaphore implementation, but it is not thread-safe, so it can only be used within a single event loop.
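A sketch of throttling with a `threading.Semaphore` shared across all worker threads: at most 10 jobs may hold a slot at once, no matter which event loop they belong to. The limit, job body, and bookkeeping are illustrative, and the acquire is pushed into an executor so it does not block the event loop itself.

```python
import asyncio
import threading

LIMIT = threading.Semaphore(10)   # global cap on in-flight jobs
peak = 0                          # highest concurrency observed (for demo only)
active = 0
lock = threading.Lock()

async def migrate_file(name):
    global peak, active
    # Block in a helper thread, not in the event loop, while waiting for a slot.
    await asyncio.get_running_loop().run_in_executor(None, LIMIT.acquire)
    try:
        with lock:
            active += 1
            peak = max(peak, active)
        await asyncio.sleep(0.01)   # the actual transfer
        with lock:
            active -= 1
    finally:
        LIMIT.release()

def worker(names):
    async def run():
        await asyncio.gather(*(migrate_file(n) for n in names))
    asyncio.run(run())

threads = [
    threading.Thread(target=worker, args=([f"f{t}_{i}" for i in range(20)],))
    for t in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
# 80 jobs were launched across 4 event loops, yet never more than 10 at once.
```

An `asyncio.Semaphore` could not enforce this cap, since each worker runs its own event loop.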
Remove the I/O bottleneck - Piping
In all of our code so far, we assumed that disk I/O is required when moving data between online drives. Yet when we transfer data between two physical drives, we do not copy it to a third drive first; the data only passes through memory.
We can combine the fetch and upload coroutines into one, so that both operations happen at the same time, without any disk I/O.
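A minimal sketch of the piped version: `fetch` is an async generator that yields chunks as they arrive, and `upload` consumes them directly, so no chunk is ever written to disk. The chunk contents, sizes, and delays are illustrative.

```python
import asyncio

async def fetch(name):
    # Pretend the file arrives from the old provider in 3 chunks.
    for i in range(3):
        await asyncio.sleep(0.01)           # network read
        yield f"{name}-chunk{i}".encode()

async def upload(name, chunks):
    sent = 0
    async for chunk in chunks:              # each chunk goes straight back out
        await asyncio.sleep(0.01)           # network write
        sent += len(chunk)
    return sent

async def migrate_file(name):
    # Download and upload share one in-memory stream of chunks.
    return await upload(name, fetch(name))

total = asyncio.run(migrate_file("file_0.bin"))
```

Only one chunk lives in memory at a time, so even a 4TB migration needs neither disk space nor a large buffer on the VPS.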