So I have a program that I made that needs to send a lot (like 10,000+) of GET requests to a URL and I need it to be as fast as possible. When I first created the program I just put the connections into a for loop but it was really slow because it would have to wait for each connection to complete before continuing. I wanted to make it faster so I tried using threads and it made it somewhat faster but I am still not satisfied.
I'm guessing the correct way to go about this and making it really fast is using an asynchronous connection and connecting to all of the URLs. Is this the right approach?
Also, I have been trying to understand threads and how they work but I can't seem to get it. The computer I am on has an Intel Core i7-3610QM quad-core processor. According to Intel's website for the specifications for this processor, it has 8 threads. Does this mean I can create 8 threads in a Java application and they will all run concurrently? Any more than 8 and there will be no speed increase?
What exactly does the number represent next to "Threads" in the task manager under the "Performance" tab? Currently, my task manager is showing "Threads" as over 1,000. Why is it this number and how can it even go past 8 if that's all my processor supports? I also noticed that when I tried my program with 500 threads as a test, the number in the task manager increased by 500 but it had the same speed as if I set it to use 8 threads instead. So if the number is increasing according to the number of threads I am using in my Java application, then why is the speed the same?
Also, I have tried doing a small test with threads in Java but the output doesn't make sense to me. Here is my Test class:
import java.text.SimpleDateFormat;
import java.util.Date;
public class Test {
private static int numThreads = 3;
private static int numLoops = 100000;
private static SimpleDateFormat dateFormat = new SimpleDateFormat("[hh:mm:ss] ");
public static void main(String[] args) throws Exception {
for (int i=1; i<=numThreads; i++) {
final int threadNum = i;
new Thread(new Runnable() {
public void run() {
System.out.println(dateFormat.format(new Date()) + "Start of thread: " + threadNum);
for (int i=0; i<numLoops; i++)
for (int j=0; j<numLoops; j++);
System.out.println(dateFormat.format(new Date()) + "End of thread: " + threadNum);
}
}).start();
Thread.sleep(2000);
}
}
}
This produces an output such as:
[09:48:51] Start of thread: 1
[09:48:53] Start of thread: 2
[09:48:55] Start of thread: 3
[09:48:55] End of thread: 3
[09:48:56] End of thread: 1
[09:48:58] End of thread: 2
Why does the third thread start and end right away while the first and second take 5 seconds each? If I add more that 3 threads, the same thing happens for all threads above 2.
Sorry if this was a long read, I had a lot of questions. Thanks in advance.
Threading is about workers; asynchrony is about tasks. In asynchronous single-threaded workflows you have a graph of tasks where some tasks depend on the results of others; as each task completes it invokes the code that schedules the next task that can run, given the results of the just-completed task.
The differences between asynchronous and synchronous include: Async is multi-thread, which means operations or programs can run in parallel. Sync is single-thread, so only one operation or program will run at a time. Async is non-blocking, which means it will send multiple requests to a server.
Concurrency is having two tasks run in parallel on separate threads. However, asynchronous methods run in parallel but on the same 1 thread.
So why asyncio is faster than multi-threading if they both belong to asynchronous programming? It's because asyncio is more robust with task scheduling and provides the user with full control of code execution.
Your processor has 8 cores, not threads. This does in fact mean that only 8 things can be running at any given moment. That doesn't mean that you are limited to only 8 threads however.
When a thread is synchronously opening a connection to a URL it will often sleep while it waits for the remote server to get back to it. While that thread is sleeping other threads can be doing work. If you have 500 threads and all 500 are sleeping then you aren't using any of the cores of your CPU.
On the flip side, if you have 500 threads and all 500 threads want to do something then they can't all run at once. To handle this scenario there is a special tool. Processors (or more likely the operating system or some combination of the two) have a scheduler which determines which threads get to be actively running on the processor at any given time. There are many different rules and sometimes random activity that controls how these schedulers work. This may explain why in the above example thread 3 always seems to finish first. Perhaps the scheduler is preferring thread 3 because it was the most recent thread to be scheduled by the main thread, it can be impossible to predict the behavior sometimes.
Now to answer your question regarding performance. If opening a connection never involved a sleep then it wouldn't matter if you were handling things synchronously or asynchronously you would not be able to get any performance gain above 8 threads. In reality, a lot of the time involved in opening a connection is spent sleeping. The difference between asynchronous and synchronous is how to handle that time spent sleeping. Theoretically you should be able to get nearly equal performance between the two.
With a multi-threaded model you simply create more threads than there are cores. When the threads hit a sleep they let the other threads do work. This can sometimes be easier to handle because you don't have to write any scheduling or interaction between the threads.
With an asynchronous model you only create a single thread per core. If that thread needs to sleep then it doesn't sleep but actually has to have code to handle switching to the next connection. For example, assume there are three steps in opening a connection (A,B,C):
while (!connectionsList.isEmpty()) {
for(Connection connection : connectionsList) {
if connection.getState() == READY_FOR_A {
connection.stepA();
//this method should return immediately and the connection
//should go into the waiting state for some time before going
//into the READY_FOR_B state
}
if connection.getState() == READY_FOR_B {
connection.stepB();
//same immediate return behavior as above
}
if connection.getState() == READY_FOR_C {
connection.stepC();
//same immediate return behavior as above
}
if connection.getState() == WAITING {
//Do nothing, skip over
}
if connection.getState() == FINISHED {
connectionsList.remove(connection);
}
}
}
Notice that at no point does the thread sleep so there is no point in having more threads than you have cores. Ultimately, whether to go with a synchronous approach or an asynchronous approach is a matter of personal preference. Only at absolute extremes will there be performance differences between the two and you will need to spend a long time profiling to get to the point where that is the bottleneck in your application.
It sounds like you're creating a lot of threads and not getting any performance gain. There could be a number of reasons for this.
If I were you I would use a tool like JVisualVM to profile your application when running with some smallish number of threads (20). JVisualVM has a nice colored thread graph which will show when threads are running, blocking, or sleeping. This will help you understand the thread/core relationship as you should see that the number of running threads is less than the number of cores you have. In addition if you see a lot of blocked threads then that can help lead you to your bottleneck (if you see a lot of blocked threads use JVisualVM to create a thread dump at that point in time and see what the threads are blocked on).
Some concepts:
You can have many threads in the system, but only some of them (max 8 in your case) will be "scheduled" on the CPU at any point of time. So, you cannot get more performance than 8 threads running in parallel. In fact the performance will probably go down as you increase the number of threads, because of the work involved in creating, destroying and managing threads.
Threads can be in different states : http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.State.html Out of those states, the RUNNABLE threads stand to get a slice of CPU time. Operating System decides assignment of CPU time to threads. In a regular system with 1000's of threads, it can be completely unpredictable when a certain thread will get CPU time and how long it will be on CPU.
About the problem you are solving:
You seem to have figured out the correct solution - making parallel asynchronous network requests. However, practically speaking starting 10000+ threads and that many network connections, at the same time, may be a strain on the system resources and it may just not work. This post has many suggestions for asynchronous I/O using Java. (Tip: Don't just look at the accepted answer)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With