These are some improvements to the statistics db creation, which add up to reduce the time required to generate the db for the MAWI dataset by more than two minutes.
Commit dba4d07cd17d926f48dc437a4bda7b573cf98678 replaces the first loop scanning the whole pcap file. The original code used FileSniffer and SnifferIterator from libtins to count the packets and extract the last timestamp, which is convenient but slow, since SnifferIterator seems to create an internal copy of each packet even when the packet is never read. I replaced this with a function that uses libpcap (which libtins is based on) directly. While the code is pretty ugly (you can only do so much when using a C-style API), it's about ten times as fast.
Commit f6b6b43b9fab4a032781aacfd4388a67406190a4 replaces the call to tcpPkt.mss() with a piece of code that manually searches and extracts the MSS option, if available. The problem with tcpPkt.mss() was that it throws an exception if the MSS option isn't found, and the consequent stack unwinding caused a noticeable slowdown, especially with PCAPs containing lots of TCP packets without this option. For the MAWI dataset, this change alone reduced the required time by 90 seconds.
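For reference, a manual MSS lookup that reports a missing option through its return value rather than an exception could look roughly like the sketch below. It works on the raw TCP option bytes and does not use libtins, so it is not the code from the commit; the function name `findMss` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: scans a raw TCP options block for the MSS option
// (kind 2, length 4). Returns true and stores the value in `mss` if found;
// returns false, without throwing, if the option is absent or the block is
// malformed.
inline bool findMss(const std::uint8_t* opts, std::size_t len,
                    std::uint16_t& mss) {
    std::size_t i = 0;
    while (i < len) {
        const std::uint8_t kind = opts[i];
        if (kind == 0) {
            break;  // End of Option List
        }
        if (kind == 1) {
            ++i;    // No-Operation is a single byte without a length field
            continue;
        }
        if (i + 1 >= len) {
            break;  // length byte is missing
        }
        const std::uint8_t optLen = opts[i + 1];
        if (optLen < 2 || i + optLen > len) {
            break;  // malformed or truncated option
        }
        if (kind == 2 && optLen == 4) {
            // The MSS value is a 16-bit integer in network byte order.
            mss = static_cast<std::uint16_t>((opts[i + 2] << 8) | opts[i + 3]);
            return true;
        }
        i += optLen;
    }
    return false;
}
```

With this shape, a packet without the option costs one short loop and a `false` return, with no exception thrown and no stack unwinding.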
The third commit includes only minor cleanup changes.
I also sent a pull request to the libtins project to avoid excessive malloc() calls when converting a MAC address to a string. This was [accepted and merged](https://github.com/mfontanini/libtins/commit/3659d89c257676da6e6ddf6298252aecc5756bdb), so the next libtins release should provide another performance boost.
Everything was tested on Arch Linux and macOS 10.12.