The Code Behind LeakProbe
How is such a huge database built and structured?
LeakProbe is primarily coded in PHP (to all the haters: yes, PHP 7 is actually pretty damn good!), the database is MySQL and the front end is HTML/JS. It's a pretty standard stack and probably one of the most popular on the web.
The database has approximately 2.6BN rows and must return your search results in no more than a few seconds (we're an impatient bunch these days). To achieve this we keep our tables as lean as possible, and indexing is a must! Indexing is a very laborious process and can take days (yes, you heard right) if the dataset is big enough.
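To see why indexing matters so much, here is a minimal sketch of what an index does to a lookup. It is not LeakProbe's code: it is written in Python and uses SQLite (bundled with Python) as a stand-in for MySQL so that it runs anywhere, and the table name `leaks`, its columns and the index name `idx_leaks_email` are made up for illustration.

```python
import sqlite3

# Stand-in for MySQL: an in-memory SQLite database with a hypothetical lean table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leaks (email TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO leaks VALUES (?, ?)",
    ((f"user{i}@example.com", "demo") for i in range(10_000)),
)

def plan(sql, params=()):
    """Return the query planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

query = "SELECT source FROM leaks WHERE email = ?"
args = ("user4242@example.com",)

before = plan(query, args)  # no index yet: the planner has to scan every row
conn.execute("CREATE INDEX idx_leaks_email ON leaks (email)")
after = plan(query, args)   # with the index: the planner searches the index instead

print("before:", before)
print("after: ", after)
```

Without the index the planner reports a scan of the whole table, and with it a search using `idx_leaks_email`. MySQL has its own `EXPLAIN` and `CREATE INDEX` statements that play the same roles, although the output looks different.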
One of the databases we added had 1.4BN rows, all of which had to be indexed. It took around 15 minutes per 30 million rows. You can do the math, but it’s a long time. We also try to keep our tables to around 50M-60M rows, which means that if we do have to run a large operation on a whole table it won’t take too long. We try to avoid this though, as it might slow down searches for our users.
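For anyone who would rather not do the math by hand, here is a quick back-of-the-envelope sketch in Python. It assumes indexing time scales linearly with the number of rows, which is a simplification, and the 55M rows-per-table figure is just the midpoint of the 50M-60M range quoted above.

```python
import math

ROWS_TO_INDEX = 1_400_000_000  # the 1.4BN-row database
BATCH_ROWS = 30_000_000        # rows indexed per batch
BATCH_MINUTES = 15             # minutes taken per batch

def indexing_minutes(rows, batch_rows=BATCH_ROWS, batch_minutes=BATCH_MINUTES):
    """Estimate indexing time, assuming it scales linearly with row count."""
    return rows / batch_rows * batch_minutes

def tables_needed(rows, rows_per_table):
    """Number of tables required to hold `rows` at a given table size."""
    return math.ceil(rows / rows_per_table)

minutes = indexing_minutes(ROWS_TO_INDEX)
print(f"{minutes:.0f} minutes, about {minutes / 60:.1f} hours")
print(tables_needed(2_600_000_000, 55_000_000), "tables at 55M rows each")
```

At the quoted rate that is roughly 700 minutes, or a little under 12 hours, for this one database. Splitting 2.6BN rows into tables of about 55M rows each works out to 48 tables.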
A lot of detractors say that MySQL will struggle with this many rows and propose using a different DB (MSSQL, Oracle), but we’ve found MySQL handles it like a champ. On average a search takes 2-4 seconds. We think that’s pretty damn good for traversing 2.6BN rows.
The back-end scripting is handled by our old friend PHP. Although it gets a lot of hate, version 7 is blazingly fast and has added many new features which make coding in PHP a little less dangerous (strict types, anyone?). This lets LeakProbe serve up results as quickly as we get them from the database. Although the real donkey work happens in MySQL, we get no bottlenecks on the PHP side.
We also use PHP for parsing and importing our leaked databases. Yep, you heard that right. We have a number of CLI scripts which we tweak to parse through the huge databases we receive. We can import approximately 1 million records every second.
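The general shape of such an import script can be sketched as follows. This is an illustration in Python, not our actual PHP scripts: the `email:value` line format, the function names and the batch size are all assumptions. The idea is to parse lazily, skip malformed lines, and hand rows on in large batches so that each insert covers many rows.

```python
from itertools import islice

def parse_line(line, delimiter=":"):
    """Split one raw line into (email, value); return None if it is malformed."""
    email, sep, value = line.strip().partition(delimiter)
    if not sep or "@" not in email:
        return None
    return email.lower(), value

def batches(records, size):
    """Yield lists of up to `size` records so each insert covers many rows."""
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk

def parse_dump(lines, batch_size=10_000):
    """Lazily parse an iterable of lines and group the valid rows into batches."""
    rows = (row for row in map(parse_line, lines) if row is not None)
    return batches(rows, batch_size)

# Hypothetical sample input; a real script would stream lines from a file.
sample = ["Alice@example.com:abc123", "not a record", "bob@example.com:xyz:789"]
for batch in parse_dump(sample, batch_size=2):
    print(batch)
```

Because everything is a generator, the whole file never has to sit in memory at once, which matters when the input runs to hundreds of millions of lines.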
Our front end is built with the staples of web development mentioned above. We use techniques such as AJAX that make searching easier and more fluid. This means that when you click search, rather than loading a new page every time, a script in the background makes a request to the web server, loads only the data required and renders it on the screen.
The server is a Debian Linux box running the Apache web server. We like to use open source technology wherever possible. Not only do we believe that Linux is more stable than Windows Server, but we find its environment more user friendly too.