Lightning-Fast Python with Numba (Smart Sampling with Exclusion)
Oktay Sahinoglu
Jul 13 · Updated: Nov 4

A tool that can draw as many samples as we want from a large set, without repetition and with the ability to exclude arbitrarily chosen items, is needed in almost every area of data science. It can be used on numerical datasets, but it is also very useful in natural language processing (for example, selecting paragraphs from a corpus).
Well, NumPy already has a random choice tool, so what kind of contribution are we talking about here?
If you need to repeat this selection process many times on large datasets, the picture changes a bit. Say you want to train a reranker model and need to randomly select 50-100 items from your corpus for each question, but you don't want the golden context to appear among those selections, because you will add the golden context to the sample group manually, to be sure that every group includes it. You need to repeat this process for thousands of questions. At this point, although numpy's random choice does the job, it leaves room for improvement on two fronts: one is excluding the specified item (the golden context), and the other is speed.
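For comparison, the plain NumPy route might look like the sketch below. The helper name `sample_excluding_np` is illustrative: we materialize the allowed indices first, then sample without replacement.

```python
import numpy as np

def sample_excluding_np(n, k, excluded):
    # Regular NumPy route: build the pool of allowed indices,
    # then draw k of them without replacement.
    pool = np.delete(np.arange(n), excluded)
    return np.random.choice(pool, size=k, replace=False)

negatives = sample_excluding_np(40_000, 99, excluded=[123])
```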
Thanks to numba and numpy, we can address all the requirements effectively and efficiently as follows.
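Below is a minimal sketch of such a sampler, assuming we work with integer indices into the dataset; the function name `sample_excluding` and its exact signature are illustrative rather than the only way to write it:

```python
import numpy as np
from numba import njit

@njit
def sample_excluding(n, k, excluded):
    # Boolean mask marking indices we must never return;
    # it is also reused to reject duplicates as we draw.
    mask = np.zeros(n, dtype=np.bool_)
    for e in excluded:
        mask[e] = True
    result = np.empty(k, dtype=np.int64)
    filled = 0
    while filled < k:
        idx = np.random.randint(0, n)  # fast scalar draw, supported inside Numba
        if not mask[idx]:
            mask[idx] = True  # never pick the same index twice
            result[filled] = idx
            filled += 1
    return result
```

Calling `sample_excluding(40_000, 99, np.array([golden_idx]))` returns 99 distinct indices, none of which is the golden one.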
As you noticed, we are not using np.random.choice. Instead, we are using np.random.randint, which is faster. More importantly, we are using the @njit decorator from Numba. @njit, an alias for @jit(nopython=True), compiles the decorated function, which contains only NumPy and Python code, so that it runs entirely without the involvement of the Python interpreter. Together, these changes make the method run much faster.
Let's look at the contribution with measurements. Selecting 99 samples (plus 1 manually added golden context, for a total of 100) from a set of 40,000 samples while excluding the golden context, and repeating this operation 40,000 times (once per query), took about 2 hours the regular way versus seven and a half minutes this way. 16 times faster!
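If you want to reproduce a measurement like this on your own machine, a rough harness could look like the sketch below; the numbers will differ by hardware. Note the warm-up call, since the first invocation of an @njit function pays the one-time compilation cost:

```python
import time
import numpy as np

corpus_size = 40_000
sample_excluding(corpus_size, 99, np.array([0]))  # warm-up: trigger JIT compilation

start = time.perf_counter()
for golden in range(corpus_size):  # one draw of 99 negatives per query
    sample_excluding(corpus_size, 99, np.array([golden]))
print(f"elapsed: {time.perf_counter() - start:.1f} s")
```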
You can apply this approach to any type of data; however, it is strongly recommended to sample indices rather than the data itself, for both efficiency and accuracy.
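Concretely, sampling indices and mapping them back to the data afterwards might look like this (the variable names and the loader are hypothetical):

```python
paragraphs = load_corpus()  # hypothetical loader returning a list of paragraphs
golden_idx = 123            # position of the golden context in the corpus

idxs = sample_excluding(len(paragraphs), 99, np.array([golden_idx]))
group = [paragraphs[i] for i in idxs] + [paragraphs[golden_idx]]  # 99 negatives + golden
```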
Hope you like it! Happy coding. :)