Enrique G. Rodrigo, Juan A. Aledo, José A. Gámez.
Year: 2019, Volume: 20, Issue: 19, Pages: 1−5
As the data sets increase in size, the process of manually labeling data becomes unfeasible by small groups of experts. Thus, it is common to rely on crowdsourcing platforms which provide inexpensive, but noisy, labels. Although implementations of algorithms to tackle this problem exist, none of them focus on scalability, limiting the area of application to relatively small data sets. In this paper, we present spark-crowd, an Apache Spark package for learning from crowdsourced data with scalability in mind.