Šimon Mandlík, Matěj Račinský, Viliam Lisý, Tomáš Pevný.
Year: 2022, Volume: 23, Issue: 298, Pages: 1−5
Standard machine learning (ML) problems are formulated on data converted into a suitable tensor representation. However, there are data sources, for example in cybersecurity, that are naturally represented in a unifying hierarchical structure, such as XML, JSON, and Protocol Buffers. Converting this data to a tensor representation is usually done by manual feature engineering, which is laborious, lossy, and prone to bias originating from the human inability to correctly judge the importance of particular features. JsonGrinder.jl is a library automating various ML tasks on these difficult sources. Starting with an arbitrary set of JSON samples, it automatically creates a differentiable ML model (called hmilnet), which embeds raw JSON samples into a fixed-size tensor representation. This embedding network can be naturally extended by an arbitrary ML model expecting tensor inputs in order to perform classification, regression, or clustering.