One of the biggest challenges when developing institutional knowledge is the consolidation of many different datasets into a single environment.
Over the past two months, we have been working with our clients to integrate and map datasets into their reKnowledge environments. From software logs to OFAC sanctions data, from medical-trial data to data dumps from far-right websites, we had to deal with vastly different requirements across very different domains.
Typically, for any of these datasets, it can take weeks for developers to make sense of the data and its structure, let alone standardise it and map it into a new form that can be leveraged by domain experts.
Furthermore, at this stage, data is rarely available in an easy-to-consume form. As a result, the domain experts who will eventually use the data have no easy way to get involved in the early phase of data preparation. This lack of domain-expert input is highly problematic, because developers often need to make assumptions about the analytical value of certain information.
reKnowledge is changing that.
Take the example of a well-known Neo-Nazi website. In November 2019, a group of hackers leaked the entire database behind the site.
From an investigative perspective, this is a huge treasure trove of information. However, the raw data was not in a format domain experts could use directly.
The data dump contained the entirety of the site’s data and information. This includes usernames, email addresses, the users’ IP addresses, all the posts and comments as well as private messages.
Furthermore, much of the qualification data (gender, date of birth, ideology, etc.) was entered by the users themselves, giving analysts unique insights into these members.
All in all, the dump comprised over 100 different files and tables. Among them were numerous technical logs with little or no analytical value. Furthermore, there was no documentation of the overall database structure, which prevented us from easily connecting the various tables to each other.
Thanks to our proprietary technology, we were able to analyse the raw data very quickly and then define a remapping and clean-up to organise the data in an analytically meaningful way.
As part of the raw data analysis, we needed to understand the information structure (i.e. which tables were related to each other, which fields connected them, etc.), identify the fields with analytical value, and analyse the nature of each field (what type of data it held, whether it was standardised, and so on).
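The first of these steps, getting a feel for what each table and column actually contains, can be sketched as a small profiling helper. This is an illustrative example only: the file names and formats are assumptions, and reKnowledge's internal tooling is not shown here.

```python
import csv
from collections import Counter

def profile_table(path, sample_size=1000):
    """Summarise each column of a raw CSV table: fill rate, distinct
    values, and the most common example values. (Illustrative helper;
    the real dump mixed several formats.)"""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # Sample the first rows rather than loading a huge dump in full.
        rows = [row for _, row in zip(range(sample_size), reader)]
    profile = {}
    fields = rows[0].keys() if rows else []
    for field in fields:
        values = [r[field] for r in rows if r.get(field)]
        profile[field] = {
            "fill_rate": len(values) / len(rows),   # share of non-empty cells
            "distinct": len(set(values)),           # rough cardinality
            "examples": [v for v, _ in Counter(values).most_common(3)],
        }
    return profile
```

A quick look at fill rates and example values is often enough to separate columns with analytical value from technical noise, and to spot which unlabelled fields are worth showing to a domain expert.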
Once the raw data had been thoroughly analysed, we were able to programmatically re-organise and map the data into our client’s own reKnowledge environment.
Typically, to do such an exercise, a developer would write a bespoke script to transform and remap the data. However, writing such scripts takes a vast amount of time and can get incredibly complicated with certain datasets. Besides, if the remapping does not fit the analysts' needs, it takes even longer to rework.
With our technology, however, we can transform the knowledge programmatically by providing a configuration file. Instead of hardcoding every transformation step in a script, we write a configuration file that defines the transformation and remapping rules.
This approach has the dual benefit of a) massively reducing the time it takes to transform and map data, and b) letting us amend and refine the data-preparation steps in collaboration with the domain experts.
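Conceptually, the configuration-driven approach can be sketched as follows. The rule format, table names, and field names below are invented for illustration; they are not reKnowledge's actual configuration schema.

```python
# Illustrative remapping rules: changing the output of the pipeline only
# means editing this configuration, not rewriting transformation code.
MAPPING_CONFIG = {
    "members": {                      # source table in the dump (assumed name)
        "node_class": "Individual",   # target node class in the graph
        "fields": {                   # source column -> target attribute
            "username": "name",
            "email": "email",
            "field_11": "ideology",
        },
    },
}

def apply_mapping(table_name, rows, config=MAPPING_CONFIG):
    """Turn raw table rows into typed nodes according to the rules."""
    rule = config[table_name]
    return [
        {"class": rule["node_class"],
         **{target: row.get(source)
            for source, target in rule["fields"].items()}}
        for row in rows
    ]
```

Because the rules live in data rather than code, a domain expert's feedback ("rename this field", "drop that column") translates into a one-line configuration change instead of a script rewrite.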
For instance, a user's IP address could be stored as an attribute of the individual member. If the user has multiple IP addresses (because they use a VPN or multiple devices), we can store this information as a list, too.
However, storing the IP information as a node attribute may mean missing interesting analytical insights.
With our technology, it is just a matter of tweaking the configuration file to turn the information from an attribute of a node class into a node class of its own, linked to the individual by a relationship.
Transforming the data in this way enables our customers to find new insights, such as which IP addresses are shared by multiple users.
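The difference between the two representations can be sketched like this. The record shapes are hypothetical, not reKnowledge's data model:

```python
def ips_as_attribute(member):
    """Variant A: IPs stay a list attribute on the member node; an IP
    shared by two members is invisible unless you scan every list."""
    return {"class": "Individual", "name": member["username"],
            "ip_addresses": list(member["ips"])}

def ips_as_nodes(member):
    """Variant B: each IP becomes a node of its own plus a USES
    relationship, so a shared IP shows up as a shared neighbour."""
    person = {"class": "Individual", "name": member["username"]}
    ip_nodes = [{"class": "IPAddress", "address": ip} for ip in member["ips"]]
    edges = [{"type": "USES", "from": member["username"], "to": ip}
             for ip in member["ips"]]
    return person, ip_nodes, edges
```

Variant B is what makes "which IP addresses are shared by multiple users?" a one-hop graph query rather than a full scan over every member's attribute list.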
Another example of how bringing domain experts into the data-preparation stage is useful was the ideology field.
In the raw data, one table had a series of fields with no name other than field_x. To understand what each field meant, we needed to look at its underlying data points. To a domain expert, it was obvious that field 11 referred to ideologies:
Knowing that, we were able to configure our mapping tool to create an “ideology” field on the individual's scorecard and populate it with the value of field 11.
Once the entire Neo-Nazi website data was processed and imported into reKnowledge, the domain experts could get to work in earnest.
First, using the analytical workbench, analysts can start exploring the data iteratively. For instance, we started by looking at whether any members were connected to the UK.
Nine members had reported their location as the UK (other British members used different labels, such as “London, UK”, “GB”, etc.).
Next, we asked the system to show us all the connected information we had about those nine individuals.
This query returned quite a bit of information. Of the nine members, we could identify four very active users (individual nodes connected to a lot of posts). In addition, two individuals did not post anything but accessed the website from many different IP addresses and devices.
While it is not immediately clear why those users connected to the website from so many different IP addresses and devices, a detailed look at their profiles shows that both were active on the website only for a short time: about two months in the fall of 2017.
Searching for any nodes connected to those IP addresses and devices, we also noticed that one of the users shared numerous IP addresses with other members of the website.
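The pivot used here, IP addresses shared by several members, amounts to grouping the member–IP relationships by IP address. A minimal sketch, with illustrative pair-shaped input rather than reKnowledge's query engine:

```python
from collections import defaultdict

def shared_ips(uses):
    """Given (member, ip) pairs, return the IPs used by more than one
    member: exactly the pivot points worth investigating further."""
    members_by_ip = defaultdict(set)
    for member, ip in uses:
        members_by_ip[ip].add(member)
    return {ip: members for ip, members in members_by_ip.items()
            if len(members) > 1}
```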
While none of these initial insights are smoking-gun evidence of anything special, they nonetheless open avenues for further investigation.
For instance, the next step here would be to further investigate the IP addresses that are used by several different members, something that becomes child's play with the reKnowledge web-browser add-on. However, this post is about data integration rather than online research, so let's further explore the benefits of mapping information and data into reKnowledge.
One of the biggest strengths of reKnowledge is that it is effectively a highly customisable database on steroids. As a result, domain experts can build incredibly powerful queries using only their logic and our intuitive interface.
For instance, to understand which ideologies are shared between US-based female members and European-based female members, we created the following query:
Which returned the following result:
The same query run for male members returns substantially more information.
Just by looking at the graph, analysts can spot which ideologies are most shared across American and European members. Not surprisingly, fascism, national socialism and nationalism ranked as the most shared ideologies on this Neo-Nazi website.
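In relational terms, the query above boils down to a set intersection. A sketch under assumed record shapes; the real query is built interactively in reKnowledge, not in code:

```python
def shared_ideologies(members, region_a, region_b, gender=None):
    """Ideologies reported by members in both regions, optionally
    filtered by gender. Member records are illustrative dicts."""
    def ideologies_in(region):
        return {m["ideology"] for m in members
                if m["region"] == region
                and (gender is None or m["gender"] == gender)
                and m.get("ideology")}           # skip empty ideologies
    return ideologies_in(region_a) & ideologies_in(region_b)
```

Dropping the gender filter is the one-parameter change that turns the female-members query into the male-members one, which is why the graphical equivalent takes seconds rather than a script rewrite.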
Likewise, if you are interested in which ideologies are backed by people posting about Golden Dawn, you can run a query that looks for any post containing “Golden Dawn” posted by individuals who have a stated ideology.
This query returns a lot of post–individual–ideology clusters, which makes sense since we have not yet standardised the ideology names. Typically, such work is done in conjunction with the domain experts, who define the grouping policies.
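Such a grouping policy is essentially a mapping from free-text variants to canonical labels. A minimal sketch, with an invented and deliberately incomplete policy; in practice, the domain experts decide which variants belong together:

```python
# Hypothetical grouping policy: the variants and canonical labels below
# are invented for illustration, not taken from the actual dataset.
IDEOLOGY_GROUPS = {
    "national socialism": "National Socialism",
    "national socialist": "National Socialism",
    "ns": "National Socialism",
    "fascism": "Fascism",
    "fascist": "Fascism",
}

def normalise_ideology(raw):
    """Collapse a free-text ideology label to its canonical form,
    falling back to title case for variants the policy misses."""
    key = raw.strip().lower()
    return IDEOLOGY_GROUPS.get(key, raw.strip().title())
```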
Nonetheless, two ideologies have shared members who have posted about Golden Dawn, and they are once again National Socialism and Fascism.
And here again, the power of reKnowledge is plain to see. Not only can domain experts query their knowledge through the advanced search, they can also interact with the resulting data. For instance, we may want to understand in which locations and time zones those individuals are based. By filtering out the ideology and post nodes, we can select all our individuals and add them to our advanced search query.
Analysis and investigation are inherently iterative: analysts search down a lead. If they find interesting insights, they can easily save their graph or export it as a JPEG. If they find nothing, they can quickly move on by removing the irrelevant information and focusing on the next lead.
Finally, analysts can export the newly structured data to client-facing systems or their favourite data-visualisation software to further their investigation.
By Julien Grossmann and David Costa Faidella