Hi James, nice to meet you, could you introduce yourself to our readers please? 

I’m James Arthur, CTO of Hazy. I’m a software developer and entrepreneur by background, so I’ve been involved in a whole bunch of different companies and start-ups over the years. For example, I was CTO and Co-Founder of Opendesk, an open-source furniture start-up, before I co-founded Hazy with Harry and Luke.

Tell us a bit about Hazy as a business, what inspired you to set it up, and a bit about synthetic data: what it is, its applications, and what you guys do with it?

Hazy is a synthetic data company, and in a sense that’s a kind of privacy technology. The origins were that it was obvious to us that AI was eating the world: people were moving to do more and more data science, AI and ML, and you have these challenges around access to the data and general data management and preparation. So we looked into that and were interested in what tools we could provide to help companies work with their data responsibly and respect customer data privacy whilst actually helping [the company] use the data.

As a company, we went through some early iterations. We were initially focused on making an automated anonymisation API, but as we learned more about anonymisation and realised some of the limitations of the technology, we realised synthetic data was a better solution to the same kind of data privacy problems.

Synthetic data is artificial data, meaning it’s fake data made up of artificial data points. The kind of artificial data that Hazy generates isn’t just a case of “give us the structure of a database table and we’ll generate representative values”. What we do is take a source data set and analyse it, so you can see the distribution of values, the relations between the values in one column and another, and all the patterns and correlations in the data, and we basically replicate those in the synthetic data. What that means is that the artificial data we generate is smart enough to be used for data science and training models.
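To make that concrete, here is a minimal sketch of the general idea, not Hazy’s actual implementation: fit each column’s marginal distribution and the correlations between columns from a source table, then sample new rows that preserve both. The simple Gaussian copula approach and the column names are illustrative assumptions.

```python
# A minimal sketch (not Hazy's actual implementation): fit each column's
# marginal distribution plus the correlations between columns, then sample
# new rows that preserve both, via a simple Gaussian copula over two
# hypothetical numeric columns.
import numpy as np
import pandas as pd
from scipy import stats

def fit_and_sample(source: pd.DataFrame, n_rows: int, seed: int = 42) -> pd.DataFrame:
    # Rank-transform each column to uniform marginals, then to normal scores.
    normal_scores = source.rank(method="average").apply(
        lambda col: stats.norm.ppf(col / (len(source) + 1))
    )
    # Capture the cross-column structure as a correlation matrix.
    corr = np.corrcoef(normal_scores.values, rowvar=False)
    # Sample correlated normal scores, then map back through each column's
    # empirical distribution so the marginals match the source data.
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(source.shape[1]), corr, size=n_rows)
    u = stats.norm.cdf(z)
    return pd.DataFrame({
        col: np.quantile(source[col], u[:, i])
        for i, col in enumerate(source.columns)
    })

# Hypothetical source table: income correlates with age, and the synthetic
# sample keeps both the distributions and that correlation.
rng = np.random.default_rng(0)
age = rng.normal(45, 12, 1_000)
source = pd.DataFrame({"age": age, "income": age * 900 + rng.normal(0, 5_000, 1_000)})
synthetic = fit_and_sample(source, n_rows=1_000)
print(source.corr(), synthetic.corr(), sep="\n")
```

A production system would also need to handle categorical columns, more complex dependencies and privacy guarantees, but the principle of learning the statistics and then sampling from them is the same.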

Especially with all the tools that are available now, it makes sense. How has Synthetic Data evolved over the last few years?

Test data has been a part of data provisioning and data virtualisation for quite a long time. If you look inside a bank’s data architecture, you will find these kinds of data virtualisation systems, and there are companies like Informatica and CA.com who provide tooling to generate test data. That’s very useful for things like spinning up a development environment and running your unit tests against in a CI system. What you’re now seeing with synthetic data is that, rather than the data just being schema-compliant, it’s statistically representative, and having systems that are smart enough to preserve the information and the patterns inside the source data means that it’s now being used for data science and more advanced AI and ML workloads.

Has GDPR been an opportunity for you? 

It’s a big driver for people to take data privacy and responsible data management seriously. Interestingly, what we find when working with larger enterprises is that they already have extremely sophisticated data management practices. Data is a massive asset for them, and there’s no way that a bank could allow a data leak. For the people who have that level of sophisticated infrastructure, what we do is increase their data agility, or pace of data innovation, because all those factors, if you are treating data responsibly, can mean that it takes longer and is harder to actually work with the data.

What would you say has been your biggest success at Hazy so far?

We were chosen by Microsoft and Notion as the best AI start-up in Europe, so we won the $1 million Microsoft Innovate AI prize. It was a great accolade.

Hazy is around 15 people at the moment, and it seems that you haven’t gone into hyper-growth like a lot of start-ups do after gaining some investment. Have you focused more on product-market fit before scaling the team?

We have scaled out to 15 or 16 people post-seed stage, which is probably fairly representative. With the kind of product we’re building, we have both data science and development capability in house, so there’s quite a broad range: we needed a cross-disciplinary technical team. One of the key things that we’ve been focused on as a company for the last year, since raising a seed round which we built around that prize win, has been product-market fit: trying to make sure we’re making the right product rather than running away with technical solutions without the demand.

What’s been the biggest challenge over the last 2.5 years? 

Going from a broad understanding of the macro drivers, the need for data agility and responsible data management, and homing in on how you translate that into a willingness to pay from a defined role and budget within the kind of enterprises we’re targeting.

How would you price it? 

We have a base price for an enterprise installation and then we have some tiered usage levels, which basically map to the number of use cases the tech is being used for across the enterprise.

Stepping back, you need to look at value engineering and understand the business case. The value to an organisation today of being able to harness the latest data analysis, or improve its data agility and data innovation, is absolutely tremendous. Equally, avoiding the potential risk of data leaks and the reputational damage they bring can also be extremely valuable.

There’s definitely a challenge again in taking those broader concepts, where, if you’re a bank and you fail to innovate, your market share will be taken by a challenger, and translating that down into a direct budget.

Where do you see Hazy in the next few years? 

We’re on a very high growth trajectory, and we’re seeing an awful lot of demand for the kind of offerings and solutions that we can provide. We definitely see ourselves working with a high proportion of large financial services institutions, and we see Hazy as a systemic solution to safe data provisioning, built into the core data architecture of these kinds of companies.

Do you have competitors in this space? 

I think what we’re seeing is that there is a new sector emerging around synthetic data, and there are a number of competitors popping up, both rival software vendors and some of the legacy privacy technology providers looking to move into the space.

I think the challenge for some of those competitors is keeping pace with the data science innovation and being able to solve some of the harder problems. The fundamental data science challenge is being able to generate representative, high-quality synthetic data for almost any type of data set and any use case, which is extremely challenging. So anybody moving into the space does need cutting-edge data science capabilities.

Do you think that’s what differentiates yourselves from the competitors? 

Exactly. Hazy’s core data science capabilities and the command we have over the quality of the synthetic data is, at the moment, at the front of the market.

What does a Data Scientist look like at Hazy? What kind of background do they have? 

We have a range, from people focused on research tasks through to those doing more production ML engineering. On one hand, we have one of the world’s experts in some of the deep-learning algorithms we’ve been using, who’s been working in that field for some 25 years! And then, on the other hand, we have some great, super smart data scientists who come out of our UCL connection… as well as some highly experienced engineers who’ve worked at companies like Oracle and Facebook.

Not a bad team to have! 

Yeah, the tech team at Hazy is really quite exceptional and for me it’s really just a privilege to be working with them.

One thing that always comes up in these interviews with Founders is that I always ask for advice: either looking back at yourself when setting up your first start-up… or for anyone setting up a deep-tech start-up in London. What advice would you give them?

Well, I think London is a great place to start a deep-tech start-up; I think that’s widely regarded as one of the current investment hypotheses. Even US investors are looking to Europe for more capital efficiency, greater staff loyalty, and a better academic base.

One of the things I recommend when just starting out is to look at accelerators as a good source of initial funding. In terms of raising further investment, or going into VC, you have to focus on investability. That comes partly from the scalability of the market and the problem you’re trying to solve, and it definitely comes from team composition: you need to have investable, committed founders.

Then, in terms of what to actually focus on as a founding team, I think one of the things that technical founders get wrong is focusing too much on building products or technology early, whereas really you have to double down on product-market fit and customer development.

There’s a whole load of great material on the web about approaching product-market fit systematically, and it’s been very helpful for Hazy. We’ve received some great coaching from the platform team at Notion.vc, and that’s been extremely helpful in giving us the tools to understand what we should be focused on as we work through customer definition, value proposition, and pricing to impact.

In a sense, focusing ruthlessly on understanding what we should be building has helped us get to a position where we have a clear value proposition and market segment, where the market has come to us and is now biting our hand off for the technology.

You went from a SaaS API to an Enterprise level product. What inspired you to make that switch?

When we first set up the business, we were interested in democratising access to privacy technology, and we’d seen simple first use cases, for example providing an anonymisation API, a bit like Google’s data loss prevention API, that kind of technology. But what we increasingly realised as we dug into it is that it’s larger enterprises who have the bigger problem, with the layers of control and the time it takes them to work with their data. Basically, the bigger the company is, the more enmeshed in red tape they are!

Because of that, homing in on the right customer definition allowed us to realise that we needed to be building on-premise enterprise software. That’s a great example of those steps where you have to understand the problem you’re solving, and for whom. Who has that problem the most? Who has a burning problem where there will be a strong willingness to pay? Once you have that, the details of what you need to develop technically, or the way you need to productise your core capabilities, will emerge.

You have to think about budgets as well. Enterprises will have the money to spend on tech. 

Obviously, there are trade-offs: large companies with large budgets mean longer sales cycles. I’m not saying that you always need to focus on an enterprise market; you just need to figure it out on a case-by-case basis for the company.

Look at the problems they face, develop a product that solves those problems, and price it accordingly.

From my perspective as a CTO, I feel that people often regard it as a technical role, when in reality the CTO role is primarily a bridging commercial and product role, and I think that a lot of Founders don’t understand that.

If you’re not in the commercial process, working the sales process and drawing out the insight from customer discovery, you won’t have the understanding of what you need to make, and that understanding is what you need to bring to the product and development side of the role.

Back to Synthetic Data, what are you most excited about? Any new developments or research that excites you?

Very much so. I think we see with synthetic data that, in a sense, how companies use data is going to shift from working with data sets to working with models and generators.

For example, in the financial sector, if you look at something like trying to prevent fraud, it can be useful to share insight across data sets, but you have to be careful about how you create a data-sharing architecture to facilitate that. What you have with synthetic data is, first of all, the ability to share artificial data sets, because they’re not so sensitive. But also, with synthetic data generators, rather than sharing the data you can share the generator models themselves, and that provides the basis of new architectures for how you share insight across these boundaries.
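As a rough illustration of the “share the generator, not the data” idea (a toy sketch, not Hazy’s architecture): one party fits a small generator to its private data and ships only the fitted artefact, and the receiving party samples synthetic rows from it locally. The class and its parameters here are assumptions for the example.

```python
# Toy sketch of sharing a fitted generator instead of raw data.
import pickle
import numpy as np

class GaussianGenerator:
    """Toy generator: stores only means and a covariance matrix, never raw rows."""
    def fit(self, data: np.ndarray) -> "GaussianGenerator":
        self.mean_ = data.mean(axis=0)
        self.cov_ = np.cov(data, rowvar=False)
        return self

    def sample(self, n_rows: int, seed: int = 0) -> np.ndarray:
        rng = np.random.default_rng(seed)
        return rng.multivariate_normal(self.mean_, self.cov_, size=n_rows)

# Data holder: fit on private data, serialise only the generator artefact.
private_data = np.random.default_rng(1).normal(size=(10_000, 3))
artefact = pickle.dumps(GaussianGenerator().fit(private_data))

# Receiving party: load the artefact and generate synthetic rows locally,
# without the original records ever crossing the boundary.
generator = pickle.loads(artefact)
synthetic_rows = generator.sample(n_rows=500)
print(synthetic_rows.shape)  # (500, 3)
```

In practice you would also want privacy guarantees on the fitted model itself, for example differential privacy, before sharing it across an organisational boundary.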

I think also, with some of the deep-learning technologies, for example generative adversarial networks, you have capabilities like transfer learning and conditional generation that allow you to transform data sets. That almost gives you core statistical control over the output data, and I think the kind of systems you can build with that kind of technology are going to be extremely important in the future.
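For a sense of what conditional generation looks like, here is a minimal, untrained sketch of a GAN-style conditional generator in PyTorch: the network takes a noise vector plus a class label, so once trained you can ask it for synthetic rows of a chosen class, for example fraudulent versus legitimate transactions. The layer sizes and class labels are hypothetical, and the adversarial training loop with a discriminator is omitted.

```python
# Minimal conditional generator sketch (untrained, illustrative only).
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim: int, n_classes: int, out_features: int):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)  # learnable label embedding
        self.net = nn.Sequential(
            nn.Linear(noise_dim + n_classes, 64),
            nn.ReLU(),
            nn.Linear(64, out_features),
        )

    def forward(self, noise: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Condition the generator by concatenating the label embedding
        # onto the noise vector before mapping to synthetic features.
        return self.net(torch.cat([noise, self.embed(labels)], dim=1))

generator = ConditionalGenerator(noise_dim=16, n_classes=2, out_features=8)
noise = torch.randn(4, 16)
labels = torch.tensor([0, 0, 1, 1])  # hypothetical: 0 = legitimate, 1 = fraudulent
synthetic_batch = generator(noise, labels)
print(synthetic_batch.shape)  # torch.Size([4, 8])
```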