You’ve been with Expedia for nearly 18 months now; how has your time with the company been so far?

Amazing. It’s just been fantastic. The mixture of the culture, the technology, and the push for innovation is something I’ve always been looking for in previous roles. I’ve worked in companies before that have had some aspects of that, but not all of them together. There is also a real push to be the best, not only on a business front, but without being constrained by a fear of using different things like open source, or of using technologies that don’t have a 15-year pedigree.

If you work as a data engineer at IBM, for example, you may only be using internal products. They might say, ‘Let’s look at Hadoop, but let’s make sure it’s a real managed service offering like Cloudera’, or any of those services that use open source but manage it for you, because there is this fear of using open source for a commercial product, you know, ‘in case it goes wrong’. Whereas here we have a real drive to use community technology: let’s see what people are using and doing out there, let’s look at the fantastic array of new technologies, because that’s where the exciting things are.

That sounds excellent. So what technology stack are you operating at the moment?

We have been on on-premise Hadoop for a while, and we are looking to move over to the cloud with a multi-cloud strategy. We have a big Hadoop cluster which is shared by the whole of Expedia, with a lot of ETL jobs running all the time, a follow-the-sun, 24/7 kind of thing. We’ve got the US team running things too, and we find that running while they are asleep is fine, but as soon as the US wakes up it slows right down.

We’re also running on batch windows. We want to put our stuff in at certain times of the day so it’s ready for the morning, and that also competes for resources. In theory, moving to the cloud allows you to just spin up a cluster for a single job: if you want to go up to 200 nodes, or you want to stick to 4, you can do that depending on how much you want it to cost. That’s the fantastic side of it.
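As a rough sketch of what that per-job sizing might look like, here is a minimal, illustrative request builder for a transient cluster. The names, instance type, and fields are hypothetical (loosely modelled on EMR-style options, not Expedia’s actual setup), and the real submission call to a cloud SDK is deliberately left out so the sketch stays self-contained:

```python
def transient_cluster_request(job_name, node_count, instance_type="m5.xlarge"):
    """Build a config for a cluster that exists only for one job's lifetime.

    Illustrative only: in practice this dict would be handed to a cloud SDK
    (e.g. something like boto3's EMR run_job_flow), which is omitted here.
    """
    if node_count < 1:
        raise ValueError("need at least one node")
    return {
        "Name": f"{job_name}-transient",
        "Instances": {
            "InstanceCount": node_count,
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            # The key idea: tear the cluster down when the job finishes,
            # so you pay only while the job runs.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
    }

# The same job can be run small or large just by changing one number:
small = transient_cluster_request("nightly-etl", 4)
big = transient_cluster_request("backfill", 200)
```

The auto-terminate flag is what makes the 4-versus-200-node choice a pure cost decision rather than a capacity-planning one.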

But if you then say, OK, let’s just look at AWS for example: again, fantastic products on there, but do you really want to put all your eggs in one basket and only use that? When you’re looking at Google or any of the others, it is more about staying free. You might think, ‘We’ve done this for 6 months, but let’s make sure all the code is technology agnostic’, within reason, so that you can lift it up and run something on Google Cloud rather than in AWS or in Hadoop. If it is the same code, it can pretty much run across all of those, so you can do more things.
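One minimal way that technology-agnostic idea can be sketched (bucket and namenode names here are invented for illustration) is to push the only platform-specific detail, the storage scheme, behind a single setting, so the job logic itself never mentions a cloud:

```python
# Sketch: only this mapping knows which platform we are on.
# All the URIs below are made-up examples, not real endpoints.
SCHEMES = {
    "hadoop": "hdfs://namenode/data",
    "aws": "s3://example-bucket/data",
    "gcp": "gs://example-bucket/data",
}

def data_path(platform, dataset):
    """Resolve a dataset name to a platform-specific URI."""
    try:
        root = SCHEMES[platform]
    except KeyError:
        raise ValueError(f"unknown platform: {platform}")
    return f"{root}/{dataset}"

# The same job code runs unchanged when the platform setting flips:
print(data_path("aws", "bookings/2018"))
print(data_path("gcp", "bookings/2018"))
```

Flipping `"aws"` to `"gcp"` is then a configuration change rather than an 18-month migration, which is the point being made above.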

So when the Data Science team comes back and says, ‘I really want to try these new machine learning algorithms GCP has got running on Google’, it becomes a great opportunity to try it. Your hands aren’t tied; you’re not thinking it’s going to take us 18 months to migrate what we’ve built onto Google.

So do you think speed to production is the main benefit to come out of the advancements in cloud-based software?

Certainly. But I still think the major one is in your development lifecycle: you can have a developer spin up a cluster that is just right for them, and you haven’t got any conflict or worry about environments any more. Your QA team can again have their own environment. Basically, if it is all done right, they can essentially flick a switch and have an environment loaded up for them, do their testing, then kill the environment. They have their results, and they don’t have to worry about killing anyone else’s code that’s running on the same environment. Or you might have situations where 3 or 4 code levels are being tested at the same time. I’ve worked in places where development is ahead and UAT is still lagging behind, but you can’t load the two of them onto the same environment, so you are blocked on the new stuff until the old stuff is finished.

I was at an event recently where the concept was to try to put a definition on various job titles within data. So, to you, what is the definition of a Data Engineer?

I think it’s a new term, but I actually really like it. What has happened is you could be a data warehouse developer, or you could be a Hadoop engineer, but now data is so big there isn’t really one term for all of these people; they all work in data, so ‘Data Engineer’ is, I guess, the term that got put together. The main things I have found: we’ve got warehousing guys who really know the data, and we’ve also got the Hadoop engineers who come from the big data side, who are maybe more focussed on Java, and now Scala and Spark. They blend really well, because you have the warehousing ETL guys who really know all the data but can’t necessarily put all the new processes together, and the Hadoop guys who don’t understand the data as much as the warehousing guys, so they are kind of learning from each other.

The third piece is DevOps. Because we are moving to the cloud, we are moving away from a centralised team who did DevOps for us, to maybe having to understand how to spin up your own cluster and handle more of the ops side of things. Maybe not to the full extent of setting up a whole network or VPC, but understanding how to launch a cluster or how to load the data on. Those 3 things are what the data engineer is doing.

What impact can having a data engineering function have on a company’s overall data strategy? We see companies that make a ‘Data Scientist’ their first hire, or sole hire, in their analytics strategy, but how does having a focussed engineer impact that?

So Data Science to me is like the icing on the cake, where engineering is the actual cake. They can do all their amazing things with their algorithms, but they need someone to get all the data in the first place, and it’s got to be in a format that is readily accessible for them. We’re creating all the core data, but also giving them the speed to pull, say, 5 years of data from hits or something like that. It’s more and more data, and for those guys it’s about finding patterns; the algorithms have to come from looking at huge data sets, and if those data sets are nonsense or poor quality, it’s impossible to get valid trends. You have to have that core, stable data set first for data science to work effectively.
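That ‘core stable data set first’ point can be sketched as a minimal quality gate. The field names below are purely illustrative; the idea is simply that engineering filters out rows data science should never train on:

```python
def quality_gate(rows, required=("user_id", "timestamp")):
    """Split rows into usable and rejected, based on required non-empty fields.

    'user_id' and 'timestamp' are hypothetical field names chosen for the
    example; a real pipeline would also check types, ranges, and freshness.
    """
    good, bad = [], []
    for row in rows:
        if all(row.get(field) not in (None, "") for field in required):
            good.append(row)
        else:
            bad.append(row)
    return good, bad

rows = [
    {"user_id": "u1", "timestamp": "2018-01-01T00:00:00"},
    {"user_id": "", "timestamp": "2018-01-01T00:05:00"},  # missing user: reject
]
good, bad = quality_gate(rows)
```

Only `good` would be handed downstream; the `bad` rows are the ‘nonsense or poor quality’ data that would otherwise poison the trends.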

What does 2018 look like for Expedia, with all of those advancements?

We are certainly continuing with the migration to the cloud, and not just your typical lift and shift directly into AWS; we’re looking to have the time and resource to redesign as we go, and to push forward with Spark for greater speed. We’re looking at new products, new visualisation tools like MapD, and we’ve done POCs on a lot of things. MapD is interesting as it’s based on GPUs, not CPUs, and you can actually push billions of rows into memory and slice the data really quickly. It’s all in memory on a single machine in AWS, so if you have that at the end of your process, instead of pulling into Tableau from Teradata, maybe you have that underneath to give you that extra flexibility. It should be instantaneous, which, like you said, is the speed of analytics. It also gives someone the opportunity to stumble across insight they would never have found because it was just too painful to cut the data a certain way; hopefully they will be able to say, ‘Maybe I can try this’, because it’s so efficient.

Was that ambition one of the reasons the opportunity here at Expedia was attractive to you?

Yes. Expedia is much further down the line with the Hadoop ecosystem than my previous role was. Expedia have been using big data for many, many years, whereas my previous role was more Netezza-based warehouses moving to Ab Initio for the ETL process. They had a big data cluster which they were doing some stuff with, but it was more of a POC; they had some interesting things running on there, and a big data team, but it was more the old warehouse. I think they would like to move over to the Hadoop ecosystem, but there was the worry about open source, and about getting the right people in, so there was certainly a big push for me to move into that area.

It’s a big jump for any company if you are an established team doing a great job. What tends to happen is that it works; we live in a world of data, so you can have something that works brilliantly, but then it’s like, we need to get just a little bit more, then another little bit more, and the whole platform starts creaking.

What’s the next step? Do you buy a bigger one? Trade in the old one? If you have a data centre, do you add some more machines to it and get bigger and bigger? Or do you look to something like the cloud, where in theory you can scale infinitely?

Obviously it costs you money, but you don’t need an engineer to come in and install a new machine. It’s just ease of use. Once you’ve been down the cloud path, you think: ‘Why did we ever buy these things in the first place?!’ I think it is an inevitable jump.

What advice would you give to someone making that jump?

Yeah, I would say have a real, genuine look at it first. One of the stopping points for this was security; it has always been: ‘OK, I’ve got to give my data to someone else.’ And I do get it, it is terrifying. But security has advanced a long way, particularly in AWS but also in Google: you can set up your own VPC, which is purely your own private network in that cloud. It takes a little bit more infrastructure work, and if you’ve been relying on another team to do your ops for you it can be tricky, but your engineers will learn more, and there is more scope to move to new things if you need to upgrade to a new version of anything like Hive or Redshift. AWS takes care of all of that. There are no outage periods, none of that. At Sky, if we wanted to upgrade we had to book it in months in advance, whereas with AWS you take the latest version of anything, or you can choose the version you want.

What drove you toward data engineering as a career?

It is what I’ve always done. I left university and went straight to IBM, and straight away I was working on DB2 and ETL processes. I enjoyed matching the warehouse work to what you were seeing in the main warehouse. We used to have a process I designed that tracked parts coming into our local factory: a part would come from the customer and move through to whichever station it was at, all tracked by barcode. Putting that into an ETL process and having management reports based on that information was really interesting. It was also extremely useful, and I got the bug from that.

It is the ability to take something that seems fairly innocuous and draw valuable insight from it, and we were seeing that back then, much more efficiently than someone going around with a clipboard!

And I think most Data Engineers want to know they are adding value, rather than being told, ‘Here’s a project, just crack on and do it.’ It is about sitting down with the business, understanding their needs, and talking to them about what we can do to help, then starting to work out how to code it all. That gives it a much greater sense of purpose and direction, and it also adds an element of creativity and problem solving, as opposed to just sitting with my headphones on coding. It is more like being part of something.

What advice would you give to someone aiming to become a Data Engineer?

One question I like to ask is: ‘If you have an hour’s downtime at work, what do you do with it?’ I like people who say, ‘I’m really interested in…’: someone who has gone out and researched something, or looked into something new, or attended a conference or meetup on that subject. Or set up their own AWS account, and maybe doesn’t understand everything yet but is on the way to it. That is really important, because it shows genuine interest, and that is someone who will eventually add value to the team. That interest is what you can’t teach.