CISO Panel Service Mesh Day 2019

Transcript

Matt Klein

Okay. Sure. Cool. So this is an end user panel. So while I read the questions that I was just sent, I’m going to let everyone introduce themselves.

Snow Pettersen

Okay, great. Okay. Uh, yeah. Hi, I’m Snow. I work with Square on their traffic and observability team, uh, been working on building out, um, or migrating us towards using an Envoy service mesh for the last year or so. Uh, I’m also an Envoy maintainer.

Ryan Michela

Hello. I’m Ryan. I’m at Salesforce. I work on our service mesh team. Uh, we’ve been building a service mesh for about two and a half years now and have really been through service meshes from the beginning and have, you know, kind of learned a lot of interesting lessons along the way.

Ben Plotnick

I’m Ben, I’m from Yelp. Uh, I’m on our engineering effectiveness team. Um, we do a lot of things, but one of the things that we do is, um, deal with service meshes, I guess.

Matt Klein

Okay,

Cool. Thank you. Um, so the first question is, uh, I guess, could you briefly describe to everyone, uh, what types of workloads are you currently running? And maybe stuff that you aren’t running, but you’re planning on running and I don’t know, maybe just talk a little bit about why the service mesh is useful for running those workloads.

Snow Pettersen

Um, so for us, we’re using it for service-to-service traffic. Um, our main goal for using it was to simplify the, um, the story for talking between services across all our different languages and applications. So we have hundreds of services written in like three or four different languages. We wanted to, um, give all the app owners a consistent way of doing service-to-service traffic. Um, so that’s the stuff that we’ve been working on and rolling out over the last year. Um, we’re also starting to look at, um, putting it on the edge. Um, uh, right now we’re running like a homemade Netty-based reverse proxy. And we’re kind of looking into, um, replacing that eventually with, um, uh, Envoy to kind of, uh, complete the mesh and use it everywhere.

Ryan Michela

Sure. Today at Salesforce we use Envoy for a lot of our east-west traffic inside of our data centers. Um, we have many properties at Salesforce from a whole wide range of acquisitions and we’re looking to bring them all together into service mesh. Our main CRM application, we use service mesh underneath to hook all of the microservices together along with our monolith and get all these different components working together seamlessly. We use it for all of the reliability and observability and survivability features that you get out of Envoy. It’s important to us to move all of that kind of stuff out of application code, as was talked about earlier, so your app devs don’t have to think about retries and circuit breakers and networks and all the things that can go wrong. Uh, additionally, we’re working towards making Envoy our north-south traffic as well, getting it up to our edge gateways and such.

Ben Plotnick

Um, so at Yelp we actually have kind of an older service mesh from, uh, I guess before that term was coined, uh, based on HAProxy. Uh, it’s actually worked pretty great over the past few years and it’s gotten us, uh, from our monolith to, uh, now we probably have too many services. Um, uh, but, uh, we decided about a year and a half ago that, uh, we’ve kind of stretched that to the limit. We’ve extended our existing HAProxy, uh, SmartStack-based, uh, solution to have fault injection and, you know, locality-based routing and whatever. But, uh, we’ve kind of stretched it to the limit and Envoy is now at a point where it kind of does all of this stuff really easily and very, um, reliably. So we’re swapping that out, uh, for service-to-service traffic. For edge traffic, um, we’d like to look at Envoy to, um, to do this, but, um, it’s not really a priority right now. I guess, yeah, it’s not going to give us that much more than what we have already.

Matt Klein

Cool. I was sent these questions, but I’m just going to go totally off script. So, um, I’m, yeah, so I’m curious, so we’ve talked obviously about synchronous traffic. I’m just curious, are there any use cases where you would want to do something like Kafka or something through, through Envoy? Um, it’s come up. I, you know, and you can say no, I’m just curious if that’s come up, uh, in terms of the synchronous versus async bridging, you know, if that would be an interesting thing or not.

Snow Pettersen

Um, it has not come up for us really. Um, we already have a lot of infrastructure built around supporting all of the various protocols, so we haven’t really looked at Envoy for it.

Matt Klein

Okay.

Ryan Michela

We’ve been following the, uh, the PR… We’ve been following the work item for adding Kafka support to Envoy diligently.

Matt Klein

But is there something that you would like? What would you want it to do in a perfect world, I guess, because there’s so many things that it could do.

Yeah.

Ryan Michela

Our customers, the people that come to us on the service mesh, we want them to be able to find, uh, find Kafka and find the right Kafka using the same kind of abstract service discovery names that they use to find their services. Um, you know, the great thing about the service mesh is it helps you find your resources without having to be tied to DNS and be tied to the network infrastructure. But then suddenly you want to go use Kafka or a message broker of any kind. And oh, now you’ve got to know the IP address. Now you’ve got to know the ACLs. Now you’ve got to know, so you lost all that magic of the service mesh.

Matt Klein

Cool.

Ben Plotnick

Uh, yeah, nobody’s actually asking for that. But I think, uh, I guess what I hear a lot is that, um, even though nobody’s asking for it, once we give them these things, they love it. So, um, I think for Kafka specifically and various protocols like that, uh, stats are kind of the big thing that we’re missing right now. Um, uh, actually kind of just standardizing stats for even the HTTP traffic, uh, is, uh, something that people are looking forward to. Um, we actually do have a lot of our database traffic running through Envoy right now just because Envoy handles, um, uh, I guess persistent connections and, uh, kind of handles the service discovery part way better than what we had before. So yeah, we are starting to send more protocols through Envoy.

Yeah. Kafka would be really cool.

Matt Klein

Cool. Um, all right. Okay.

Ryan Michela

We have a mandate to mutual TLS all the things and Envoy is by far our easiest path to mutual TLS because Envoy deals with it and it’s not up to all of the different databases and other third party products that don’t necessarily have good MTLS support built in. Like we don’t have to try to shoehorn MTLS into old open source. We can just put it into Envoy. And so the more protocols Envoy supports out of the box or even just dropping down to TCP based services, we get MTLS for free and, our, like, infrastructure people and security people love that.

Matt Klein

Cool. Uh, do you have an answer?

Snow Pettersen

Um, no. No.

Matt Klein

Um, okay. So switching topics slightly, uh, you know, I, I, as you all obviously know, Envoy’s only been open source for two and a half years and rightly so. I think a lot of people are skeptical about quality, reliability, all of those things. Um, so I guess, I don’t know, could you give your perspective on, uh, what about the project made you comfortable trusting it to do what you’re doing with it when it’s relatively new? Um, I think that would be interesting for, for, for people to hear.

Snow Pettersen

Yeah. Um, so just to kind of touch on how we ended up choosing it, we kind of looked at it and found that we were first attracted by the, uh, xDS API, being able to configure it remotely, centralized control plane, all of that stuff. Um, and then as we started using it, we started sort of testing it locally. And at the time, it just was way better than the in-house sidecar implementation that we already had. It was faster, it was easier to reason about, it was much, um, uh, much more performant. So over this last year, we’ve never really gotten any issues with, like, Envoy’s performance or correctness, uh, beyond us misconfiguring it. Um, so we’ve just been able to generate a ton of confidence in it because it’s just never gone wrong. Obviously there’s bugs, there have been issues, but really in general, um, we have had no performance issues with it at all.

Ryan Michela

We adopted Envoy pretty early on. The fact that Lyft was using it internally for a while before you open sourced it was heartening to us. We knew that there was already a proof point that this was a project that a major company that had a lot of traffic was using successfully. So that helped us kind of agree to adopt it. And then, you know, we’ve been running it in production. We ran it in our test environments for a while, ran it in production and did a seamless migration from Linkerd over to Envoy and no one noticed. It’s been solid.

Ben Plotnick

Um, for us when we started, we have these hackathons and I did a hackathon project to see what it would look like if we replaced HAProxy, um, in the service mesh with Envoy. And I immediately hit a bug, or a performance issue. And uh, that was kind of scary. But, um, it ended up being that Matt, like, fixed the thing within a day or something. And, uh, I think I’ve had like two or three other, uh, issues that, uh, we’ve seen where, you know, I hit some weird use case, might not be a bug or whatever. Um, and it gets fixed like, I dunno, sometimes within hours. Um, so the strength of the community, uh, very much builds confidence in, um, us using Envoy. I mean, we’re comparing this with, like, NGINX and HAProxy, which we use both of, uh, which have been around forever.

So it’s hard to make the case that, like, this new thing is going to be as reliable as that. But when you have a community that is, uh, you know, active in Slack and active on GitHub and will help you out, it’s a much easier case to make. And you know, we run on master. Um, a ton of companies run on master. Master quality is very good. Um, and that’s, yeah, that’s great. So we’ve had really no issues since then. I don’t think it’s ever crashed. Um, most of our issues are again, like Snow said, with our configuration.

Ryan Michela

I’d also like to echo that the strength of the community is a major thing that made us really trust working with Envoy. Um, the Envoy community has been really open to outside contribution and outside bug fixes and, you know, really friendly and easy to work with. Um, one of my team members has been very active in the community contributing to Envoy and, uh, he’s had a really good experience and we’ve always found the Envoy community to be very responsive. Yeah.

Snow Pettersen

Yeah. Like, and similar to what everybody else is saying, we had a lot of confidence that whatever issue we ran into, we could get fixed. So we were very confident that, um, even if you would run into issues, performance or behavioral things, you open up an issue and you talk about it and you get feedback within the day and it’s a very fast turnaround, and I’ve been able to go from seeing an issue, like a feature disparity, whatever, to having a fix in within four or five days. And then we deploy it to production, and being able to do that and getting, um, people that really understand the subject matter, like Matt, to review the code so that you are very confident that it’s a good fix and it’s a scalable thing, um, made us even more comfortable running it because it is not us making our own thing at home. It is being vetted by people who really understand the, um, the proxy domain.

Matt Klein

Cool. All right, let’s, let’s switch to the fun one. Uh, could you, uh, each briefly tell us your biggest production horror story with Envoy?

Snow Pettersen

Um

Ben Plotnick

Huh.

Uh, I, I can tell you two cases of the same thing, which is that, um, we’ve DDoS-ed ourselves twice, um, with Envoy. Um, the first was, uh, we misconfigured health checking and, uh, we had Envoy health checking the services themselves. Envoy can actually do, like, health check caching where it sits in front, and Envoy can take pretty much whatever traffic you send at it, but our services cannot. Um, so when you health check your services, um, from, you know, thousands upon thousands of other services, you get an n squared, uh, health checking problem, which is basically your own botnet. Uh, and, uh, yeah, we didn’t take down production, but we did take down, um, our entire development environment.
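To put rough numbers on the n squared health checking problem described above, here is a back-of-the-envelope sketch. The client count, backend count, and intervals are hypothetical and were not given in the panel, and the cached case is a deliberate simplification in which a local proxy answers checks and probes the service once per cache window.

```python
# Back-of-the-envelope model of mesh-wide active health checking.
# All numbers below are hypothetical; none were stated in the panel.

def checks_per_backend_per_second(clients: int, interval_s: float) -> float:
    """Health checks one backend instance receives per second when every
    client proxy actively checks it once per interval."""
    return clients / interval_s


def total_checks_per_second(clients: int, backends: int, interval_s: float) -> float:
    """Health checks per second across the whole mesh: every client checks
    every backend, so the load grows with clients * backends."""
    return clients * backends / interval_s


def cached_checks_per_backend_per_second(cache_ttl_s: float) -> float:
    """Simplified cached case: a proxy in front of the service answers checks
    from a cached result and only probes the service once per cache window."""
    return 1.0 / cache_ttl_s


# Hypothetical fleet: 5000 client proxies, 200 backend instances, 5 s interval.
per_backend = checks_per_backend_per_second(clients=5000, interval_s=5.0)
mesh_total = total_checks_per_second(clients=5000, backends=200, interval_s=5.0)
cached = cached_checks_per_backend_per_second(cache_ttl_s=5.0)

print(f"uncached checks per backend per second: {per_backend}")
print(f"uncached checks across the mesh per second: {mesh_total}")
print(f"cached checks reaching each service per second: {cached}")
```

The point of the model is only that the uncached load on each service scales with the number of callers, while the cached load does not.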

Ryan Michela

So we’re always very careful at Salesforce. We say trust is our number one value and all that. And we take that very seriously. So we test very thoroughly in our pre-production environments. So we’ve never had a production outage because of Envoy. Um, we have hurt ourselves in our internal development environments for brief periods of time. And I would say it’s, we implement our own control plane. So we took the xDS control plane APIs and we built our own internal one, and they’re not trivial to implement. And there’s actually a lot of interesting state management you have to consider when you’re building your own control plane, in ordering of operations and such. So you can do something like you can create a service and then if you reboot your control plane, you can tell the Envoy that that service exists, but it no longer has any endpoints.

And Envoy would, I think this has been fixed, it definitely has been fixed because we’ve helped fix it. Uh, it will happily say, oh, okay, all the endpoints went away. There’s nowhere to route, the route is dead. But that was actually a control plane implementation issue on our part. And we fixed that on our control plane. And now one of the things we really have found excellent about Envoy is when our control plane, say, crashes, because we glitched it in, you know, our development environment, the Envoys just keep humming. We’ve had control plane issues that have gone on for weeks and we didn’t notice them because Envoy will just happily keep doing whatever you told it to do last when your control plane vanishes or is recycling every 30 minutes.
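The restart scenario above can be illustrated with a toy model. This is only a sketch of the idea, not Salesforce's control plane and not the xDS API: `ToyControlPlane`, `ToyProxy`, and the `warmed` flag are hypothetical names. The control plane refuses to publish endpoint sets until it has reloaded its discovery state, and the proxy keeps its last applied configuration when no update arrives.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyControlPlane:
    """Hypothetical control plane holding service -> endpoint lists."""
    endpoints: Dict[str, List[str]] = field(default_factory=dict)
    warmed: bool = False  # False until discovery state has been reloaded

    def load_from_discovery(self, snapshot: Dict[str, List[str]]) -> None:
        self.endpoints = {name: list(hosts) for name, hosts in snapshot.items()}
        self.warmed = True

    def snapshot_for(self, service: str) -> Optional[List[str]]:
        # Guard: right after a restart the in-memory state is empty. Returning
        # None ("no update yet") avoids telling proxies that a known service
        # suddenly has zero endpoints.
        if not self.warmed:
            return None
        return list(self.endpoints.get(service, []))


@dataclass
class ToyProxy:
    """Hypothetical proxy that keeps serving its last applied config."""
    last_applied: Dict[str, List[str]] = field(default_factory=dict)

    def apply(self, service: str, update: Optional[List[str]]) -> None:
        if update is None:
            return  # control plane gone or not ready: keep the last config
        self.last_applied[service] = update

    def route(self, service: str) -> List[str]:
        return self.last_applied.get(service, [])


# Normal operation: the control plane is warmed and the proxy gets endpoints.
cp = ToyControlPlane()
cp.load_from_discovery({"payments": ["10.0.0.1:8080", "10.0.0.2:8080"]})
proxy = ToyProxy()
proxy.apply("payments", cp.snapshot_for("payments"))

# Control plane restarts with empty state and has not reloaded discovery yet.
restarted = ToyControlPlane()
proxy.apply("payments", restarted.snapshot_for("payments"))

print(proxy.route("payments"))
```

Without the `warmed` guard, the restarted control plane would return an empty list and the proxy would drop every endpoint for the service, which is the failure mode described in the answer.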

Snow Pettersen

I think we haven’t had any, like, production outages. Um, probably the biggest issue we had was, um, with Envoy we gave people the ability to control traffic flow. Um, so something that had previously been using a static IP was now, um, the app owner could kind of enable and disable, um, hosts for that, which at some point caused our, like, internal SSO portal to go down because, um, using the UI to update this, they disabled the service that it depended on, which broke the SSO portal, and then everything was down for like a few hours just because we had added in this, like, dynamic configuration option for app users. And we didn’t have the right safety guards in there, which caused an internal outage because somebody, like, misconfigured something. Um, that’s the biggest issue we’ve had. And again, it’s not about the process being overloaded or whatever, it was about our configuration of it not having the right, um, um, safety guards in place.
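One possible shape for the kind of safety guard mentioned above is a check that rejects a change which would leave a service with too few enabled hosts. This is a hypothetical sketch, not Square's tooling; `validate_disable` and the `min_enabled` threshold are made-up names for illustration.

```python
from typing import Set


def validate_disable(enabled_hosts: Set[str], to_disable: Set[str],
                     min_enabled: int = 1) -> Set[str]:
    """Return the hosts that stay enabled, or raise if the change would leave
    fewer than `min_enabled` hosts serving traffic."""
    unknown = to_disable - enabled_hosts
    if unknown:
        raise ValueError(f"cannot disable hosts that are not enabled: {sorted(unknown)}")
    remaining = enabled_hosts - to_disable
    if len(remaining) < min_enabled:
        raise ValueError(
            f"refusing change: {len(remaining)} host(s) would remain, "
            f"minimum is {min_enabled}"
        )
    return remaining


# Disabling one of two hosts is allowed.
print(sorted(validate_disable({"sso-1", "sso-2"}, {"sso-1"})))

# Disabling the last host is rejected before it reaches the proxies.
try:
    validate_disable({"sso-1"}, {"sso-1"})
except ValueError as err:
    print(f"rejected: {err}")
```

A real guard would likely also need to know which services depend on the one being changed; that dependency data is outside what the panel describes.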

Matt Klein

Yeah. So one of the biggest questions that I get from potential new users is, where do I get my control plane? Uh, so one of the things I would love all of you to answer is, you have all built your own control planes. Um, you know, you probably did that because you’re fairly early adopters. Um, I’d love to hear your thoughts. If you were to talk to a new Envoy user today, how would you counsel them to think about whether to build their own control plane, either on top of go-control-plane or java-control-plane, or should they go use a vendor, whether that be App Mesh, or Istio or something else?

Snow Pettersen

Yeah. Um, from our perspective, it seems like the biggest factor is, it depends on the complexity of your current service discovery setup. Um, if you’re starting from, um, from nothing, you probably shouldn’t build your own control plane because, um, you can find vendors, you can find managed systems that can do it for you. Um, if you do have an existing service discovery and routing system in place that you need to mimic, um, or, in our case we had, uh, our homemade RPC, um, library that had very custom, uh, retry mechanisms and hedging and all these kinds of things that we needed to support, and knowledge of backing topologies and all that kind of stuff that we needed to support. So we needed very specific behavior. And we also had an existing circular based service discovery data store where all the data was, in the registration pipeline.

All that stuff was already working, was in place. We didn’t have to touch that. So for us, it made a lot of sense to put a control plane in front that consumes this data and generates the very custom behavior that we needed. Um, so instead of us trying to, like, find some, uh, managed service and beg them to add in all the features we need, uh, to do our very specific behavior, we built our own and it’s worked well for us and it’s been able to, like, let us have full control over it to, uh, very effectively mimic the behavior of the old system, which again has allowed us to roll it out without people really having to care whether or not we’re using the old system or using Envoy.

Ryan Michela

Uh, when we started building our control plane, uh, Istio was already beginning to be a thing, and so we decided we were going to start building our own control plane but only do the bare minimum necessary to get the service mesh running, you know, we started with just implementing service discovery. Um, since then we’ve added some additional features, um, uh, service protection, um, uh, RBAC, we started to add some of those, but it was the bare minimum. Just get started … just get your Envoys configured and give them their discovery information. If you’re starting greenfield, don’t build your own control plane. They’re complicated, they’re fiddly. There’s state, there’s ordering. The xDS APIs are very well documented, but there are still edges. Istio is also becoming progressively more mature and, you know, we are now in the process of converging Salesforce onto pure open source community Istio. Like, our goal is we don’t want to own forks of anything.

Ben Plotnick

Mmm. Yeah. I think I’m in the same boat as Snow where we have a pretty simplified view that, uh, developers interact with. So like we don’t want developers to think about this massive YAML or whatever, like what we want is for developers to have a pretty simple, opinionated, um, like a small set of options of how to configure their services. Uh, and this makes it so that developers can, like, spin up services really easily and don’t really have to think about it all that much. Uh, so our existing system had this, um, and we built a control plane as like a, uh, I guess like branch by abstraction type thing where we can swap between, uh, our old, uh, HAProxy-based system and Envoy without the developers knowing. And that’s exactly what we do. We’re actually kind of partially on Envoy, um, and we could do it very gradually and the developers don’t know the difference. And that’s basically our main goal.
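The branch by abstraction approach described above could look roughly like the following sketch. It is not Yelp's implementation: `ServiceConfig`, the two render functions, and the hash-based rollout percentage are all hypothetical, and the rendered dictionaries stand in for real HAProxy or Envoy configuration rather than reproducing either syntax.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Union

ConfigValue = Union[str, int]


@dataclass(frozen=True)
class ServiceConfig:
    """The small, opinionated set of options a developer actually sets."""
    name: str
    timeout_ms: int = 1000
    retries: int = 1


def render_for_haproxy(cfg: ServiceConfig) -> Dict[str, ConfigValue]:
    # Placeholder for the old system's generated config (not real HAProxy syntax).
    return {"proxy": "haproxy", "service": cfg.name,
            "timeout_ms": cfg.timeout_ms, "retries": cfg.retries}


def render_for_envoy(cfg: ServiceConfig) -> Dict[str, ConfigValue]:
    # Placeholder for the new system's generated config (not real Envoy syntax).
    return {"proxy": "envoy", "service": cfg.name,
            "timeout_ms": cfg.timeout_ms, "retries": cfg.retries}


def rollout_bucket(host: str) -> int:
    """Map a host name to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def choose_backend(host: str, envoy_percent: int) -> str:
    """Hosts whose bucket falls under the rollout percentage get Envoy."""
    return "envoy" if rollout_bucket(host) < envoy_percent else "haproxy"


def render(cfg: ServiceConfig, host: str, envoy_percent: int) -> Dict[str, ConfigValue]:
    """Same developer-facing config, different proxy behind the abstraction."""
    if choose_backend(host, envoy_percent) == "envoy":
        return render_for_envoy(cfg)
    return render_for_haproxy(cfg)


cfg = ServiceConfig(name="search", timeout_ms=500, retries=2)
for percent in (0, 100):
    print(percent, render(cfg, host="app-host-17", envoy_percent=percent)["proxy"])
```

Because the bucket is derived from the host name, raising the percentage only ever moves hosts from the old proxy to the new one, which is what makes a gradual swap possible without developers changing their service definitions.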

Matt Klein

Cool.

Uh, is this session 30 minutes or 40? Right. Right. Okay. I definitely know what I’m doing. Um, great. Uh, no, that, that was awesome. So, I guess I can ask one more question, but, uh, we have just a couple of minutes left. Is there anyone out there that would like to ask a question of these folks?

Nothing. Okay, great. Sorry. What?

No. Okay. Um, so yeah, I guess my last question in closing would be, you know, I’ve heard all of you talk about, you know, you’ve had issues because the configuration is complicated and, you know, from a project perspective we’re always having this tension point of people want to do a bunch of complicated stuff. Yet, people also want very simplified configuration that is, uh, impossible to get wrong. Um, and I guess at least from my perspective, it’s almost impossible. You know, you can’t satisfy both camps, I guess, which is why personally I think that you have to have layered abstractions on top that get more and more opinionated. Anyway, I’d like to hear each of your thoughts from an end user perspective on, um, either, you know, what could Envoy do to make it more accessible to people, whether that be better documentation or better error messages or any other thoughts that you have for people, uh, to help them maybe not hit some of these rough edges.

Snow Pettersen

Yeah. Um, documentation and examples. That’d probably be the biggest thing. Um, a common thing is people complaining about the lack of complete examples of how to do things. Um, you can often see examples of, like, specific filters, but people are like, but how do I do HTTP/2, how do I use gRPC? Like larger examples, more complete ones. Documentation is also, yeah, it took us like weeks to be comfortable digging around the documentation, um, being able to find what we were looking for, um, another search UI, whatever it is. It’s like you need to know the documentation to be able to navigate it. So I think it can definitely be improved. Um, and then yeah, I agree with you. You do need to keep the API itself very, um, large, very powerful, to be able to, like, express everything. And I think a lot of the vendors provide, like, simpler APIs, like Istio will provide a simpler API on top of it. Um, and some of that is useful, but for our use case, we do need the very bottom because I’ll add a feature and I need to use it right now. I’m not gonna wait for, um, several layers of abstraction to add it in when, like, there’s another way.

Ryan Michela

So thinking about service meshes, they are fundamentally complicated. Like, let’s not kid ourselves: service mesh, Kubernetes, Istio, this is extremely complicated stuff. It does a huge, wide array of things. And I think there’s this idea of accidental complexity versus essential complexity. And as far as I’m concerned, Envoy is very good about not introducing accidental complexity, the unnecessary complexity, to something. But it’s fundamentally complicated stuff. So I know we’re at a service mesh conference, but do you need one?

Really that’s a, that’s an important question to ask yourself. If you’re building something and you’re looking at it and like, do you actually need the complexity of a service mesh? If you do embrace it, learn it, it’s hard, but it’s worth learning. If you don’t, don’t just add one to have accidental complexity in your system.

Matt Klein

Totally agree with that. I totally agree. Yeah.

Ben Plotnick

Yeah. Good point. Uh, I do think that you can pick and choose the complexity though. Like I think the scope of things that you can do, it’s a diner menu, but you know, you can pick the things that you need, um, and don’t just kind of get the kitchen sink because it’s there. Uh, I don’t know. We don’t have what people really think is a service mesh. We don’t have Envoy in front of all of our services. We actually have it only on egress and not on ingress. And that’s because we really didn’t need it, um, yet. So like, yeah, pick the problem that you actually want to solve. Start with that and then, uh, figure out what technology or what configuration you need to solve that problem.

Snow Pettersen

Like, I think, Envoy is … like, usually, um, associated with service meshes. But in reality it is just an extremely flexible proxy and you can use it for whatever you want. And a lot of people like to go for, like, a full service mesh. But like Ben says, you don’t need to. Um, if you just want to use it to do consistent hashing, as Ben is, you can do it. You don’t have to use all of it. So yeah, pick the pieces that you actually need.

Matt Klein

Cool. Ah, that was awesome. Thank you, uh, to our panelists.
