The Four Questions Every Monitoring Engineer is Asked


[funky piano music] [coffee beans grinding]>>So what amazes me is no matter how many times we talk about it, no matter how many times
we, you know, do blog posts, should we talk about it at SWUG, the same four questions keep coming up.>>Yeah, and every time
we seem to get together we talk about it. But it’s because it
wasn’t a throw away idea.>>No.>>These are the same
questions, over and over again, that people end up asking themselves or the people doing the monitoring.>>Right, but I think the
more important point is that the questions speak to
the heart of creating a solid, robust monitoring design.>>Oh, yeah. That if you’re ready to
answer those questions, then your monitoring is by
definition stronger for it.>>Yeah, it’s no longer turnkey. It’s not one of those things
where you just install it, and you’re like, “Yeah, it works.” It’s all of a sudden,
no it’s actually become part of the larger ecosystem.>>So, okay, so it starts
off, when somebody says, “Oh, yeah, I’ve got monitoring, I get e-mails,” [laughs], right.>>And we do this every time,
we have this discussion, I’ve talked about alerting. You’ve talked, we’ve talked
about it at THWACKcamp before, and alerting is great. But alerting is not monitoring.>>Right, alerting not equal to, bang, equal.
>>Not equal to or not equal to yeah, it’s not the same thing. But, unfortunately a lot of
people that consume monitoring, for a lack of better way to put it, are the ones that receive
these e-mail alerts. So, if you’re going by that,
then you’re in critical, and I am fine and that’s the end of it.>>[laughs] I like it.>>Yeah, but that’s not
the way it really works. The real part about
monitoring is all the stuff before the alerting. The alerts are key and
they’re super important, and you have to have them, but, it’s not the meat and potatoes. It’s not what actually
makes monitoring, monitoring.>>No, okay, so first question, is what?>>First question is, why
did I get this alert? And everyone gets it, especially when you first install an NMS, doesn’t matter who makes it, what it is. Someone says, “Oh, yeah,
I’ll put in my SMTP server, and I’ll put in some authentication, and I hit, save, and boom the flood gates are just opened.”>>Right, right, yeah, turn them all on. Turn them all on let’s see where it goes. Yeah, no, so, do you end
up with people asking you, “Why did I get this?” And it really comes down to one thing. Their alerts suck.>>Yeah, well, if they have to, if they ask that question in that way, sometimes it’s a legitimate question. Sometimes it’s like, “Why did I get this, “I don’t understand, why I’m
involved with this process.” They should already have that though. The real question, the root
of this question though is, should be answered by the content.>>Yeah, exactly.
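To illustrate an alert whose content answers "Why did I get this alert?", here is a minimal Python sketch. It is not taken from any particular monitoring product. The `AlertContext` fields and the `render_alert` helper are hypothetical names, chosen to carry the details the speakers go on to list: device name, IP address, who is responsible, the threshold versus the reading, when the reading was taken, and a knowledge base link.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class AlertContext:
    """Details a recipient needs in order to understand an alert.

    All field names are illustrative assumptions, not a real NMS schema.
    """
    node_name: str
    ip_address: str
    owner_team: str      # who is ultimately responsible for the device
    metric: str
    threshold: float     # where the alert is set to trigger
    reading: float       # what the value was when it triggered
    read_at: datetime    # when that reading was taken
    kb_article: str      # next steps, even if they are just "call Bob"


def render_alert(ctx: AlertContext) -> str:
    """Build a message body that says what tripped, when, and what to do."""
    return "\n".join([
        f"{ctx.metric} on {ctx.node_name} ({ctx.ip_address}) crossed its threshold.",
        f"Threshold: {ctx.threshold:g}  Reading: {ctx.reading:g}  "
        f"at {ctx.read_at.isoformat(timespec='minutes')}",
        f"Owner: {ctx.owner_team}",
        f"Next steps: {ctx.kb_article}",
    ])


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = AlertContext(
        node_name="web01",
        ip_address="10.0.0.5",
        owner_team="server team",
        metric="CPU load",
        threshold=90,
        reading=93.5,
        read_at=datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc),
        kb_article="Knowledge base article: call Bob",
    )
    print(render_alert(sample))
```

Putting the threshold and the timestamped reading in the same message means someone who responds an hour later can still see what the value was when the alert fired.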
>>It’s like, “Why did you give me this pamphlet?” “Because I would like
you to see this museum.”>>Okay, fair enough.
>>It’s, “Why did I get this alert?” “Because, here’s all the
things that pertain to you.”>>But so many alerts are, “It is down.” Like that’s it, that’s it. Something happened, it is down. Could you be more vague,
no I don’t think so.>>They could try. They could not give you the name. Something is, they gonna
literally say something is down.>>That’s what I was talking
about, I have seen these, I have seen these–
>>No, in the wild.>>Yes, a thing is down, yes.>>I want to be the person
that gets that message so I can just throw my hands and be like, “I’m going home, my day is done.”>>Exactly, so the opposite
end of that though, is where you have things like, server name, IP address, how the DNS resolves versus how the machine
has configured itself.>>Correct.>>What else, what else
do you throw in there?>>Who’s responsible for it ultimately. Is it part of a larger web farm. Is this part of a
specific group of things, if it actually is not a, down. Down’s always the simple thing. And down’s really not that
relevant, it’s still relevant, but it’s not as relevant
as it was 15 years ago.>>It’s critical, and critical
can mean many things.>>Yes, but down, yeah, there you go. But down is not sometimes
as important as slow, or critical or, warning
is sketchy subject, we can talk about that. But, it’s one of those things that, as much as I would get
a, love a, no down alert. If it’s part of a larger web farm, I don’t care, as long as, you know, 60% of the rest of the
stuff is up and operational.>>Right, well, okay, and
there’s the other piece that I really like when I see it in there, is the threshold is such-and-such, and it’s currently reading at
such-and-such, at this time. Because someone maybe responding to it, minutes or hours away. So to know that at 45 minutes
ago it was reading at this, but now it’s dropped back
down again, tells me, “Oh, I’m not crazy,” it wasn’t
a false reading, it’s moved. But also the threshold is set here. I like that when that’s in the alert also.>>Yeah, well, I think you
kind of need that though, otherwise the alert is crap. If I tell you, “Hey, your cup is empty,” that’s great, but where
do you go from there?>>Right, how full does it need to be?>>How full does it–
>>Does it need?>>Does it need to be
all the way to the top? And the answer is, “With
these probably not, “’cause we will get the shakes.”>>[laughing] That’s what I’m going for.>>Yeah, okay, good, good,
we’ll get there by the end. But I’ve actually had people
come up to me at times, and be like “Hey, “I got this alert, do I need
to do anything about it?”>>Right, that’s the other piece; is it linked to a knowledge base article? One of the jobs that I was
in an alert couldn’t go live until there was a knowledge
base article associated with it. Even if the knowledge base
article said; call Bob.>>We always talk about, Bob. Poor Bob, Bob is excellent, one of these days we’ll hire a Bob.>>Right, right and you’ll
have to give him a raise. [funky jazz music] [laughing] So, okay, I think that takes
care of the first question. The second one is the exact opposite. “Why didn’t I get that alert?”>>Yeah, and that one’s always, see that’s the one that
I’ve always kind of mentally broken down a couple of different ways. ‘Cause, why did they get an alert means, something triggered somewhere.>>Right.>>And for some reason I think
you should know about this. That’s in the strictest way,
that’s the way I look at it. But, “Why didn’t I?” could be, maybe you weren’t supposed to. That’s the basics. It’s like, maybe you weren’t supposed to. This is during a change
window, we’ve got muting on, or unmanage or whatever you do, and where you’re not
supposed to know about this, or it only alerts during business hours.>>Right, or it doesn’t go to your group. Or it goes to your group during the day, but it goes to somebody
else’s group at night.>>Right, and that’s all, follow the sun, kind of stuff you have to worry about. But, my favorite is, and I have had it. I had my boss come up
to me once, and he come, and he basically like knocked
at the edge of my cube, stuck his head in, and was like, “Hey, is e-mail okay? “Because I didn’t get an e-mail “about the e-mail being down?” And I just kind of, I literally, I’m sitting there and I just
looked up and I just paused. And I did the three counts
of, one, two, three.>>’Cause, maybe he’ll hear himself?>>’Cause sometimes when
you, and it happens. And I’ve done it before,
I’ve said the dumb things. But the fact that he
actually come up to my cube and did that, and then
still had that straight face looking at me. I was like, “I’ve got to let him know, “and I don’t know a polite
way to do this right now,” ’cause I’m in the middle of working on that e-mail down issue. ‘Cause I got it through
a different avenue. That’s the kind of thing
you have to worry about. Sometimes it’s, but sometimes
it’s simpler than that. Sometimes it’s, someone
changed the credentials, or the IP is shifted, or
maybe the DNS name shifted, and that’s how you’re looking at stuff. These things, during
general business practices, those can happen. And you need to account for
that, in either the logic, or even better, in the alert. Say, “Look, you got this
alert, but these are the things “you need to check first,” tying right back to the KB article. Take that back to a
knowledge base and say, “Look, you’ve gotten this,
check this, this, this and this, “and then if none of those work, “then maybe you need to
start working on it.”>>Right, there’s also some
other things that can happen in the environment. For example, I’ve watched
physical to virtual shifts–>>Oh the P-to-Vs, yeah.>>Yeah, P-to-V happens,
and everybody else is like, “Yeah, look, the system’s up, everything,” but, it’s a completely different system. Half of the elements
that were on the system don’t exist anymore, and they’re like, how come I didn’t get an alert on CPU? Because the CPU is fundamentally different than it was before. It had two and now it has eight, but they don’t really exist, you know. Things like that, you know.>>My favorite with that
was actually memory. If you actually used, it’s
not called ballooning, but basically where you dynamically say how much memory’s
for the operating system. And it’s like, we used 90%,
all of a sudden an alert gets fired and it’s like, but it’s virtual and it’s 60 seconds later,
it allocated four gigs more, and all of a sudden
that no longer applies. And things like that,
challenges like that, including containerization
are things you have to think about as IT continues
to mature in that way. ‘Cause we’ve made that our own problems, by building this technology.>>Yeah, definitely.>>Because it used to be, this
is a server, it’s a thing, like in a closet, I can
go up to it and touch it. And when it goes bad, I can go
over to it and touch it again>>And see the blinky lights or not blinky or whatever, right.>>And now it’s like where is this thing? I don’t know, it’s virtual, so it’s on five different
hosts somewhere, maybe? It’s in the cloud, which means it’s someone
else’s server, fundamentally. Or it’s in a container, which means, “Oh, yeah we just basically blow it up, and start it over constantly.” We’ve got to make sure we
continue to keep that in mind.>>So going back to something
you said about the e-mail, that I think a lot of folks
overlook is that, you know, your supply chain for your monitoring, what is the complete flow? That itself has to
be monitored and tested. That you have to know
that your messaging, all the way from your ingress, all the way to the egress, works, which requires you to have
some sort of troubleshooting, testing, monitoring, et cetera, to know. Because that’s another reason
why you won’t get an alert. You know, “Why didn’t I get that,” okay, monitoring doesn’t fail
a lot, when it’s done right. But, it can, you know, again, “How come I didn’t get an e-mail, “that the e-mail server is down?” Also if you’re using
e-mail for your alerts.>>Please, just stop. I mean, for some things, great. But there’s so many other avenues now. I mean, 10, 12 years ago, did we really have that many choices?>>We still had pagers, I was there.>>Yeah, well, I also
had, I had my Blackberry that they can send me direct messages to, but it was still an e-mail underneath. So if your SMTP’s having a problem, that’s why SMTP’s are
the thing to be watching. [funky jazz music] [coffee beans grinding]>>All right, so the second question. I think we’ve covered the
second question [chuckles] well enough, but the third question gets a little bit more in
depth in terms of technique and the thing that you
have to do to set it up. The first two are about just
making sure your environment is sort of solid, but the third one; what is monitored on my system? Not what’s alerting, not whatever, but what is being monitored on my system?>>Me as a server or application owner–>>Or the network team or the wireless team.
>>Not the monitoring engineer.>>Right, what is being monitored requires you to do a little bit more work in the sense that you need
to be able to provide reports.>>Yes.>>And the thing is, is
that whatever you do, one size does not fit all,
it doesn’t always even, one size doesn’t fit most.>>No, no that’s why we have
six cups in front of us.>>Right [chuckles], it needs to be a report that, it’s a set of techniques
that you know how to do. That you know how to pull data
out of the monitoring system, out of the database, out
of the agent reports, out of whatever it is, that you can say, “All right, first of all,
what are your systems?” If you cannot quantify “my systems”, $mysystems, all right, for $mysystems
equals server team, or for $mysystems
equals network team. You need to be able to
quantify those systems.>>And you need to be
able to do it repeatedly. It’s not like a one shot, like every, even if it’s like a long shot, like every six months,
but it shouldn’t be. It should be much more–
>>It won’t be.>>The brilliant part of this
is, is if you build this right, you’ll actually be
able to give people access so they can check it themselves. So they should be able
to go to a dashboard, and say, “What is me,
what do I care about?”>>”Why is that system on my list?” “Well, you told me all the
Windows servers in the, “slash 24 subnet over here,” it’s like, “Oh, who put that there?” Yeah, that’s not the question though. The question is what’s being
monitored and what you need. There’s really two skill checks.>>Yeah.>>First skill check, is SQL, you need–>>Just enough.>>Right, right, well, okay,
more SQL skills never hurt. But you don’t need a ton
to be able to do this. You need to be able to pull the data out, and I don’t care what tool you’re using, it’s in a database someplace and SQL–>>Some type of structured query language will get it back out, regardless of what actually
the underpinning of that is.>>Precisely, so you need
to be able to pull it out, and the second skill check
is actually, Wireshark.>>Okay, tell me, explain
to me why Wireshark? Because I know from my network engineering why I need to understand, Wireshark.>>Because a lot of times you
can go into the monitoring tool and you can look at the screens and say, “Oh, well, I’m obviously
pulling CPU, or RAM, or bits per second,
or, you know, whatever, or NetFlow or what have you. It’s a lot of the times, you’re not sure what’s being collected, and you don’t know exactly
where in the database to look. So by putting Wireshark
right on the target device. Right on the interface
of the target device, and listening only for
that traffic and seeing, “Oh, there was just a request via IPSLA, I didn’t know it was doing that. Oh, there’s an SNMP
request for this OID. What’s that OID?”
>>There’s a script execution, something else that’s not
kind of the traditional SNMP, WMI style polling.>>Right, and it’s not, now that’s not something
you’re gonna do all the time. It’s gonna be, when you
have your sample device, you’re gonna Wireshark
it, you’re gonna watch it, you’re gonna get a sense of it. And then you’re gonna
say, okay, now I know, and that goes into the report. But then, pulling all that data
together and combining it–>>And making it digestible for
whoever you’re making it for because I can dump, I
mean, we’ve both done it, where you dump those master,
looks like a huge spreadsheet. And it’s like, “No it
makes perfect sense to me,” but the minute you go to somebody else, they’re gonna be like, “I’m just looking at a bunch of numbers.”>>Right, right.>>That’s not what you, that’s
not doing your due diligence as a monitoring person.>>Right and on top of it, not
only is it, name of system, monitoring CPU, whatever, but
you wanna be able to blend in there, the alerting thresholds. What is warning? What is critical? What is, you know, on this system? Because it could be tuned
per device or per target, or per application. Or per application on per device–>>Number of permutations just spiders, and that’s why you’ve got to be able to pull this report together
because you’ve got to be able, if someone says, “What about this thing?” You can go right down, “There it is right
there, that’s the thing.” [coffee machine whirring]>>It’s alive. [evil laughing]>>More.
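The "what is monitored on my system?" report described above can be sketched as a single SQL query run from Python. This is only an illustration under stated assumptions: the `nodes` and `monitors` tables, their columns, and the sample rows are all hypothetical, and a real monitoring tool will name and structure its database differently. The point is the shape of the report: one row per monitored item for a given team's systems, with the warning and critical thresholds blended in.

```python
import sqlite3

# Hypothetical schema, for illustration only. A real monitoring database
# will use different table and column names.
SCHEMA = """
CREATE TABLE nodes (
    node_id    INTEGER PRIMARY KEY,
    name       TEXT,
    owner_team TEXT
);
CREATE TABLE monitors (
    node_id  INTEGER REFERENCES nodes(node_id),
    metric   TEXT,
    warning  REAL,
    critical REAL
);
"""

# One row per monitored metric on each system owned by the given team.
REPORT_SQL = """
SELECT n.name, m.metric, m.warning, m.critical
FROM nodes AS n
JOIN monitors AS m ON m.node_id = n.node_id
WHERE n.owner_team = ?
ORDER BY n.name, m.metric
"""


def monitored_for(conn, team):
    """Return (system, metric, warning, critical) rows for one team."""
    return conn.execute(REPORT_SQL, (team,)).fetchall()


def demo_db():
    """Build an in-memory database with made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO nodes VALUES (?, ?, ?)",
        [(1, "web01", "server team"), (2, "core-sw1", "network team")],
    )
    conn.executemany(
        "INSERT INTO monitors VALUES (?, ?, ?, ?)",
        [
            (1, "cpu_percent", 80, 90),
            (1, "memory_percent", 85, 95),
            (2, "interface_bps", 7e8, 9e8),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in monitored_for(demo_db(), "server team"):
        print(row)
```

Because the team is a query parameter, the same report can be rerun for the server team, the network team, or anyone else, which is what makes it repeatable rather than a one-off export.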
[cups clinking]>>Wait a minute, my, my,
my, my, my, my, my, my. You’re gonna touch it.>>How do I get this [mumbles]? [laughing]
[funky jazz music] [laughing] I got to sit [mumbling]–
>>Ting, yeah.>>So there’s the third question. Fourth question is; what we
have affectionately called, “The Don Quixote Project,” right?>>Yes, yeah, we have, Project Windmill.>>So what will alert for my systems–>>What could potentially, not, hopefully most of these alerts won’t fire, ’cause you built them properly. But what could potentially fire?>>Yes, on again, wanna
get, $mysystems, which alerts that exist are going to be, are going to, like,
actually potentially fire? And that’s an amazing challenge to know, because nothing’s fired,
everything’s green right now. Or everything’s blue right now.>>Well, you’re unmanaged–>>Yeah, [laughing].>>We’re gonna [mumble].>>You’re unmanageable in
general, it’s what I hear.>>Okay, everything’s, like,
in green right now, so–>>How do I know what
could send me an e-mail, or run a script or do this thing? Getting this list out,
while they’re in this state, that’s a challenge.>>Right, and it’s important though because what you’re talking about, we’ve talked about this before, that an alert represents unscheduled work, unplanned work, right. And you’re actually financially
burdening your company, potentially financially
burdening your company, if you have a particular alert that requires half-an-hour to deal with, and it could occur, it
could alert on 2700 systems. The work amount is incredible on that.>>Yeah, 1300 hours, something like that.>>Something like that, so
we threw together a script. You threw together a script.>>I built the script,
well, you found the query.>>Yes.>>You found the query and
the URI that had to be called the verb, it had to be called. And then I took it, and I
had generalized it more, which I know, you told me,
makes it harder to read. And I understand that, but
it also does more at once. And I commented it, you’ve
read through it a little bit.>>Yeah, yeah.>>But my favorite part about that is putting a little debug flag in. So basically just scans
everything and just pulls like, here’s an example of 10 things. So you know whether
your individual queries are working cleanly.>>Right, exactly.>>It’s a heavy workload
because essentially what you’re doing is, you’re saying, this thing that’s green, right now, I need a list of everything
you could possibly trigger on.>>Right, you’re actually reverse
engineering the alert.>>You’re basically taking those
and flipping them backwards, and since every alert
is essentially a query, you now have to, it’s Jeopardy for alerts.>>But you know that it can do it because when you’re building an alert, it tells you this alert’s
gonna trigger on 2700 items, so you know it’s possible. Okay, so the fifth question.>>Aha, but that’s not the way it works because you said four questions.>>I know.>>We always say four questions, and we always come up with five.>>Right because when
you’re asking questions it leads to more questions.>>Yes, and answers are
good, but they’re finite.>>Right.>>Questions allow you to explore.>>Questions let you, take you all sorts of interesting places. So the fifth question is
actually an interesting one because it creates better
interactions with people. The fifth question is; what
do you monitor standard? And that’s the one that
if you can answer that, if you’re prepared to answer that. All of a sudden people
are going to really enjoy and appreciate what
monitoring has to offer. Okay, we’re gonna play pretend. I’m gonna be the monitoring engineer, you’re gonna request some monitoring, and we’re gonna see how
this conversation goes.>>Okey dokey. So, okay, so here is my machine,
I need help monitoring it.>>Okay, well, what would
you like me to monitor?>>Um, wait, but you’re
the monitoring engineer, you should already know
what needs to be monitored.>>You’re the one asking for it
so you have to tell me, duh.>>Okay, all right, so,
whatever you monitor standard is fine, I’m sure.>>Anything else? I mean that’s a really
important system you have.>>Well, yeah, but, what standard?>>Oh, come on, we’ve
been doing this for you for like three years now. I mean, there’s, WMI, there’s IPSLA, there’s NetFlow, there’s WMI. And at this point you’re
probably wanting to punch me in the face because, like, you don’t know.>>Yeah, and we’ve had this
conversation with people before at previous places and
you see their eyes glaze.>>Exactly.>>And you just know you’ve lost them. So we need to change the way
we’re asking those questions.>>Okay, we’re gonna try that
same conversation again, but this time, I’m able
to provide you with stuff. So again, I’m the monitoring engineer, you’re requesting it.>>Okay.
>>Ready?>>Good, I got this
system I need to monitor.>>Wonderful, now, here is a list of everything that we monitor standard. Is there anything else that
you’d like that we missed?>>Yeah, but I don’t
want all of these alerts coming through all the time.>>Okay, and see, this is
where you have the chance to explain that monitoring
isn’t the same thing as alerting, right. So we go over that conversation.>>Okay, so we have that
conversation back and forth. And then we go, “Okay, cool.” So, “Cool, I wanna monitor all that stuff, “but what about my applications?”>>”Oh, well, here’s everything “that we monitor for applications.”>>”Okay, same deal applies though.”>>”Yes.”>>”I can monitor all this stuff, “but I don’t have to get
alerted for all of them?”>>”Correct.”>>Okay, and that’s the
conversation we should be having with people over and over.
>>Exactly, is there anything additional and so on.>>Yeah.>>See, how much easier it is.>>It is easy, but you’ve
got to be prepared for it.>>Exactly. [mellow jazz music]
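Returning to the fourth question, "What will alert for my systems?", the idea of flipping each alert backwards can be sketched in Python. This is not the script the speakers mention, which is not shown in the conversation. It assumes that each alert's scope can be written as a predicate over an inventory of systems, and every name here (`Node`, `AlertDefinition`, `could_fire`, `worst_case_hours`) is hypothetical. The scope test deliberately ignores current status, so it still answers the question while everything is green.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Node:
    """A monitored system. Field names are illustrative assumptions."""
    name: str
    owner_team: str
    kind: str


@dataclass(frozen=True)
class AlertDefinition:
    """An alert reduced to its scope and the effort to handle one firing."""
    name: str
    in_scope: Callable[[Node], bool]  # scope only, ignores current status
    minutes_to_handle: float


def could_fire(alerts, nodes):
    """Map each alert name to the systems it could ever trigger on."""
    result = {}
    for alert in alerts:
        matched = [node.name for node in nodes if alert.in_scope(node)]
        if matched:
            result[alert.name] = matched
    return result


def worst_case_hours(alert, nodes):
    """Unplanned work, in hours, if the alert fired once on every system in scope."""
    in_scope = sum(1 for node in nodes if alert.in_scope(node))
    return in_scope * alert.minutes_to_handle / 60
```

With these definitions, an alert that takes half an hour to handle and is scoped to 2,700 systems has a worst case of 2700 × 0.5 = 1,350 hours of unplanned work, in line with the rough figure quoted in the conversation.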
