LinkedIn today provided a glimpse into Nurse, software its engineers built to automate the resolution of IT infrastructure issues. The goal is to make operations staff at the business-oriented social networking company more efficient.
“The design of Nurse is simple. Nurse acts as a broker between many systems,” LinkedIn site reliability engineer Brian Cory Sherwin wrote in a blog post on the tool. “Our monitoring system posts requests for remediation workflows to our remediation broker. We’ve implemented integrations with our code deployment system, ticketing system, remote execution system, and virtually any other system we can write integration into. We allow our site reliability engineers and operations engineers to combine any number of workflow actions into the systems we provide integrations for. These actions are the steps our engineers would perform to resolve the alert.”
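LinkedIn's post does not include Nurse's code, but the broker pattern Sherwin describes can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `RemediationBroker` class, its method names, the alert fields, and the two sample actions are hypothetical and are not taken from Nurse itself. The sketch only shows the general idea of a monitoring system posting an alert to a broker, which then runs the ordered actions engineers have combined into a workflow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: an action takes an alert and returns a short result string.
# Real integrations (deployment, ticketing, remote execution) would do actual work.
Action = Callable[[dict], str]


@dataclass
class RemediationBroker:
    """Illustrative broker: maps alert types to ordered lists of named actions."""
    actions: Dict[str, Action] = field(default_factory=dict)
    workflows: Dict[str, List[str]] = field(default_factory=dict)

    def register_action(self, name: str, action: Action) -> None:
        # An "integration" here is just a named callable.
        self.actions[name] = action

    def define_workflow(self, alert_type: str, action_names: List[str]) -> None:
        # Reject workflows that reference actions nobody has registered.
        missing = [n for n in action_names if n not in self.actions]
        if missing:
            raise KeyError(f"unknown actions: {missing}")
        self.workflows[alert_type] = list(action_names)

    def handle_alert(self, alert: dict) -> List[str]:
        # Run each step in order; an alert with no workflow yields no steps.
        steps = self.workflows.get(alert["type"], [])
        return [self.actions[name](alert) for name in steps]


# Stand-in actions (assumptions): they only describe what they would do.
def restart_service(alert: dict) -> str:
    return f"restart {alert['service']} on {alert['host']}"


def open_ticket(alert: dict) -> str:
    return f"ticket opened for {alert['type']} on {alert['host']}"


broker = RemediationBroker()
broker.register_action("restart_service", restart_service)
broker.register_action("open_ticket", open_ticket)
broker.define_workflow("service_down", ["restart_service", "open_ticket"])

# The monitoring system would post something like this to the broker.
results = broker.handle_alert(
    {"type": "service_down", "service": "web", "host": "host-01"}
)
```

In this sketch the broker knows nothing about how an action is carried out, which mirrors the description of Nurse as a go-between: new systems can be supported by registering another action rather than changing the broker.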
Other companies have sought to automate their operations work. Facebook has built a tool called FBAR. IBM in the past has talked about “autonomic computing,” and startup StackStorm uses the slogan “self-driving data center.”
Like Facebook and other web companies, LinkedIn, with 364 million members, has built its own tools for a wide range of purposes. Sherwin did not say whether LinkedIn will open-source Nurse, but the company has done so before with tools it built in-house.
LinkedIn started using Nurse in April 2014, and it proved useful in beta testing.
“A significant power disruption occurred and took many servers offline,” Sherwin wrote. “One team had converted most of their monitoring to Nurse and was able to restore their entire stack within minutes of power restoration. The other impacted teams had to identify the servers through monitoring and issue the restoration commands manually.”
As a result, automation is starting to change the work LinkedIn's engineers do.
“The auto-remediation system saves time and gives our team the opportunity to build new skills and explore new roles,” Sherwin wrote. “For some, they can transition into site reliability engineers; for others, they’ll create new opportunities around in-depth site health and troubleshooting.”