Recruiting the Talent. Staffing the Culture. Call Us: 847-945-7600

Senior Site Reliability Engineer

Location : remote
Job Type : Direct
Hours : Full Time
Required Years of Experience : 8+
Required Education : Bachelor's Degree
Travel : No
Relocation : No

Job Description :

We are a tech company operating a thriving, growing broadcast platform. The site is Alexa-ranked in the top 100 internationally and the top 25 in the United States, with approximately 10 million daily users and a worldwide community of fans. Independent broadcasters use our platform to create and share live streaming video, photographs, and similar content. The content is generally adult in nature, but adult content is not required.

Site stats you will improve:



  • 728+ Nvidia P100/T4 GPUs

  • 32k+ physical cores across 24 carrier hotels, with 6 Tbps capacity

  • 10k+ concurrent live video broadcasts

  • 400k+ concurrent live video streams

  • 26B+ weekly web requests

  • 95% of web requests completed in 59ms-72ms

  • 2M database queries per minute, average response 3.5ms

  • Redis clusters handling 300k+ commands/sec



What you will do:



  • Performance analysis to identify sources of instability, using data from APM and distributed telemetry tools

  • Analysis of complex systems to identify operational surprises and minimize downtime

  • Software engineering and patching to incrementally improve performance, scalability, and reliability

  • Infrastructure modifications in both a data center metal environment with advanced routing/switching and in the public cloud

  • Predictive failure analysis and disaster planning

  • Authoring new tools and automation to streamline the DevOps pipeline

  • Collaboration with Frontend/Backend engineering, QA, DevSecOps, and Data teams

  • Database and key-value store administration and configuration with a focus on uptime and performance

  • Incident response and postmortem reports



Required Qualifications :

What you bring:



  • STEM degree and relevant experience as a Site Reliability Engineer

  • Exceptional problem solving skills

  • High proficiency in at least one language such as C, C++, Java, Python, or Go

  • High proficiency in Unix/Linux environments, with excellent knowledge of internals (e.g., filesystems, system calls)

  • Networking knowledge (e.g., routing, switching, TCP stack) for both metal and cloud (VPC, Security Groups) environments

  • Experience in database administration and configuration

  • Experience with DevOps tools such as Ansible, Docker, and Kubernetes

  • On-call availability to respond to monitoring and alerting for core website functions, as needed

  • Experience in growing data center teams (nice to have)

