The unprecedented growth of machine learning is already disrupting the way industry runs. This growth is fueled by data: we currently produce about 2.5 quintillion bytes of data every day. However, data holds little value for a business unless it is put to use. Data not only aids decision-making but also directly affects the bottom line, since data-backed decisions bring financial gain. Research published in Harvard Business Review found that businesses in the top third of their domain in applying data-driven decisions are, on average, 5 percent more productive and 6 percent more profitable than their competitors.
Given the immense value of machine learning, there is huge demand for competent machine learning professionals. According to a LinkedIn report on the fastest-growing jobs in the US, based on the site’s data from 2012 to 2017, the top two spots went to machine learning jobs, which grew 9.8X over those five years, and data scientist jobs, which grew 6.5X since 2012.
While the job market continues to grow, the supply of machine learning talent has not kept pace with demand. Hence, it is worth looking at the skills someone needs to master in order to become a machine learning engineer.
Since JobsPikr extracts job data from some of the popular job boards, we selected the job listings posted on Dice.com in the last 3 months. The next step was to filter for job ads with “machine learning” as the job title. This gave us a data set of more than 11,000 machine learning job listings in the US region.
To analyze the skills required for this role, we extracted the terms present in the “job requirement” section of each job ad.
We then counted the occurrences of each skill term and calculated the percentage of job listings in which each skill appears. The chart below shows the key skills found in the job ads for “machine learning”:
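The counting step can be sketched roughly as follows. This is a minimal illustration, not the pipeline we actually ran: the skill list, the sample requirement texts, and the function name `skill_percentages` are all hypothetical, and it assumes each listing is counted at most once per skill.

```python
import re
from collections import Counter

# Hypothetical skill terms and sample "job requirement" texts (illustration only).
SKILLS = ["python", "java", "sql", "spark"]
LISTINGS = [
    "Strong Python and SQL skills required",
    "Experience with Java, Spark and Python",
    "JavaScript and SQL",
    "Python, TensorFlow",
]


def skill_percentages(requirements, skills):
    """Return, for each skill, the percentage of listings that mention it."""
    counts = Counter()
    for text in requirements:
        # Tokenize into whole words so that "java" does not match "javascript".
        tokens = set(re.findall(r"[a-z0-9+#]+", text.lower()))
        for skill in skills:
            if skill in tokens:
                counts[skill] += 1  # each listing counts once per skill
    total = len(requirements)
    if total == 0:
        return {skill: 0.0 for skill in skills}
    return {skill: 100.0 * counts[skill] / total for skill in skills}


percentages = skill_percentages(LISTINGS, SKILLS)
for skill, pct in sorted(percentages.items(), key=lambda kv: -kv[1]):
    print(f"{skill}: {pct:.1f}%")
```

Matching on whole tokens rather than substrings is a deliberate choice here; multi-word skills such as “data mining” would need phrase matching instead.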
Let’s now go through these skills sequentially.
Python has garnered a lot of interest in the last few years as a language of choice for machine learning engineers. Here are the factors that make it popular:
- Open source – free to install and use
- Rich community
- Lower learning curve
- Powerful libraries for data analytics
- Easier integration with databases
Although ‘data’ is not a skill per se, we’ve included it in this list because it took the second spot in terms of occurrence in ‘job requirements’, owing to the presence of terms such as data analysis, data mining, data modelling, and data science. Clearly, machine learning practitioners need to be skilled in various analytical and statistical applications of data, which is directly linked to knowledge of mathematics.
Since Java is a long-established programming language that is widely adopted in the operational analytics space, many enterprises already have systems developed in it. Models are therefore often written in Java because they are then easier to integrate. Apart from that, leading Big Data frameworks and tools such as Hadoop and Hive are written in Java, and Spark runs on the Java Virtual Machine (JVM). Java is also a great choice when it comes to scalability and speed.
C/C++ is also used to write models (just like Java), and it is critical for developing algorithmic extensions for R and Python.
TensorFlow is the open source framework developed by the Google Brain team for machine learning and deep neural network research. Aspiring machine learning engineers who want to work on neural networks should prioritize it, as it has become the go-to framework.
Hadoop has gained massive popularity and has become the de facto open source software for reliable, scalable, distributed computing involving big data analytics.
Structured Query Language (SQL) is also very important for machine learning engineers, as it is the standard language for communicating with relational database management systems (RDBMS). People working in this field need to write both simple and complex queries to select data from tables, and they also need an understanding of different data formats for data management and filtering.
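As a rough illustration of the difference between a simple and a more complex query, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `listings` table, its columns, and the sample rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database with a hypothetical "listings" table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE listings ("
    "id INTEGER PRIMARY KEY, title TEXT, state TEXT, skill TEXT)"
)
rows = [
    ("ML Engineer", "CA", "python"),
    ("Data Scientist", "NY", "python"),
    ("ML Engineer", "CA", "java"),
    ("Research Engineer", "WA", "python"),
]
conn.executemany(
    "INSERT INTO listings (title, state, skill) VALUES (?, ?, ?)", rows
)

# Simple query: select rows from one table using a filter.
simple = conn.execute(
    "SELECT title FROM listings WHERE state = ? ORDER BY id", ("CA",)
).fetchall()

# More complex query: aggregate per skill and keep skills seen more than once.
by_skill = conn.execute(
    "SELECT skill, COUNT(*) AS n FROM listings "
    "GROUP BY skill HAVING COUNT(*) > 1 "
    "ORDER BY n DESC, skill"
).fetchall()

print(simple)
print(by_skill)
conn.close()
```

The `?` placeholders pass values as parameters rather than building the SQL string by hand, which is the usual way to avoid SQL injection.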
Apache Spark is open source and can keep data resident in memory, which speeds up iterative machine learning workloads. In addition, because it is written in Scala, it has received a significant boost in adoption. Its built-in machine learning library, MLlib, is fast and ships with most of the common machine learning and statistical algorithms, which simplifies large-scale ML pipelines. Some examples are the following:
- Summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
- Classification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification
- Collaborative filtering techniques including alternating least squares (ALS)
- Cluster analysis methods including k-means and latent Dirichlet allocation (LDA)
- Dimensionality reduction techniques such as singular value decomposition (SVD) and principal component analysis (PCA)
- Feature extraction and transformation functions
- Optimization algorithms such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
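To give a feel for one of the optimization algorithms listed above, here is a minimal single-machine sketch of stochastic gradient descent fitting a straight line. It is plain Python and is not MLlib's distributed implementation; the function name, the learning rate, the epoch count, and the toy data are all assumptions chosen for illustration.

```python
import random


def sgd_linear_regression(points, lr=0.1, epochs=500, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    data = list(points)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new random order each epoch
        for x, y in data:
            err = (w * x + b) - y  # prediction error for this one sample
            # Gradient of 0.5 * err**2 with respect to w and b.
            w -= lr * err * x
            b -= lr * err
    return w, b


# Toy, noise-free data generated from y = 2x + 1 on x in [0, 1].
points = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(11)]
w, b = sgd_linear_regression(points)
print(f"w = {w:.3f}, b = {b:.3f}")
```

Each update uses a single sample rather than the whole data set, which is what makes the method attractive for very large data sets.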
Natural language processing (NLP) appears to be a major application of ML that companies are working on. This matters because NLP is moving from explicit entity recognition and linking toward implicit entity recognition (using contextual information), automated via ML.
R is a powerful language developed in the early 1990s; it is now widely used for data science, analysis, and statistical computing. It has become very popular because of the following factors:
- Wide range of libraries
- Strong online community
- Open source
- Lower learning curve
ML engineers often need to work on very large data sets, and in a JVM-centric stack they will likely be using Scala. Many high-performance machine learning frameworks are written in Scala owing to its strong concurrency support.
Amazon is ahead of Google and Microsoft when it comes to providing on-demand cloud computing platforms. Amazon Web Services (AWS) has made tremendous progress in offering ML-based solutions to machine learning engineers; for instance, it offers tools that create ML models out of the box and deliver predictions to applications via APIs. Last year, AWS also launched Amazon Comprehend, which addresses NLP problems through language detection, entity detection, sentiment analysis, and topic modelling. Apart from English, it supports French, German, Italian, Portuguese, and Spanish texts.
Although TensorFlow has surged ahead of Caffe as a deep learning framework, primarily because of its programmatic approach to network creation (which makes it easier for people from a programming background), Caffe continues to gain ground because it is more suitable for computationally constrained platforms such as mobile phones.
Although MATLAB is not as popular as R or Python in the analytics space, it still has a lot of traction in academia. It is also worth noting that it is a commercial product with a high cost and good customer support.
Scikit-learn is a free ML library designed specifically for the Python language. It features various classification, regression, and clustering algorithms, including support vector machines, random forests, k-means, and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Given how prevalent Python is in the ML space, it is no surprise that it features in the top 15 skills.
This sums up our overview of the important skills a machine learning practitioner should acquire for better career opportunities. If you would add any other skill, or a reason for learning a particular skill, do share it with us in the comments.