Nov
16
2022
--

How to Generate Test Data for MongoDB With Python

Generate Test Data for MongoDB With Python

Generate Test Data for MongoDB With PythonFor testing purposes, especially if you’re working on a project that uses any database technology to store information, you may need data to try out your project. In that case, you have two options:

  • Find a good dataset (Kaggle) or
  • Use a library like Faker

Through this blog post, you will learn how to generate test data for MongoDB using Faker.

Requirements

Dependencies

Make sure all the dependencies are installed before creating the Python script that will generate the data for your project.

You can create a requirements.txt file with the following content:

pandas
pymongo
tqdm
faker

Once you have created this file, run the following command:

pip install -r requirements.txt

Or if you’re using Anaconda, create an environment.yml file:

name: percona
dependencies:
  - python=3.10
  - pandas
  - pymongo
  - tqdm
  - faker

You can change the Python version as this script has been proven to work with these versions of Python: 3.7, 3.8, 3.9, 3.10, and 3.11.

Run the following statement to configure the project environment:

conda env create -f environment.yml

Fake data with Faker

Faker is a Python library that can be used to generate fake data through properties defined in the package.

from faker import Faker

fake = Faker()
for _ in range(10):
    print(fake.name())

The above code will print ten names, and on each call to method name(), it will produce a random value. The name() is a property of the generator. Every property of this library is called a fake. and there are many of them packaged in providers.

Some providers and properties available in the Faker library include:

You can find more information on bundled and community providers in the documentation.

Creating a Pandas DataFrame

After knowing Faker and its properties, a modules directory needs to be created, and inside the directory, we will create a module named dataframe.py. This module will be imported later into our main script, and this is where we define the method that will generate the data.

from multiprocessing import cpu_count
import pandas as pd
from tqdm import tqdm
from faker import Faker

Multiprocessing is implemented for optimizing the execution time of the script, but this will be explained later. First, you need to import the required libraries:

  • pandas. Data generated with Faker will be stored in a Pandas DataFrame before being imported into the database.
  • tqdm(). Required for adding a progress bar to show the progress of the DataFrame creation.
  • Faker(). It’s the generator from the faker library.
  • cpu_count(). This is a method from the multiprocessing module that will return the number of cores available.
fake = Faker()
num_cores = cpu_count() - 1

Faker() creates and initializes a faker generator, which can generate data by accessing the properties.

num_cores is a variable that stores the value returned after calling the cpu_count() method.

def create_dataframe(arg):
    x = int(60000/num_cores)
    data = pd.DataFrame()
    for i in tqdm(range(x), desc='Creating DataFrame'):
        data.loc[i, 'first_name'] = fake.first_name()
        data.loc[i, 'last_name'] = fake.last_name()
        data.loc[i, 'job'] = fake.job()
        data.loc[i, 'company'] = fake.company()
        data.loc[i, 'address'] = fake.address()
        data.loc[i, 'city'] = fake.city()
        data.loc[i, 'country'] = fake.country()
        data.loc[i, 'email'] = fake.email()
    return data

Then we define the create_dataframe() function, where:

  • x is the variable that will determine the number of iterations of the for loop where the DataFrame is created.
  • data is an empty DataFrame that will later be fulfilled with data generated with Faker.
  • Pandas DataFrame.loc attribute provides access to a group of rows and columns by their label(s). In each iteration, a row of data is added to the DataFrame and this attribute allows assigning values to each column.

The DataFrame that is created after calling this function will have the following columns:

#   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   first_name  60000 non-null  object
 1   last_name   60000 non-null  object
 2   job         60000 non-null  object
 3   company     60000 non-null  object
 4   address     60000 non-null  object
 5   country     60000 non-null  object
 6   city        60000 non-null  object
 7   email       60000 non-null  object

Connection to the database

Before inserting the data previously generated with Faker, we need to establish a connection to the database, and for doing this the PyMongo library will be used.

from pymongo import MongoClient

uri = "mongodb://user:password@localhost:27017/"
client = MongoClient(uri)

From PyMongo, we import the MongoClient() method.

Don’t forget to replace user, password, localhost, and port (27017) with your authentication details, and save this code in the modules directory and name it as base.py.

What is multiprocessing?

Multiprocessing is a Python module that can be used to take advantage of the CPU cores available in the computer where the script is running. In Python, single-CPU use is caused by the global interpreter lock, which allows only one thread to carry the Python interpreter at any given time, for more information see this blog post.

Imagine that you’re generating 60,000 records, running the script in a single core will take more time than you could expect, since each record is generated one by one within the loop. By implementing multiprocessing, the whole process is divided by the number of cores, so that if your CPU has 16 cores, every core will generate 4,000 records, and this is because only 15 cores will be used as we need to leave one available for avoiding freezing the computer.

To understand better how to implement multiprocessing in Python, I recommend the following tutorials:

Generating your data

All the required modules are now ready to be imported into the main script so it’s time to create the mongodb.py script. First, import the required libraries:

from multiprocessing import Pool
from multiprocessing import cpu_count
import pandas as pd

From multiprocessing, Pool() and cpu_count() are required. The Python Multiprocessing Pool class allows you to create and manage process pools in Python.

Then, import the modules previously created:

from modules.dataframe import create_dataframe
from modules.base import client

Now we create the multiprocessing pool configured to use all available CPU cores minus one. Each core will call the create_dataframe() function and create a DataFrame with 4,000 records, and after each call to the function has finished, all the DataFrames created will be concatenated into a single one.

if __name__ == "__main__":
    num_cores = cpu_count() - 1
    with Pool() as pool:
        data = pd.concat(pool.map(create_dataframe, range(num_cores)))
    data_dict = data.to_dict('records')
    db = client["company"]
    collection = db["employees"]
    collection.insert_many(data_dict)

After logging into the MongoDB server, we get the database and the collection where the data will be stored.

And finally, we will insert the DataFrame into MongoDB by calling the insert_many() method. All the data will be stored in a collection named employees.

Run the following statement to populate the database:

python mongodb.py

DataFrame creation with multiprocessing

DataFrame creation with multiprocessing

It will take just a few seconds to generate the DataFrame with the 60,000 records, and that’s why multiprocessing was implemented.

CPU Utilization on PMM

CPU utilization on Percona Monitoring and Management

Once the script finishes, you can check the data in the database.

use company;
db.employees.count()

The count() function returns the number of records in the employees table.

60000

Or you can display the records in the employees table:

db.employees.find().pretty()

{
        "_id" : ObjectId("6363ceeeda5c972cabf558b4"),
        "first_name" : "Sherri",
        "last_name" : "Phelps",
        "job" : "Science writer",
        "company" : "House Inc",
        "address" : "06298 Mejia Streets Suite 742\nRobertland, WY 98585",
        "city" : "Thomasview",
        "country" : "Cote d'Ivoire",
        "email" : "michelle63@hotmail.com"
}

The code shown in this blog post can be found on my GitHub account in the data-generator repository.

Nov
01
2022
--

How to Generate Test Data for MySQL With Python

Generate Test Data for MySQL With PythonFor testing purposes, especially if you’re working on a project that uses any database technology to store information, you may need data to try out your project. In that case, you have two options:

  • Find a good dataset (Kaggle) or,
  • Use a library like Faker

Through this blog post, you will learn how to generate test data for MySQL using Faker.

Requirements

Dependencies

Make sure all the dependencies are installed before creating the Python script that will generate the data for your project.

You can create a requirements.txt file with the following content:

pandas
sqlalchemy
PyMySQL
tqdm
faker

Once you have created this file, run the following command:

pip install -r requirements.txt

Or if you’re using Anaconda, create an environment.yml file:

name: percona
dependencies:
  - python=3.10
  - pandas
  - sqlalchemy
  - PyMySQL
  - tqdm
  - faker

You can change the Python version as this script has been proven to work with these versions of Python: 3.7, 3.8, 3.9, 3.10, and 3.11.

Run the following statement to configure the project environment:

conda env create -f environment.yml

Database

Now that you have the dependencies installed, you must create a database named company.

Log into MySQL:

mysql -u root -p

Or log into MySQL using MySQL Shell:

mysqlsh root@localhost

Replace root with your username, if necessary, and replace localhost with the IP address or URL for your MySQL server instance if needed.

If using MySQL Shell, change to SQL mode:

\sql

and create the company database

create database company;

Fake data with Faker

Faker is a Python library that can be used to generate fake data through properties defined in the package.

from faker import Faker

fake = Faker()
for _ in range(10):
    print(fake.name())

The above code will print ten names, and on each call to method name(), it will produce a random value. The name() is a property of the generator. Every property of this library is called a fake. and there are many of them packaged in providers.

Some providers and properties available in the Faker library include:

You can find more information on bundled and community providers in the documentation.

Creating a Pandas DataFrame

After knowing Faker and its properties, a modules directory needs to be created, and inside the directory, we will create a module named dataframe.py. This module will be imported later into our main script, and this is where we define the method that will generate the data.

from multiprocessing import cpu_count
import pandas as pd
from tqdm import tqdm
from faker import Faker

Multiprocessing is implemented for optimizing the execution time of the script, but this will be explained later. First, you need to import the required libraries:

  • pandas. Data generated with Faker will be stored in a Pandas DataFrame before being imported into the database.
  • tqdm(). Required for adding a progress bar to show the progress of the DataFrame creation.
  • Faker(). It’s the generator from the faker library.
  • cpu_count(). This is a method from the multiprocessing module that will return the number of cores available.
fake = Faker()
num_cores = cpu_count() - 1

Faker() creates and initializes a faker generator, which can generate data by accessing the properties.

num_cores is a variable that stores the value returned after calling the cpu_count() method.

def create_dataframe(arg):
    x = int(60000/num_cores)
    data = pd.DataFrame()
    for i in tqdm(range(x), desc='Creating DataFrame'):
        data.loc[i, 'first_name'] = fake.first_name()
        data.loc[i, 'last_name'] = fake.last_name()
        data.loc[i, 'job'] = fake.job()
        data.loc[i, 'company'] = fake.company()
        data.loc[i, 'address'] = fake.address()
        data.loc[i, 'city'] = fake.city()
        data.loc[i, 'country'] = fake.country()
        data.loc[i, 'email'] = fake.email()
    return data

Then we define the create_dataframe() function, where:

  • x is the variable that will determine the number of iterations of the for loop where the DataFrame is created.
  • data is an empty DataFrame that will later be fulfilled with data generated with Faker.
  • Pandas DataFrame.loc attribute provides access to a group of rows and columns by their label(s). In each iteration, a row of data is added to the DataFrame and this attribute allows assigning values to each column.

The DataFrame that is created after calling this function will have the following columns:

#   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   first_name  60000 non-null  object
 1   last_name   60000 non-null  object
 2   job         60000 non-null  object
 3   company     60000 non-null  object
 4   address     60000 non-null  object
 5   country     60000 non-null  object
 6   city        60000 non-null  object
 7   email       60000 non-null  object

Connection to the database

Before inserting the data previously generated with Faker, we need to establish a connection to the database, and for doing this the SQLAlchemy library will be used.

SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine("mysql+pymysql://user:password@localhost/company")
Session = sessionmaker(bind=engine)

From SQLAlchemy, we import the create_engine() and the sessionmaker() methods. The first one is for connecting to the database, and the second is for creating a session bond to the engine object.

Don’t forget to replace the user, password, and localhost with your authentication details, save this code in the modules directory and name it base.py.

From the documentation, SQLAlchemy uses the mysqlclient library by default, but there are other ones available, including PyMySQL.

# default
engine = create_engine("mysql://scott:tiger@localhost/foo")

# mysqlclient (a maintained fork of MySQL-Python)
engine = create_engine("mysql+mysqldb://scott:tiger@localhost/foo")

# PyMySQL
engine = create_engine("mysql+pymysql://scott:tiger@localhost/foo")

According to the maintainer of both mysqlclient and PyMySQL, mysqlclient-python is much faster than PyMySQL, but you should use PyMySQL if:

  • You can’t use libmysqlclient for some reason
  • You want to use monkeypatched socket of gevent or eventlet
  • You want to hack mysql protocol

Database schema definition

The schema of the database can be created through the Schema Definition Language provided by SQLAlchemy, but as we’re only creating one table and importing the DataFrame by calling Pandas to_sql() method, this is not necessary.

When calling to_sql() method, we specify the schema as follows:

from sqlalchemy.types import *

schema = {
    "first_name": String(50),
    "last_name": String(50),
    "job": String(100),
    "company": String(100),
    "address": String(200),
    "city": String(100),
    "country" String(100),
    "email": String(50)
}

Then we pass the schema variable as a parameter to this method.

Save this code in the modules directory with the name schema.py.

What is multiprocessing?

Multiprocessing is a Python module that can be used to take advantage of the CPU cores available in the computer where the script is running. In Python, single-CPU use is caused by the global interpreter lock, which allows only one thread to carry the Python interpreter at any given time, for more information see this blog post.

Imagine that you’re generating 60,000 records, running the script in a single core will take more time than you could expect, since each record is generated one by one within the loop. By implementing multiprocessing, the whole process is divided by the number of cores, so that if your CPU has 16 cores, every core will generate 4,000 records, and this is because only 15 cores will be used as we need to leave one available for avoiding freezing the computer.

To understand better how to implement multiprocessing in Python, I recommend the following tutorials:

Generating your data

All the required modules are now ready to be imported into the main script so it’s time to create the sql.py script. First, import the required libraries:

from multiprocessing import Pool
from multiprocessing import cpu_count
import pandas as pd

From multiprocessing, Pool() and cpu_count() are required. The Python Multiprocessing Pool class allows you to create and manage process pools in Python.

Then, import the modules previously created:

from modules.dataframe import create_dataframe
from modules.schema import schema
from modules.base import Session, engine

Now we create the multiprocessing pool configured to use all available CPU cores minus one. Each core will call the create_dataframe() function and create a DataFrame with 4,000 records, and after each call to the function has finished, all the DataFrames created will be concatenated into a single one.

if __name__ == "__main__":
    num_cores = cpu_count() - 1
    with Pool() as pool:
        data = pd.concat(pool.map(create_dataframe, range(num_cores)))
    data.to_sql(name='employees', con=engine, if_exists = 'append', index=False, dtype=schema)
    with engine.connect() as conn:
        conn.execute("ALTER TABLE employees ADD id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;")

And finally, we will insert the DataFrame into MySQL by calling the to_sql() method. All the data will be stored in a table named employees.

By calling conn.execute(), a new column named id will be added to the table, set as the primary key, and placed at the beginning.

Run the following statement to populate the table:

python sql.py

DataFrame creation with multiprocessing

DataFrame creation with multiprocessing

It will take just a few seconds to generate the DataFrame with the 60,000 records, and that’s why multiprocessing was implemented.

CPU Utilization on PMM

CPU utilization on Percona Monitoring and Management

Once the script finishes, you can check the data in the database.

use company;
select count(*) from employees;

The count() function returns the number of records in the employees table.

+----------+
| count(*) |
+----------+
|    60000 |
+----------+
1 row in set (0.22 sec)

The code shown in this blog post can be found on my GitHub account in the data-generator repository.

Aug
25
2021
--

Bodo.ai secures $14M, aims to make Python better at handling large-scale data

Bodo.ai, a parallel compute platform for data workloads, is developing a compiler to make Python portable and efficient across multiple hardware platforms. It announced Wednesday a $14 million Series A funding round led by Dell Technologies Capital.

Python is one of the top programming languages used among artificial intelligence and machine learning developers and data scientists, but as Behzad Nasre, co-founder and CEO of Bodo.ai, points out, it is challenging to use when handling large-scale data.

Bodo.ai, headquartered in San Francisco, was founded in 2019 by Nasre and Ehsan Totoni, CTO, to make Python higher performing and production ready. Nasre, who had a long career at Intel before starting Bodo, met Totoni and learned about the project that he was working on to democratize machine learning and enable parallel learning for everyone. Parallelization is the only way to extend Moore’s Law, Nasre told TechCrunch.

Bodo does this via a compiler technology that automates the parallelization so that data and ML developers don’t have to use new libraries, APIs or rewrite Python into other programming languages or graphics processing unit code to achieve scalability. Its technology is being used to make data analytics tools in real time and is being used across industries like financial, telecommunications, retail and manufacturing.

“For the AI revolution to happen, developers have to be able to write code in simple Python, and that high-performance capability will open new doors,” Totoni said. “Right now, they rely on specialists to rewrite them, and that is not efficient.”

Joining Dell in the round were Uncorrelated Ventures, Fusion Fund and Candou Ventures. Including the new funding, Bodo has raised $14 million in total. The company went after Series A dollars after its product had matured and there was good traction with customers, prompting Bodo to want to scale quicker, Nasre said.

Nasre feels Dell Technologies Capital was “uniquely positioned to help us in terms of reserves and the role they play in the enterprise at large, which is to have the most effective salesforce in enterprise.”

Though he was already familiar with Nasre, Daniel Docter, managing director at Dell Technologies, heard about Bodo from a data scientist friend who told Docter that Bodo’s preliminary results “were amazing.”

Much of Dell’s investments are in the early-stage and in deep tech founders that understand the problem. Docter puts Totoni and Nasre in that category.

“Ehsan fits this perfectly, he has super deep technology knowledge and went out specifically to solve the problem,” he added. “Behzad, being from Intel, saw and lived with the problem, especially seeing Hadoop fail and Spark take its place.”

Meanwhile, with the new funding, Nasre intends to triple the size of the team and invest in R&D to build and scale the company. It will also be developing a marketing and sales team.

The company is now shifting from financing to customer- and revenue-focused as it aims to drive up adoption by the Python community.

“Our technology can translate simple code into the fast code that the experts will try,” Totoni said. “I joined Intel Labs to work on the problem, and we think we have the first solution that will democratize machine learning for developers and data scientists. Now, they have to hand over Python code to specialists who rewrite it for tools. Bodo is a new type of compiler technology that democratizes AI.”

 

Apr
06
2021
--

Okta launches a new free developer plan

At its Oktane21 conference, Okta, the popular authentication and identity platform, today announced a new — and free — developer edition that features fewer limitations and support for significantly more monthly active users than its current free plan.

The new ‘Okta Starter Developer Edition,’ as it’s called, allows developers to scale up to 15,000 monthly active users — up from only 1,000 on its existing free plan. In addition, the company is also launching enhanced documentation, a set of sample apps and new SDKs, which now cover languages and frameworks like Go, Java, JavaScript, Python, Vue.js, React Native and Spring Boot.

“Our overall philosophy isn’t, ‘we want to just provide […] a set of authentication and authorization services.’ The way we’re looking at this is, ‘hey, app developer, how do we provide you the foundation you need to get up and running quickly with authorization and authentication as one part of it,’ ” Diya Jolly, Okta’s chief product officer, told me. And she believes that Okta is in a unique position to do so, because it doesn’t only offer tools to manage authorization and access, but also systems for securing microservices and providing applications with access to privileged resources.

Image Credits: Okta

It’s also worth noting that, while the deal hasn’t closed yet, Okta’s intent to acquire Auth0 significantly extends its developer strategy, given Auth0’s developer-first approach.

As for the expanded free account, Jolly noted that the company found that developers wanted to be able to access more of the service’s features during their prototyping phases. That means the new free Developer Edition comes with support for multi-factor authentication, machine-to-machine tokens and B2B integrations, for example, in addition to expanded support for integrations into toolchains. As is so often the case with enterprise tools, the free edition doesn’t come with the usual enterprise support options and has lower rate limits than the paid plans.

Still, and Jolly acknowledged this, a small to medium-sized business may be able to build applications and take them into production based on this new free plan.

“15K [monthly active users] is is a lot, but if you look at our customer base, it’s about the right amount for the smaller business applications, the real SMBs, and that was the goal. In a developer motion, you want people to try out things and then upgrade. I think that’s the key. No developer is going to come and build with you if you don’t have a free offering that they can tinker around and play with.”

Image Credits: Okta

She noted that the company has spent a lot of time thinking about how to support developers through the application development lifecycle overall. That includes better CLI tools for developers who would rather bypass Okta’s web-based console, for example, and additional integrations with tools like Terraform, Kong and Heroku. “Today, [developers] have to stitch together identity and Okta into those experiences — or they use some other identity — we’ve pre-stitched all of this for them,” Jolly said.

The new Okta Starter Developer Edition, as well as the new documentation, sample applications and integrations, are now available at developer.okta.com.

Dec
02
2020
--

Fylamynt raises $6.5M for its cloud workflow automation platform

Fylamynt, a new service that helps businesses automate their cloud workflows, today announced both the official launch of its platform as well as a $6.5 million seed round. The funding round was led by Google’s AI-focused Gradient Ventures fund. Mango Capital and Point72 Ventures also participated.

At first glance, the idea behind Fylamynt may sound familiar. Workflow automation has become a pretty competitive space, after all, and the service helps developers connect their various cloud tools to create repeatable workflows. We’re not talking about your standard IFTTT- or Zapier -like integrations between SaaS products, though. The focus of Fylamynt is squarely on building infrastructure workflows. While that may sound familiar, too, with tools like Ansible and Terraform automating a lot of that already, Fylamynt sits on top of those and integrates with them.

Image Credits: Fylamynt

“Some time ago, we used to do Bash and scripting — and then [ … ] came Chef and Puppet in 2006, 2007. SaltStack, as well. Then Terraform and Ansible,” Fylamynt co-founder and CEO Pradeep Padala told me. “They have all done an extremely good job of making it easier to simplify infrastructure operations so you don’t have to write low-level code. You can write a slightly higher-level language. We are not replacing that. What we are doing is connecting that code.”

So if you have a Terraform template, an Ansible playbook and maybe a Python script, you can now use Fylamynt to connect those. In the end, Fylamynt becomes the orchestration engine to run all of your infrastructure code — and then allows you to connect all of that to the likes of DataDog, Splunk, PagerDuty Slack and ServiceNow.

Image Credits: Fylamynt

The service currently connects to Terraform, Ansible, Datadog, Jira, Slack, Instance, CloudWatch, CloudFormation and your Kubernetes clusters. The company notes that some of the standard use cases for its service are automated remediation, governance and compliance, as well as cost and performance management.

The company is already working with a number of design partners, including Snowflake.

Fylamynt CEO Padala has quite a bit of experience in the infrastructure space. He co-founded ContainerX, an early container-management platform, which later sold to Cisco. Before starting ContainerX, he was at VMWare and DOCOMO Labs. His co-founders, VP of Engineering Xiaoyun Zhu and CTO David Lee, also have deep expertise in building out cloud infrastructure and operating it.

“If you look at any company — any company building a product — let’s say a SaaS product, and they want to run their operations, infrastructure operations very efficiently,” Padala said. “But there are always challenges. You need a lot of people, it takes time. So what is the bottleneck? If you ask that question and dig deeper, you’ll find that there is one bottleneck for automation: that’s code. Someone has to write code to automate. Everything revolves around that.”

Fylamynt aims to take the effort out of that by allowing developers to either write Python and JSON to automate their workflows (think “infrastructure as code” but for workflows) or to use Fylamynt’s visual no-code drag-and-drop tool. As Padala noted, this gives developers a lot of flexibility in how they want to use the service. If you never want to see the Fylamynt UI, you can go about your merry coding ways, but chances are the UI will allow you to get everything done as well.

One area the team is currently focusing on — and will use the new funding for — is building out its analytics capabilities that can help developers debug their workflows. The service already provides log and audit trails, but the plan is to expand its AI capabilities to also recommend the right workflows based on the alerts you are getting.

“The eventual goal is to help people automate any service and connect any code. That’s the holy grail. And AI is an enabler in that,” Padala said.

Gradient Ventures partner Muzzammil “MZ” Zaveri echoed this. “Fylamynt is at the intersection of applied AI and workflow automation,” he said. “We’re excited to support the Fylamynt team in this uniquely positioned product with a deep bench of integrations and a nonprescriptive builder approach. The vision of automating every part of a cloud workflow is just the beginning.”

The team, which now includes about 20 employees, plans to use the new round of funding, which closed in September, to focus on its R&D, build out its product and expand its go-to-market team. On the product side, that specifically means building more connectors.

The company offers both a free plan as well as enterprise pricing and its platform is now generally available.

Nov
12
2020
--

Databricks launches SQL Analytics

AI and data analytics company Databricks today announced the launch of SQL Analytics, a new service that makes it easier for data analysts to run their standard SQL queries directly on data lakes. And with that, enterprises can now easily connect their business intelligence tools like Tableau and Microsoft’s Power BI to these data repositories as well.

SQL Analytics will be available in public preview on November 18.

In many ways, SQL Analytics is the product Databricks has long been looking to build and that brings its concept of a “lake house” to life. It combines the performance of a data warehouse, where you store data after it has already been transformed and cleaned, with a data lake, where you store all of your data in its raw form. The data in the data lake, a concept that Databricks’ co-founder and CEO Ali Ghodsi has long championed, is typically only transformed when it gets used. That makes data lakes cheaper, but also a bit harder to handle for users.

Image Credits: Databricks

“We’ve been saying Unified Data Analytics, which means unify the data with the analytics. So data processing and analytics, those two should be merged. But no one picked that up,” Ghodsi told me. But “lake house” caught on as a term.

“Databricks has always offered data science, machine learning. We’ve talked about that for years. And with Spark, we provide the data processing capability. You can do [extract, transform, load]. That has always been possible. SQL Analytics enables you to now do the data warehousing workloads directly, and concretely, the business intelligence and reporting workloads, directly on the data lake.”

The general idea here is that with just one copy of the data, you can enable both traditional data analyst use cases (think BI) and the data science workloads (think AI) Databricks was already known for. Ideally, that makes both use cases cheaper and simpler.

The service sits on top of an optimized version of Databricks’ open-source Delta Lake storage layer to enable the service to quickly complete queries. In addition, Delta Lake also provides auto-scaling endpoints to keep the query latency consistent, even under high loads.

While data analysts can query these data sets directly, using standard SQL, the company also built a set of connectors to BI tools. Its BI partners include Tableau, Qlik, Looker and Thoughtspot, as well as ingest partners like Fivetran, Fishtown Analytics, Talend and Matillion.

Image Credits: Databricks

“Now more than ever, organizations need a data strategy that enables speed and agility to be adaptable,” said Francois Ajenstat, chief product officer at Tableau. “As organizations are rapidly moving their data to the cloud, we’re seeing growing interest in doing analytics on the data lake. The introduction of SQL Analytics delivers an entirely new experience for customers to tap into insights from massive volumes of data with the performance, reliability and scale they need.”

In a demo, Ghodsi showed me what the new SQL Analytics workspace looks like. It’s essentially a stripped-down version of the standard code-heavy experience with which Databricks users are familiar. Unsurprisingly, SQL Analytics provides a more graphical experience that focuses more on visualizations and not Python code.

While there are already some data analysts on the Databricks platform, this obviously opens up a large new market for the company — something that would surely bolster its plans for an IPO next year.

Oct
14
2020
--

Dataloop raises $11M Series A round for its AI data management platform

Dataloop, a Tel Aviv-based startup that specializes in helping businesses manage the entire data life cycle for their AI projects, including helping them annotate their data sets, today announced that it has now raised a total of $16 million. This includes a $5 seed round that was previously unreported, as well as an $11 million Series A round that recently closed.

The Series A round was led by Amiti Ventures, with participation from F2 Venture Capital, crowdfunding platform OurCrowd, NextLeap Ventures and SeedIL Ventures.

“Many organizations continue to struggle with moving their AI and ML projects into production as a result of data labeling limitations and a lack of real-time validation that can only be achieved with human input into the system,” said Dataloop CEO Eran Shlomo. “With this investment, we are committed, along with our partners, to overcoming these roadblocks and providing next generation data management tools that will transform the AI industry and meet the rising demand for innovation in global markets.”

Image Credits: Dataloop

For the most part, Dataloop specializes in helping businesses manage and annotate their visual data. It’s agnostic to the vertical its customers are in, but we’re talking about anything from robotics and drones to retail and autonomous driving.

The platform itself centers around the “humans in the loop” model that complements the automated systems, with the ability for humans to train and correct the model as needed. It combines the hosted annotation platform with a Python SDK and REST API for developers, as well as a serverless Functions-as-a-Service environment that runs on top of a Kubernetes cluster for automating dataflows.

Image Credits: Dataloop

The company was founded in 2017. It’ll use the new funding to grow its presence in the U.S. and European markets, something that’s pretty standard for Israeli startups, and build out its engineering team as well.

Apr
21
2020
--

Pulumi brings support for more languages to its infrastructure-as-code platform

Seattle-based Pulumi has quickly made a name for itself as a modern platform that lets developers specify their infrastructure through writing code in their preferred programming language — and not YAML. With the launch of Pulumi 2.0, those languages now include JavaScript, TypeScript, Go and .NET, in addition to its original support for Python. It’s also now extending its reach beyond its core infrastructure features to include deeper support for policy enforcement, testing and more.

As the company also today announced, it now has over 10,000 users and more than 100 paying customers. With that, it’s seeing a 10x increase in its year-over-year annual run rate, though without knowing the exact numbers, it’s obviously hard to know what exactly to make of that number. Current customers include the likes of Cockroach Labs, Mercedes-Benz and Tableau .

When the company first launched, its messaging was very much around containers and serverless. But as Pulumi founder and CEO Joe Duffy told me, today the company is often directly engaging with infrastructure teams that are building the platforms for the engineers in their respective companies.

As for Pulumi 2.0, Duffy says that “this is really taking the original Pulumi vision of infrastructure as code — using your favorite language — and augmenting it with what we’re calling superpowers.” That includes expanding the product’s overall capabilities from infrastructure provisioning to the adjacent problem spaces. That includes continuous delivery, but also policy-as-code. This extends the original Pulumi vision beyond just infrastructure but now also lets developers encapsulate their various infrastructure policies as code, as well.

Another area is testing. Because Pulumi allows developers to use “real” programming languages, they can also use the same testing techniques they are used to from the application development world to test the code they use to build their underlying infrastructure and catch mistakes before they go into production. And with all of that, developers can also use all of the usual tools they use to write code for defining the infrastructure that this code will then run on.

“The underlying philosophy is taking our heritage of using the best of what we know and love about programming languages — and really applying that to the entire spectrum of challenges people face when it comes to cloud infrastructure, from development to infrastructure teams to security engineers, really helping the entire organization be more productive working together,” said Duffy. “I think that’s the key: moving from infrastructure provisioning to something that works for the whole organization.”

Duffy also highlighted that many of the company’s larger enterprise users are relying on Pulumi to encode their own internal architectures as code and then roll them out across the company.

“We still embrace what makes each of the clouds special. AWS, Azure, Google Cloud and Kubernetes,” Duffy said. “We’re not trying to be a PaaS that abstracts over all. We’re just helping to be the consistent workflow across the entire team to help people adopt the modern approaches.”

Apr
18
2019
--

MySQL-python: Adding caching_sha2_password and TLSv1.2 Support

python not connecting to MySQL

python not connecting to MySQLPython 2 reaches EOL on 2020-01-01 and one of its commonly used third-party packages is MySQL-python. If you have not yet migrated away from both of these, since MySQL-python does not support Python 3, then you may have come across some issues if you are using more recent versions of MySQL and are enforcing a secured installation. This post will look at two specific issues that you may come across (caching_sha2_password in MySQL 8.0 and TLSv1.2 with MySQL >=5.7.10 when using OpenSSL) that will prevent you from connecting to a MySQL database and buy you a little extra time to migrate code.

For CentOS 7, MySQL-python is built against the client library provided by the mysql-devel package, which does not support some of the newer features, such as caching_sha2_password (the new default authentication plugin as of MySQL 8.0.4) and TLSv1.2. This means that you cannot take advantage of these security features and therefore leave yourself with an increased level of vulnerability.

So, what can we do about this? Thankfully, it is very simple to resolve, so long as you don’t mind getting your hands dirty!

Help! MySQL-python won’t connect to my MySQL 8.0 instance

First of all, let’s take a look at the issues that the latest version of MySQL-python (MySQL-python-1.2.5-1.el7.rpm for CentOS 7) has when connecting to a MySQL 8.0 instance. We will use a Docker container to help test this out by installing MySQL-python along with the Percona Server 8.0 client so that we can compare the two.

=> docker run --rm -it --network host --name rpm-build centos:latest
# yum install -y -q MySQL-python
warning: /var/cache/yum/x86_64/7/base/packages/MySQL-python-1.2.5-1.el7.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID f4a80eb5: NOKEY
Public key for MySQL-python-1.2.5-1.el7.x86_64.rpm is not installed
Importing GPG key 0xF4A80EB5:
 Userid     : "CentOS-7 Key (CentOS 7 Official Signing Key) <security@centos.org>"
 Fingerprint: 6341 ab27 53d7 8a78 a7c2 7bb1 24c6 a8a7 f4a8 0eb5
 Package    : centos-release-7-6.1810.2.el7.centos.x86_64 (@CentOS)
 From       : /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
# yum install -q -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm; percona-release setup ps80
* Enabling the Percona Original repository
<*> All done!
The percona-release package now contains a percona-release script that can enable additional repositories for our newer products.
For example, to enable the Percona Server 8.0 repository use:
  percona-release setup ps80
Note: To avoid conflicts with older product versions, the percona-release setup command may disable our original repository for some products.
For more information, please visit:
  https://www.percona.com/doc/percona-repo-config/percona-release.html
* Disabling all Percona Repositories
* Enabling the Percona Server 8.0 repository
* Enabling the Percona Tools repository
<*> All done!
# yum install -q -y percona-server-client
warning: /var/cache/yum/x86_64/7/ps-80-release-x86_64/packages/percona-server-shared-8.0.15-5.1.el7.x86_64.rpm: Header V4 RSA/SHA256 Signature, key ID 8507efa5: NOKEY
Public key for percona-server-shared-8.0.15-5.1.el7.x86_64.rpm is not installed
Importing GPG key 0x8507EFA5:
 Userid     : "Percona MySQL Development Team (Packaging key) <mysql-dev@percona.com>"
 Fingerprint: 4d1b b29d 63d9 8e42 2b21 13b1 9334 a25f 8507 efa5
 Package    : percona-release-1.0-11.noarch (installed)
 From       : /etc/pki/rpm-gpg/PERCONA-PACKAGING-KEY

Next we will create a client config to save some typing later on and then check that we can connect to the MySQL instance that is already running; if you don’t have MySQL running on your local machine then you can install it in the container.

# cat <<EOS > /root/.my.cnf
> [client]
> user = testme
> password = my-secret-pw
> host = 127.0.0.1
> protocol = tcp
> ssl-mode = REQUIRED
> EOS
mysql> /* hide passwords */ PAGER sed "s/AS '.*' REQUIRE/AS 'xxx' REQUIRE/" ;
PAGER set to 'sed "s/AS '.*' REQUIRE/AS 'xxx' REQUIRE/"'
mysql> SHOW CREATE USER CURRENT_USER();
CREATE USER 'testme'@'%' IDENTIFIED WITH 'caching_sha2_password' AS 'xxx' REQUIRE SSL PASSWORD EXPIRE DEFAULT ACCOUNT UNLOCK PASSWORD HISTORY DEFAULT PASSWORD REUSE INTERVAL DEFAULT
mysql> SELECT @@version, @@version_comment, ssl_version, ssl_cipher, user FROM sys.session_ssl_status INNER JOIN sys.processlist ON thread_id = thd_id AND conn_id = CONNECTION_ID();
8.0.12-1        Percona Server (GPL), Release '1', Revision 'b072566'   TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256     testme@localhost
All looks good here, so now we will check that MySQLdb can also connect:
# /usr/bin/python -c "import MySQLdb; dbc=MySQLdb.connect(read_default_file='~/.my.cnf').cursor(); dbc.execute('select @@version, @@version_comment, ssl_version, ssl_cipher, user from sys.session_ssl_status inner join sys.processlist on thread_id = thd_id and conn_id = connection_id()'); print dbc.fetchall()"
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib64/python2.7/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 193, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (2059, "Authentication plugin 'caching_sha2_password' cannot be loaded: /usr/lib64/mysql/plugin/caching_sha2_password.so: cannot open shared object file: No such file or directory")

Changing the user’s authentication plugin

We have hit the first issue, because MySQL 8.0 introduced the caching_sha2_password plugin and made it the default authentication plugin, so we can’t connect at all. However, we can gain access by changing the grants for the user to use the old default plugin and then test again.

mysql> ALTER USER 'testme'@'%' IDENTIFIED WITH mysql_native_password BY "my-secret-pw";
# /usr/bin/python -c "import MySQLdb; dbc=MySQLdb.connect(read_default_file='~/.my.cnf').cursor(); dbc.execute('select @@version, @@version_comment, ssl_version, ssl_cipher, user from sys.session_ssl_status inner join sys.processlist on thread_id = thd_id and conn_id = connection_id()'); print dbc.fetchall()"
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib64/python2.7/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 193, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (1045, "Access denied for user 'testme'@'localhost' (using password: YES)")

Configuring SSL options

We still can’t connect, so what can be wrong? Well, we haven’t added any SSL details to the config other than specifying that we need to use SSL, so we will add the necessary options and make sure that we can connect.

# cat <<EOS >> /root/.my.cnf
> ssl-ca = /root/certs/ca.pem
> ssl-cert = /root/certs/client-cert.pem
> ssl-key = /root/certs/client-key.pem
> EOS
# mysql -Bse "select 1"
1
# /usr/bin/python -c "import MySQLdb; dbc=MySQLdb.connect(read_default_file='~/.my.cnf').cursor(); dbc.execute('select @@version, @@version_comment, ssl_version, ssl_cipher, user from sys.session_ssl_status inner join sys.processlist on thread_id = thd_id and conn_id = connection_id()'); print dbc.fetchall()"
(('8.0.12-1', "Percona Server (GPL), Release '1', Revision 'b072566'", 'TLSv1', 'ECDHE-RSA-AES256-SHA', 'testme@localhost'),)

Forcing TLSv1.2 or later to further secure connections

That looks much better now, but if you look closely then you will notice that the MySQLdb connection is using TLSv1, which will make your security team either sad, angry or perhaps both as the connection can be downgraded! We want to secure the installation and keep data over the wire safe from prying eyes, so we will remove TLSv1 and also TLSv1.1 from the list of versions accepted by MySQL and leave only TLSv1.2 (TLSv1.3 is not supported with our current MySQL 8.0 and OpenSSL versions). Any guesses what will happen now?

mysql> SET PERSIST_ONLY tls_version = "TLSv1.2"; RESTART;
# /usr/bin/python -c "import MySQLdb; dbc=MySQLdb.connect(read_default_file='~/.my.cnf').cursor(); dbc.execute('select @@version, @@version_comment, ssl_version, ssl_cipher, user from sys.session_ssl_status inner join sys.processlist on thread_id = thd_id and conn_id = connection_id()'); print dbc.fetchall()"
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/lib64/python2.7/site-packages/MySQLdb/__init__.py", line 81, in Connect
    return Connection(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/MySQLdb/connections.py", line 193, in __init__
    super(Connection, self).__init__(*args, **kwargs2)
_mysql_exceptions.OperationalError: (2026, 'SSL connection error: error:00000001:lib(0):func(0):reason(1)')

MySQLdb can no longer connect to MySQL, sadly with a rather cryptic message! However, as we made the change that triggered this we don’t need to decipher it and can now start looking to add TLSv1.2 support to MySQLdb, so roll up your sleeves!

Solution: Build a new RPM

In order to build a new RPM we will need to do a little extra work in the container first of all, but it doesn’t take long and is pretty simple to do. We are going to install the necessary packages to create a basic build environment and then rebuild the MySQL-python RPM from its current source RPM.

## Download packages
# yum install -q -y rpm-build yum-utils gnupg2 rsync deltarpm gcc
Package yum-utils-1.1.31-50.el7.noarch already installed and latest version
Package gnupg2-2.0.22-5.el7_5.x86_64 already installed and latest version
Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
## Create build directory tree
# install -d /usr/local/builds/rpmbuild/{BUILD,RPMS,SOURCES,SPECS,SRPMS}
## Configure the RPM macro
# echo "%_topdir /usr/local/builds/rpmbuild" > ~/.rpmmacros
## Switch to a temporary directory to ease cleanup
# cd "$(mktemp -d)"; pwd
/tmp/tmp.xqxdBeLkXs
## Download the source RPM
# yumdownloader --source -q MySQL-python
Enabling updates-source repository
Enabling base-source repository
Enabling extras-source repository
## Extract the source RPM
# rpm2cpio "$(ls -1 MySQL-python*src.rpm)" | cpio -idmv
MySQL-python-1.2.5.zip
MySQL-python-no-openssl.patch
MySQL-python.spec
234 blocks

We are now ready to start making some changes to the source code and build specifications, but first of all we need to take note of another change that occurred in MySQL 8.0. Older code will reference my_config.h, which has since been removed and is no longer required for building; the fix for this will be shown below.

## Adjust the spec file to use percona-server-devel and allow a build number
# sed -i "s/mysql-devel/percona-server-devel/g; s/Release: 1/Release: %{buildno}/" MySQL-python.spec
## Store the ZIP filename and extract
# SRC_ZIP="$(ls -1 MySQL-python*.zip)"; unzip "${SRC_ZIP}"
## Store the source directory and remove the include of my_config.h
# SRC_DIR=$(find . -maxdepth 1 -type d -name "MySQL-python*"); sed -i 's/#include "my_config.h"/#define NO_MY_CONFIG/' "${SRC_DIR}/_mysql.c"
## View our _mysql.c change
# fgrep -m1 -B3 -A1 -n NO_MY_CONFIG "${SRC_DIR}/_mysql.c"
41-#if defined(MS_WINDOWS)
42-#include
43-#else
44:#define NO_MY_CONFIG
45-#endif
## Update the source
# zip -uv "${SRC_ZIP}" "${SRC_DIR}/_mysql.c"
updating: MySQL-python-1.2.5/_mysql.c   (in=84707) (out=17314) (deflated 80%)
total bytes=330794, compressed=99180 -> 70% savings

Now that we have adjusted the source code and specification we can start work on the new package so that we can once again connect to MySQL.

## Sync the source to the build tree
# rsync -ac ./ /usr/local/builds/rpmbuild/SOURCES/
## Copy the new specification file to the build tree
# cp -a MySQL-python.spec /usr/local/builds/rpmbuild/SPECS/
## Build a new source RPM with our updates
# rpmbuild --define "buildno 2" -bs /usr/local/builds/rpmbuild/SPECS/MySQL-python.spec
Wrote: /usr/local/builds/rpmbuild/SRPMS/MySQL-python-1.2.5-2.el7.src.rpm
## Use the new source RPM to install any missing dependencies
# yum-builddep -q -y /usr/local/builds/rpmbuild/SRPMS/MySQL-python-1.2.5-2.el7.src.rpm &>>debug.log
## Build a new RPM
# rpmbuild --define "buildno 2" --rebuild /usr/local/builds/rpmbuild/SRPMS/MySQL-python-1.2.5-2.el7.src.rpm &>>debug.log
# tail -n1 debug.log
+ exit 0

All went well and so we can now install the new package and see if it worked.

## Install the local RPM
# yum localinstall -q -y /usr/local/builds/rpmbuild/RPMS/x86_64/MySQL-python-1.2.5-2.el7.x86_64.rpm
## Check to see which libmysqlclient is used
# ldd /usr/lib64/python2.7/site-packages/_mysql.so | fgrep libmysqlclient
        libmysqlclient.so.21 => /usr/lib64/mysql/libmysqlclient.so.21 (0x00007f61660be000)
## Test the connection
# /usr/bin/python -c "import MySQLdb; dbc=MySQLdb.connect(read_default_file='~/.my.cnf').cursor(); dbc.execute('select @@version, @@version_comment, ssl_version, ssl_cipher, user from sys.session_ssl_status inner join sys.processlist on thread_id = thd_id and conn_id = connection_id()'); print dbc.fetchall()"
(('8.0.12-1', "Percona Server (GPL), Release '1', Revision 'b072566'", 'TLSv1.2', 'ECDHE-RSA-AES128-GCM-SHA256', 'testme@localhost'),)

Almost there… now force user authentication with the caching_sha2_password plugin

Hurrah! We can once again connect to the database and this time we are using TLSv1.2 and have thus increased our security. There is one thing left to do now though. Earlier on we needed to change the authentication plugin, so we will now change it back for extra security and see if all is still well.

mysql> ALTER USER 'testme'@'%' IDENTIFIED WITH caching_sha2_password BY "my-secret-pw";
# /usr/bin/python -c "import MySQLdb; dbc=MySQLdb.connect(read_default_file='~/.my.cnf').cursor(); dbc.execute('select @@version, @@version_comment, ssl_version, ssl_cipher, user from sys.session_ssl_status inner join sys.processlist on thread_id = thd_id and conn_id = connection_id()'); print dbc.fetchall()"
(('8.0.12-1', "Percona Server (GPL), Release '1', Revision 'b072566'", 'TLSv1.2', 'ECDHE-RSA-AES128-GCM-SHA256', 'testme@localhost'),)

Mission successful! Hopefully, if you are finding yourself in need of a little extra time migrating away from MySQLdb then this will help.


Image modified from photo by David Clode on Unsplash

Feb
11
2019
--

Google Docs gets an API for task automation

Google today announced the general availability of a new API for Google Docs that will allow developers to automate many of the tasks that users typically do manually in the company’s online office suite. The API has been in developer preview since last April’s Google Cloud Next 2018 and is now available to all developers.

As Google notes, the REST API was designed to help developers build workflow automation services for their users, build content management services and create documents in bulk. Using the API, developers can also set up processes that manipulate documents after the fact to update them, and the API also features the ability to insert, delete, move, merge and format text, insert inline images and work with lists, among other things.

The canonical use case here is invoicing, where you need to regularly create similar documents with ever-changing order numbers and line items based on information from third-party systems (or maybe even just a Google Sheet). Google also notes that the API’s import/export abilities allow you to use Docs for internal content management systems.

Some of the companies that built solutions based on the new API during the preview period include Zapier, Netflix, Mailchimp and Final Draft. Zapier integrated the Docs API into its own workflow automation tool to help its users create offer letters based on a template, for example, while Netflix used it to build an internal tool that helps its engineers gather data and automate its documentation workflow.

 

 

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com