📉📊📈Top 6 tools to analyze data of any nature
DataRobot is a platform for scaling machine learning capabilities. It contains a massive library of open-source and in-house models, solves complex Data Science problems, and delivers fully explainable AI through human-friendly visual representations. The downside is the price, but a free trial is available.
Alteryx combines analytics, machine learning, data science and process automation into a single end-to-end platform. The platform accepts data from hundreds of other data stores (including Oracle, Amazon, and Salesforce), allowing you to spend more time analyzing and less searching. Alteryx allows you to quickly prototype machine learning models and pipelines using automated model training blocks. It helps you easily visualize data throughout the entire problem solving and modeling journey.
H2O is an open-source, distributed, in-memory machine learning tool with linear scalability. It supports almost all popular statistical and machine learning algorithms, including generalized linear models, deep learning, and gradient boosted machines. H2O takes data directly from Spark, Azure, HDFS, and various other sources into its distributed in-memory key-value store.
SPSS Statistics - designed to solve business and research problems through detailed analysis, hypothesis testing and predictive analytics. SPSS can read and write data from spreadsheets, databases, ASCII text files, and other statistical packages. It can read and write data to external relational database tables via SQL and ODBC.
RapidMiner - supports all stages of the machine learning workflow, including data preparation, result visualization, model validation, and optimization. In addition to its own collection of datasets, RapidMiner provides several options for creating a cloud database to store huge amounts of data. It can store and load data from various platforms such as NoSQL, Hadoop, RDBMS, etc.
Weka is a set of visualization tools and algorithms for data analysis and predictive modeling. All of them are available free of charge under the GNU General Public License. Users can experiment with their datasets by applying different algorithms to see which model gives the best result. They can then use visualization tools to explore the data.
🔥💥Useful web analytics platforms today
Segment is a web platform and API for web analytics and for collecting user data and sending it to hundreds of tools or data stores. With Segment, you can export data to any internal system or application, replay historical data, and view real-time events, such as when someone makes a purchase on a website or in an app.
Metabase is an open source business intelligence tool. Users ask questions about the data, and Metabase displays the answers in meaningful formats like a bar chart or table. Data questions are saved and grouped into informative dashboards that the entire team uses.
Matomo is a web analytics platform that includes a built-in tag manager that allows you to monitor and control the performance of various marketing campaigns. Features include custom data storage, SAML and LDAP authentication, activity logs, media analytics, and custom reports.
SimilarWeb is a cloud-based website traffic analysis platform. Features include data export, performance metrics, custom dashboards, and conversion analysis. Marketing teams benefit from the ability to view demographic data, analyze customer behavior, and discover new opportunities.
Amplitude is a popular product analytics suite that tracks website visitors through collaborative analysis. The software uses custom behavior reports and notifications to offer a better understanding of how visitors interact, as well as provide actionable insights to speed up product development. Amplitude also allows you to define product strategies, improve customer engagement and optimize conversions.
💥⚡️Data markup services today
1. Hasty.ai - this platform has a lot of "smart" tools like GrabCut, Contour and DEXTR that recognize the edges or contours of objects, which can be manually adjusted with a threshold value for the best image segmentation. It also supports markup prediction once enough data has been annotated. The platform's second feature is the ability to train your own models for object detection, semantic segmentation, and instance segmentation. The only drawback is that processing takes time (up to 10-20 seconds) that could otherwise be spent on the labeling itself.
2. Superannotate is a Silicon Valley startup that provides vector annotations (rectangles, polygons, lines, ellipses, template keypoints, and cuboids) and pixel-by-pixel brush annotation. The best part of this tool is the superpixel function: it recognizes the edges of objects with extremely high accuracy, which greatly speeds up semantic and instance segmentation compared to other tools. The only problem is that if the boundary between the subject and the background is fuzzy, you spend more time manipulating the segments than doing the actual work.
3. Datasaur is a data markup program that focuses on text markup. If you need a data markup tool for natural language processing, then this is a pretty interesting option.
4. Clarifai - provides many useful opportunities for AI training. It can mark up data in pictures, videos and text.
5. V7 Darwin - this tool is actively used for annotating images. It allows you to recognize any area or object. It can even be used in videos to automatically annotate keyframes.
😎🔎Several useful geodata Python libraries
gmaps is a library for working with Google maps. Designed for visualization and interaction with Google geodata.
Leafmap is an open source Python package for creating interactive maps and geospatial analysis. It allows you to analyze and visualize geodata in a few lines of code in the Jupyter environment (Google Colab, Jupyter Notebook and JupyterLab)
ipyleaflet is an interactive widget library that allows you to visualize maps
Folium is a Python library for easy geodata visualization. It provides a Python interface to leaflet.js, one of the best JavaScript libraries for creating interactive maps. This library also allows you to work with GeoJSON and TopoJSON files, create background cartograms with different palettes, customize tooltips and interactive inset maps.
geopandas is a library for working with spatial data in Python. The main object of the GeoPandas module is the geodataframe, which is exactly the same as the Pandas dataframe definition, but also includes a Geometry field that contains the definition of the feature.
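All of the libraries above ultimately consume GeoJSON-style structures. As a rough, library-agnostic illustration (the coordinates and property names are invented for the example), here is a minimal GeoJSON feature built with only the standard library:

```python
import json

# Build a minimal GeoJSON FeatureCollection by hand (illustrative values).
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [37.6176, 55.7558]},  # lon, lat
    "properties": {"name": "Moscow"},
}
collection = {"type": "FeatureCollection", "features": [feature]}

# Serialize and parse back - this round-trips through plain text,
# which is exactly what map layers are loaded from.
geojson_text = json.dumps(collection)
parsed = json.loads(geojson_text)
print(parsed["features"][0]["geometry"]["coordinates"])
```

A dict like this can be passed, for example, to `folium.GeoJson(...)` and added to a Folium map as a layer.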
😳😱Sber has published a dataset for emotion recognition in Russian
Dusha is a huge open dataset for recognizing emotions in spoken Russian.
The dataset consists of over 300,000 audio recordings with transcripts and emotion tags, about 350 hours of audio in total. The team chose four main emotions that typically appear in a dialogue with a voice assistant: joy, sadness, anger, and neutral. Since each recording was labeled by several annotators, who also periodically performed control tasks, the resulting markup contains about 1.5 million records.
Read more about the Dusha dataset in the publication: https://habr.com/ru/companies/sberdevices/articles/715468/
DS books for beginners
1. Data Science. John Kelleher, Brendan Tierney - the book covers the main aspects, from setting up data collection and analysis to the ethical questions raised by privacy concerns. The authors walk the reader through how neural networks and machine learning work, and go through case studies of business problems and their solutions.
2. Practical Statistics for Data Scientists. Peter Bruce, Andrew Bruce - a hands-on textbook for data scientists who have programming skills and some familiarity with mathematical statistics. It presents the key statistical ideas of data science in an accessible way and explains which aspects of data analysis matter and why.
3. Learning Spark. Holden Karau, Matei Zaharia, Patrick Wendell, Andy Konwinski - the authors are developers of Spark itself. They show how to run and analyze tasks with a few lines of code and explain the system's design through examples.
4. Data Science from Scratch. Joel Grus - Joel Grus covers the Python language, elements of linear algebra, mathematical statistics, and methods for collecting, normalizing, and processing data. He also lays a foundation for machine learning, describing mathematical models and how to build them, including approaches such as k-nearest neighbors.
5. Fundamentals of Data Science and Big Data. Davy Cielen, Arno Meysman, Mohamed Ali - readers are introduced to the theoretical foundations, the machine learning workflow, working with large datasets, NoSQL, in-depth text analysis, and distributed computation. Examples are given as Data Science scripts in Python.
😳Dataset published for training an improved alternative to LLaMa
A group of researchers from various organizations and universities (Together, ontocord.ai, ds3lab.inf.ethz.ch, crfm.stanford.edu, hazyresearch.stanford.edu, mila.quebec) is working on an open-source alternative to the LLaMA model and has already published a dataset analogous to the one used to train it. The non-free but well-balanced LLaMA has served as the basis for projects such as Alpaca, Vicuna, and Koala.
Now RedPajama, a text dataset containing more than 1.2 trillion tokens, has been published in the public domain. The next step, according to the developers, will be training the model itself, which will require serious computing power.
🧐What is Data observability: Basic Principles
Data observability is a new level in the modern data processing stack, providing data teams with transparency and quality. The goal of data observability is to reduce the chance of errors in business decisions due to incorrect information in the data.
Observability is ensured by the following principles:
Freshness indicates how up-to-date your data structures are.
Distribution tells you if the data falls within the expected range.
Volume involves understanding the completeness of data structures and the state of data sources.
Schema tracks who makes changes to data structures and when.
Lineage maps upstream data sources to downstream data sinks, helping you determine where errors or failures occurred.
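The principles above lend themselves to simple automated checks. This is a toy sketch with invented thresholds, field names, and data, not a real observability tool:

```python
from datetime import datetime, timedelta

# Toy "table" with a load timestamp per row (invented data).
rows = [
    {"id": 1, "value": 10.5, "loaded_at": datetime.now() - timedelta(hours=2)},
    {"id": 2, "value": 11.1, "loaded_at": datetime.now() - timedelta(hours=1)},
]

def check_freshness(rows, max_age_hours=24):
    """Freshness: the newest row must be recent enough."""
    newest = max(r["loaded_at"] for r in rows)
    return datetime.now() - newest <= timedelta(hours=max_age_hours)

def check_volume(rows, min_rows=1):
    """Volume: the table must not be unexpectedly empty."""
    return len(rows) >= min_rows

def check_distribution(rows, lo=0.0, hi=100.0):
    """Distribution: values must fall within the expected range."""
    return all(lo <= r["value"] <= hi for r in rows)

print(check_freshness(rows), check_volume(rows), check_distribution(rows))
```

In a real pipeline such checks would run on a schedule and alert the data team instead of printing.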
More about data observability in the source: https://habr.com/ru/companies/otus/articles/559320/
😳A selection of Python libraries for random generation of test data
Many people love Python for its convenience in data processing. But sometimes you need to write and test an algorithm on a certain topic, and there is little or no public data on it. For such purposes, there are libraries that can generate fake data of the desired types.
Faker is a library for generating various types of random information. It also has an intuitive API. There are also implementations for languages such as PHP, Perl, and Ruby.
Mimesis is a Python library that helps generate data for various purposes. The library is written using the tools included in the Python standard library, so it does not have any third-party dependencies.
Radar - a library for generating random dates and times.
Fake2db - a library that allows you to generate data directly in the database (there are also engines for different DBMS).
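For very simple cases the same idea can be sketched with only the standard library; the field names and value pools below are invented for the example:

```python
import random
import string

random.seed(42)  # make the fake data reproducible

FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave"]
DOMAINS = ["example.com", "example.org"]

def fake_user():
    """Generate one fake user record with random but plausible fields."""
    name = random.choice(FIRST_NAMES)
    return {
        "name": name,
        "email": f"{name.lower()}@{random.choice(DOMAINS)}",
        "age": random.randint(18, 80),
        "token": "".join(random.choices(string.ascii_lowercase, k=8)),
    }

users = [fake_user() for _ in range(3)]
print(users[0])
```

Libraries like Faker and Mimesis do essentially this, but with far richer locales, field types, and providers.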
😎Searching for data and learning SQL at the same time is easy!!!
Census GPT is a tool that allows users to search for data about cities, neighborhoods, and other geographic areas.
Using artificial intelligence technology, Census-GPT organized and analyzed huge amounts of data to create a superdatabase. Currently, the Census-GPT database contains information about the United States, where users can request data on population, crime rates, education, income, age, and more. In addition, Census-GPT can display US maps in a clear and concise manner.
On the Census GPT site, users can also improve existing maps. The data results can be retrieved along with the SQL query. Accordingly, you can learn SQL and automatically test yourself on real examples.
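You can practice the same kind of query-writing locally with Python's built-in sqlite3; the toy table below is invented for the example and is not the real Census GPT schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, state TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [("Springfield", "IL", 114000), ("Portland", "OR", 652000), ("Austin", "TX", 961000)],
)

# A typical "data question" expressed as SQL: the most populous city.
row = conn.execute(
    "SELECT name, population FROM cities ORDER BY population DESC LIMIT 1"
).fetchone()
print(row)
conn.close()
```

Comparing your own SQL against the query a tool like Census GPT emits is a quick way to check yourself.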
🤓What is synthetic data and why is it used?
Synthetic data is artificial data that mimics observations of the real world and is used to prepare machine learning models when obtaining real data is not possible due to complexity or cost. Synthesized data can be used for almost any project that requires computer simulation to predict or analyze real events. There are many reasons why a business might consider using synthetic data. Here are some of them:
1. Efficiency of financial and time costs. If a suitable dataset is not available, generating synthetic data can be much cheaper than collecting real world event data. The same applies to the time factor: synthesis can take a matter of days, while collecting and processing real data sometimes takes weeks, months or even years.
2. Access to rare data. In some cases, data is rare or dangerous to collect. An example of rare data is a set of unusual fraud cases. An example of dangerous real-world data is traffic accidents, which self-driving cars must learn to respond to; these can be replaced by synthetic accidents.
3. Eliminate privacy issues. When it is necessary to process or transfer sensitive data to third parties, confidentiality issues should be taken into account. Unlike anonymization, synthetic data generation removes any trace of real data identity, creating new valid datasets without sacrificing privacy.
4. Ease of labeling and control. From a technical point of view, fully synthetic data simplifies labeling. For example, if an image of a park is generated, it is easy to automatically label the trees, people, and animals in it; you don't have to hire people to label these objects manually. In addition, fully synthesized data is easy to control and modify.
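As a toy illustration of points 1 and 4, fully synthetic data comes with labels "for free", because the generator knows what it produced (the classes, cluster centers, and noise below are all invented):

```python
import random

random.seed(0)  # reproducible synthetic data

def make_synthetic_point(label):
    """Generate a 2D point whose cluster center is determined by its label."""
    cx, cy = {"cat": (0.0, 0.0), "dog": (5.0, 5.0)}[label]
    return {
        "x": cx + random.gauss(0, 1),
        "y": cy + random.gauss(0, 1),
        "label": label,  # known by construction - no annotators needed
    }

dataset = [make_synthetic_point(random.choice(["cat", "dog"])) for _ in range(100)]
print(len(dataset), dataset[0]["label"])
```

Changing a class balance or adding a new class is a one-line edit, which is exactly the "easy to control and modify" property described above.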
🤔Data Mesh Architecture: Essence and Troubleshooting
Data Mesh is a decentralized, flexible approach to how distributed teams work with and share data. The main idea is cross-functional teams that publish and consume data products, significantly increasing the efficiency of data use.
The essence of the concept of this technology is as follows:
Data domains. In Data Mesh, a data domain is a way to define where enterprise data begins and ends. The boundaries depend on the company and its needs. Sometimes it makes sense to model domains by considering business processes or source systems.
The self-service platform in Data Mesh is built by generalist experts who create and manage versatile products. This approach relies on decentralization and on alignment with business users who understand the subject area and know what value particular data has. Specialized teams then develop stand-alone products that do not depend on a central platform.
Federated governance - when moving to a self-service distributed data platform, you need to focus on Governance. If you do not pay attention to it, it is possible to find yourself in a situation where disparate technologies are used in all domains, and data is duplicated. Therefore, both at the platform level and at the data level, you need to implement automated policies.
Data products are an important component of the Data Mesh, related to the application of product thinking to data. For a Data Product to work, it must deliver long-term value to users and be usable, valuable, and tangible. It can be implemented as an API, report, table, or dataset in a data lake.
With all this, the Data Mesh architecture has several problems:
Budget Constraints - The financial viability of the new platform project is threatened by several factors. In particular, this is the inability to pay for infrastructure, the development of expensive applications, the creation of Data products or the maintenance of such systems. For example, if the platform development team manages to create a tool that closes a technical gap, but the volume of data and the complexity of Data products continue to grow, the price of the solution may be too high.
Lack of technical skills - delegating full data ownership to domains means they have to take the project seriously. They may hire new employees or train existing ones, but the requirements can soon become overwhelming; when performance drops sharply, problems start appearing everywhere.
Data Products Monitoring - The team needs the right tools to build Data products and monitor what's going on in the company. Perhaps some domains lack a deep understanding of technical metrics and their impact on workloads. The platform development team needs resources to identify and address issues such as overburden or inefficiencies.
📖Top useful data visualization books
Effective Data Storytelling: How to Drive Change - The book was written by American business intelligence consultant Brent Dykes, and is also suitable for readers without a technical background. It's not so much about visualizing data as it is about how to tell a story through data. In the book, Dykes describes his own data storytelling framework - how to use three main elements (the data itself, narrative and visual effects) to isolate patterns, develop concept solutions and justify them to the audience.
Information Dashboard Design is a practical guide that outlines the best practices and most common mistakes in creating dashboards. A separate part of the book is devoted to an introduction to design theory and data visualization.
The Accidental Analyst is an intuitive step-by-step guide for solving complex data visualization problems. The book describes the seven principles of analysis, which determine the procedure for working with data.
Beautiful visualization. Looking at Data Through the Eyes of Experts - this book talks about the process of data visualization using examples of real projects. It features commentary from 24 industry experts—from designers and scientists to artists and statisticians—who talk about their data visualization methods, approaches, and philosophies.
The Big Book of Dashboards - This book is a guide to creating dashboards. In addition, the book has a whole section devoted to psychological factors. For example, how to respond if a customer asks you to improve your dashboard by adding a couple of useless charts.
🤼♂️Hive vs Impala: very worthy opponents
Hive and Impala are technologies that are used to analyze big data. In this post, we will look at the advantages and disadvantages of both technologies and compare them with each other.
Hive is a data analysis tool that is based on the HiveQL query language. Hive allows users to access data in Hadoop Distributed File System (HDFS) using SQL-like queries. However, due to the fact that Hive uses the MapReduce architecture, it may not be as fast as many other data analysis tools.
Impala is an interactive data analysis tool designed for use in a Hadoop environment. Impala works with SQL queries and can process data in real time. This means that users can quickly receive query results without delay.
What are the advantages and disadvantages of Hive and Impala?
Advantages of Hive:
• Hive is quite easily scalable and can handle huge amounts of data;
• Support for custom functions: Hive allows users to create their own functions and aggregates in the Java programming language, allowing the user to extend the functionality of Hive and create their own customized data processing solutions.
Disadvantages of Hive:
• Restrictions on working with streaming data: Hive is not suitable for working with streaming data because it uses MapReduce, which is a batch data processing framework. Hive processes data only after it has been written to files on HDFS, which limits Hive's ability to work with streaming data.
Advantages of Impala:
• Fast query processing: Impala provides high performance query processing due to the fact that it uses the MPP architecture and distributed data in memory. This allows analysts and developers to quickly get query results without delay.
Disadvantages of Impala:
• Limited scalability: Impala does not handle as large volumes of data as Hive and may experience scalability limitations when dealing with big data. Impala may require more resources to run than Hive.
• High resource requirements: Impala consumes more resources than Hive due to distributed memory usage. This may result in the need for more powerful servers to ensure performance.
The final choice between Hive and Impala depends on the specific situation and user requirements. If you work with large amounts of data and need a simple and accessible SQL-like environment, then Hive might be the best choice. On the other hand, if you need fast data processing and support for complex queries, then Impala may be preferable.
📈📉📊Python data visualization libraries you may not have heard of but might find very useful
Bokeh is an interactive visualization library for modern web browsers. It provides elegant, concise construction of versatile graphics and high-performance interactivity over large or streaming datasets.
Geoplotlib is a Python language library that allows the user to design maps and plot geographic data. This library is used to draw various types of maps such as heat maps, point density maps, and various cartographic charts.
Folium is a data visualization library in Python that helps the developer visualize geospatial data.
VisPy is a high performance interactive 2D/3D data visualization library. This library uses the processing power of modern graphics processing units (GPUs) through the OpenGL library to display very large datasets.
Pygal is a Python language library that is used for data visualization. This library also develops interactive charts that can be embedded in a web browser.
💥TOP-7 DS-events all over the world in June:
Jun 2-4 - Machine Learning Prague - Prague, Czech Republic - https://mlprague.com/
Jun 7-8 - Data Science Salon NYC | AI and ML in Finance & Technology - New York, NY, USA - https://www.datascience.salon/nyc/
Jun 8 - DATA CENTER 2023 - Ljubljana, Slovenia - https://datacenter.palsit.com/en/
Jun 14-15 - AI Summit London 2023 - London, Great Britain - https://london.theaisummit.com/
Jun 18-22 - Machine Learning Week - Las Vegas, USA - https://www.predictiveanalyticsworld.com/machinelearningweek/
Jun 19-22 - The Event for Machine Learning Technologies & Innovations - Munich, Germany - https://mlconference.ai/munich/
Jun 30 - Jul 2 - 4th Int. Conf. on Artificial Intelligence in Education Technology - Berlin, Germany - https://www.aiet.org/index.html
🧐📝🤖Clickhouse data processing: advantages and disadvantages
ClickHouse is an open-source columnar database for analytics and processing of large amounts of data. It was developed by Yandex and is designed for real-time processing and analysis of Big Data.
Clickhouse Benefits:
1. High performance: ClickHouse is designed to handle very large amounts of data at high speed. It can efficiently handle queries that require complex aggregations and analytics on billions of rows of data.
2. Scalability: ClickHouse provides horizontal scaling, which means that it can be easily scaled by adding new cluster nodes to increase performance and handle large amounts of data.
3. Low latency: ClickHouse provides low latency query execution due to its columnar architecture that allows you to quickly filter and aggregate the data needed to answer queries.
4. Efficient use of resources: ClickHouse is optimized to work on high-load systems and offers various mechanisms for data compression and memory management, which allows efficient use of server resources.
5. SQL query support: ClickHouse supports the standard SQL query language, which makes it easy to develop and integrate existing tools and applications.
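A toy sketch of why a columnar layout (the basis of points 1 and 3) speeds up aggregation: the same records stored row-wise and column-wise in plain Python. This illustrates the idea only, not how ClickHouse is actually implemented:

```python
# Row-oriented: each record is a dict; aggregating one field touches every record.
rows = [{"id": i, "price": float(i), "country": "DE"} for i in range(1000)]
total_rows = sum(r["price"] for r in rows)

# Column-oriented: each field is a contiguous list; an aggregation scans just
# one array (and a single-typed column also compresses much better).
columns = {
    "id": list(range(1000)),
    "price": [float(i) for i in range(1000)],
    "country": ["DE"] * 1000,
}
total_cols = sum(columns["price"])

print(total_rows == total_cols)  # same answer, different storage layout
```

The column version reads only the bytes of the `price` column, which is why analytical queries over a few columns of a wide table are so much cheaper in columnar systems.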
However, despite all the advantages, Clickhouse has a number of disadvantages:
1. Focused on analytics: ClickHouse is the best choice for analytical tasks, but may be less suitable for operational or transactional workloads where frequent data changes or the recording of a large number of small transactions are required.
2. Complexity of configuration and management: Setting up and managing ClickHouse can be a complex process, especially for beginners. Some aspects, such as data distribution, require careful planning and experience to achieve optimal performance.
3. Lack of full support for transactions: ClickHouse does not provide full support for transactions, which may be a disadvantage for some applications that require data consistency and atomic operations.
4. Difficulty in making data schema changes: Making data schema changes in ClickHouse can be complex and require reloading data or rebuilding tables, which can be time and resource consuming.
Thus, ClickHouse is a powerful and efficient system for analytics on large volumes of data, but requires careful planning and configuration for optimal performance.
😱YouTube video has been turned into a data warehouse
The ISG algorithm was created by enthusiasts to turn YouTube videos into free and virtually unlimited data storage. The essence of the algorithm is that it embeds files into videos and uploads them to YouTube as part of the video. Each file is made up of bytes, which can be represented as numbers. In turn, each pixel in the video can be interpreted as either white (1) or black (0).
The result is a video, each frame of which contains information.
According to the developers, YouTube has no limit on the number of videos that can be uploaded, which makes this effectively infinite cloud storage.
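The core encoding idea can be sketched in a few lines of Python: each byte becomes eight black/white "pixels" and back. This is a simplification; the real algorithm also adds redundancy so that video compression does not destroy the data:

```python
def bytes_to_pixels(data: bytes) -> list[int]:
    """Encode each byte as 8 pixels: 1 = white, 0 = black (most significant bit first)."""
    return [(byte >> bit) & 1 for byte in data for bit in range(7, -1, -1)]

def pixels_to_bytes(pixels: list[int]) -> bytes:
    """Decode groups of 8 pixels back into bytes."""
    out = bytearray()
    for i in range(0, len(pixels), 8):
        byte = 0
        for bit in pixels[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)

payload = b"hello"
pixels = bytes_to_pixels(payload)
assert pixels_to_bytes(pixels) == payload  # lossless round trip
print(len(pixels), "pixels for", len(payload), "bytes")
```

Laying those pixels out frame by frame is what turns an arbitrary file into an uploadable video.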
⚔️🤖Pandas vs Datatable: features of comparison when working with big data
Pandas and Datatable are two popular libraries for working with data in the Python programming language. However, they have some features that are used to choose one or another library for a specific task.
Pandas is one of the most common and popular data manipulation libraries in Python. It provides a wide and flexible toolkit for working with various data structures, including tables, time series, multidimensional arrays, and more. Pandas also provides many data manipulation functions such as filtering, sorting, grouping, and aggregation.
Pandas Benefits:
1. Powerful tools for working with various data structures, including tables, time series, multidimensional arrays, and more.
2. Broad community support, with frequent bug fixes and library updates.
3. A rich set of functions for working with data, such as filtering, sorting, grouping, aggregation and more.
4. Fairly extensive documentation
Disadvantages of pandas:
1. Poor performance when working with large amounts of data.
2. Inconvenient when working with tables that have a very large number of columns.
Datatable is a library designed to improve the performance and efficiency of working with data in Python. It provides faster data handling than Pandas, especially when working with large amounts of data. Datatable provides a syntax very similar to Pandas, which makes it easier to switch from one library to the other.
Advantages of Datatable:
1. Sufficiently high performance when working with large amounts of data.
2. Syntax very similar to Pandas, which makes it easier to switch from one library to another.
Disadvantages of Datatable:
1. More limited functionality than Pandas
2. Limited cross-platform support: some functions in Datatable may behave differently on different platforms, which can cause problems during development and testing.
3. Small community: Datatable is not as widely used as Pandas, so there are relatively few community members who can help with questions and issues involving the library.
Thus, the choice between Pandas and Datatable depends on the specific task and performance requirements. If you work with large amounts of data and need maximum performance, then Datatable may be the better choice. If you need rich functionality and a broad set of data operations, then Pandas is the better choice.
🤓🧐Data consistency and its types
The concept of data consistency is complex and ambiguous, and its definition may vary depending on the context. In his article, which was translated by the VK Cloud team, the author discusses the concept of "consistency" in the context of distributed databases and offers his own definition of this term. In this article, the author identifies 3 types of data consistency:
1. Consistency in Brewer's theorem - According to this theorem, in a distributed system it is possible to guarantee only two of the following three properties:
Consistency: the system provides an up-to-date version of the data when it is read
Availability: every request to a node that is functioning properly results in a correct response
Partition Tolerance: The system continues to function even if there are network traffic failures between some nodes
2. Consistency in ACID transactions - In this category, consistency means that a transaction cannot lead to an invalid state, since the following components must be observed:
Atomicity: any operation will be performed completely or will not be performed at all
Consistency: after the completion of the transaction, the database is in a correct state
Isolation: when one transaction is executed, all other parallel transactions should not have any effect on it
Durability: even in the event of a failure (no matter what kind), a completed transaction is preserved
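Atomicity and consistency are easy to observe with SQLite, which ships with Python. In this sketch (the account schema is invented for the example), a simulated crash mid-transfer rolls back the half-finished transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    with conn:  # one transaction: commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        # the matching credit to 'b' would go here, but we "crash" first:
        raise RuntimeError("crash mid-transfer")
except RuntimeError:
    pass

# Atomicity: the debit from 'a' was rolled back, so no money vanished.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)
conn.close()
```

The `with conn:` block uses sqlite3's connection context manager, which commits on normal exit and rolls back when an exception escapes, so the database never exposes the invalid half-transferred state.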
3. Data Consistency Models - This definition of the term also applies to databases and is related to the concept of consistency models. There are two main elements in the database consistency model:
Linearizability - describes how operations on a single piece of data, replicated across multiple nodes, are processed so that they appear to occur in one global order
Serializability - describes how concurrent transactions touching several pieces of data are processed so that the result is equivalent to some serial execution
More details can be found in the source: https://habr.com/ru/companies/vk/articles/723734/
😎TOP-10 DS-events all over the world in May:
May 1-5 - ICLR - International Conference on Learning Representations - Kigali, Rwanda - https://iclr.cc/
May 8-9 - Gartner Data & Analytics Summit - Mumbai, India - https://www.gartner.com/en/conferences/apac/data-analytics-india
May 9-11 - Open Data Science Conference EAST - Boston, USA - https://odsc.com/boston/
May 10-11 - Big Data & AI World - Frankfurt, Germany - https://www.bigdataworldfrankfurt.de/
May 11-12 - Data Innovation Summit 2023 - Stockholm, Sweden - https://datainnovationsummit.com/
May 17-19 - World Data Summit - Amsterdam, The Netherlands - https://worlddatasummit.com/
May 19-22 - The 15th International Conference on Digital Image Processing - Nanjing, China - http://www.icdip.org/
May 23-25 - Software Quality Days 2023 - Munich, Germany - https://www.software-quality-days.com/en/
May 25-26 - The Data Science Conference - Chicago, USA - https://www.thedatascienceconference.com//
May 26-29 - 2023 The 6th International Conference on Artificial Intelligence and Big Data - Chengdu, China - http://icaibd.org/index.html
😩Uncertainty in data: common bugs
There is a lot of talk about “data preparation” and “data cleaning” these days, but what separates high quality data from low quality data?
Most machine learning systems today use supervised learning. This means that the training data consists of (input, output) pairs, and we want the system to be able to take the input and match it with the output. For example, the input might be an audio clip and the output might be a transcription of a speech. To create such datasets, it is necessary to label them correctly. If there is uncertainty in the labeling of the data, then more data may be needed to achieve high accuracy of the machine learning model.
Data collection and annotation may not be correct for the following reasons:
1. Simple annotation errors. The most basic type of error is mislabeling: an annotator, fatigued by a large volume of markup, accidentally assigns a sample to the wrong class. Although this is a trivial mistake, it is quite common and can seriously degrade the performance of an AI system.
2. Inconsistencies in annotation guidelines. Data items often involve subtleties of various kinds. For example, imagine reading social media posts and annotating whether they are product reviews. The task seems simple, but once you start annotating you realize that “product” is a rather vague concept. Should digital media, such as podcasts or movies, count as products? One specialist may say yes, another no, and the accuracy of the AI system suffers as a result.
3. Unbalanced data or missing classes. The way data is collected greatly affects the composition of datasets, which in turn can affect the accuracy of models on specific data classes or subsets. In most real world datasets, the number of examples in each category that we want to classify (class balance) can vary greatly. This can lead to reduced accuracy, as well as exacerbating balance problems and skew. For example, Google's AI facial recognition system was notorious for not being able to recognize faces of people of color, which was largely the result of using a dataset with insufficiently varied examples (among many other problems).
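A quick first diagnostic for the imbalance problem described above is to count examples per class and derive inverse-frequency class weights, so rare classes get proportionally more influence during training. This sketch uses only the standard library and an invented label distribution:

```python
from collections import Counter

# Hypothetical labels from an imbalanced dataset (e.g. a dataset skewed
# toward one category, as in the facial recognition example above).
labels = ["A"] * 900 + ["B"] * 90 + ["C"] * 10

counts = Counter(labels)
total = sum(counts.values())

# Inverse-frequency class weights: the rarer the class, the higher its weight.
# Most training frameworks accept such a mapping as a class_weight argument.
weights = {cls: total / (len(counts) * n) for cls, n in counts.items()}

print(counts)   # Counter({'A': 900, 'B': 90, 'C': 10})
print(weights)  # class 'C' ends up weighted 90x more than class 'A'
```

Class weighting does not replace collecting more varied data, but it is a cheap way to keep a model from simply ignoring minority classes.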
😎🥹Libraries for comfortable data processing
PyGWalker - Simplifies the data analysis and visualization workflow in Jupyter Notebook by turning a pandas dataframe into a Tableau-style user interface for visual exploration.
SciencePlots - A library for creating various matplotlib plots for presentations, research papers, etc.
CleverCSV is a library that fixes various parsing errors when reading CSV files with Pandas
fastparquet - Speeds up pandas I/O by about 5 times. fastparquet is a high performance Python implementation of the Parquet format designed to work seamlessly with Pandas dataframes. It provides fast read and write performance, efficient compression, and support for a wide range of data types.
Feather - a fast binary on-disk format for data frames. It is well suited for exchanging data between languages (for example, Python and R) and can read and write large amounts of data very quickly.
Dask - a library for organizing parallel computations efficiently. Large data collections are stored as parallel arrays/dataframes and can be manipulated through familiar NumPy/Pandas-style APIs
Ibis - provides access between the local environment in Python and remote data stores (for example, Hadoop)
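The kind of messy-CSV handling that CleverCSV automates has a baseline in the standard library: csv.Sniffer guesses a file's dialect from a sample. A minimal sketch with an invented semicolon-delimited file:

```python
import csv
import io

# A CSV sample with a non-default delimiter; files like this are where
# dialect detection (and libraries such as CleverCSV) earn their keep.
sample = "name;age;city\nalice;34;Boston\nbob;29;Chicago\n"

dialect = csv.Sniffer().sniff(sample)  # guesses the delimiter from the sample
rows = list(csv.reader(io.StringIO(sample), dialect))

print(dialect.delimiter)  # ';'
print(rows[1])            # ['alice', '34', 'Boston']
```

CleverCSV's claim is that it detects dialects more reliably than this heuristic on real-world files; the stdlib version is still a reasonable first try.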
📝A selection of sources with medical datasets
The international healthcare system generates a wealth of medical data every day that (at least in theory) can be used for machine learning.
Here are some sources with medical datasets:
1. The Cancer Imaging Archive (TCIA) funded by the US National Cancer Institute (NCI) is a publicly accessible repository of radiological and histopathological images
2. National Covid-19 Chest Imaging Database (NCCID), part of the NHS AI Lab, contains radiographs, MRIs, and Chest CT scans of hospital patients across the UK. It is one of the largest archives of its kind, with 27 hospitals and foundations contributing.
3. Medicare Provider Catalog collects official data from the Centers for Medicare and Medicaid Services (CMS). It covers many different topics, from the quality of care in hospitals, rehabilitation centers, hospices and other healthcare facilities, to the cost of a visit and information about doctors and clinicians. You can view the data in a browser, download specific datasets in CSV format, or connect your own applications to the site via the API.
4. Older Adults Health Data Collection on Data.gov consists of 96 datasets managed by the US federal government. Its main purpose is to collect information about the health of people over 60 in the context of the Covid-19 pandemic and beyond. Organizations involved in maintaining the collection include the US Department of Health and Human Services, the Department of Veterans Affairs, the Centers for Disease Control and Prevention (CDC), and others. Datasets can be downloaded in various formats: HTML, CSV, XSL, JSON, XML and RDF.
5. The Cancer Genome Atlas (TCGA) is a major genomics database covering 33 disease types, including 10 rare ones.
6. Surveillance, Epidemiology, and End Results (SEER) is the most authoritative source of cancer statistics in the United States, created to help reduce the cancer burden in the population. Its database is maintained by the Surveillance Research Program (SRP), part of the Division of Cancer Control and Population Sciences (DCCPS) of the National Cancer Institute.
🌎TOP-10 DS-events all over the world in April:
Apr 1 - IT Days - Warsaw, Poland - https://warszawskiedniinformatyki.pl/en/
Apr 3-5 - Data Governance, Quality, and Compliance - Online - https://tdwi.org/events/seminars/april/data-governance-quality-compliance/home.aspx
Apr 4-5 - HEALTHCARE NLP SUMMIT - Online - https://www.nlpsummit.org/
Apr 12-13 - Insurance AI & Innovative Tech USA - Chicago, USA - https://events.reutersevents.com/insurance/insuranceai-usa
Apr 17-18 - ICDSADA 2023: 17th International Conference on Data Science and Data Analytics - Boston, USA - https://waset.org/data-science-and-data-analytics-conference-in-april-2023-in-boston
Apr 25 - Data Science Day 2023 - Vienna, Austria - https://wan-ifra.org/events/data-science-day-2023/
Apr 25-26 - Chief Data & Analytics Officers, Spring - San Francisco, USA - https://cdao-spring.coriniumintelligence.com/
Apr 25-27 - International Conference on Data Science, E-learning and Information Systems 2023 - Dubai, UAE - https://iares.net/Conference/DATA2022
Apr 26-27 - Computer Vision Summit - San Jose, USA - https://computervisionsummit.com/location/cvsanjose
Apr 26-28 - PYDATA SEATTLE 2023 - Seattle, USA - https://pydata.org/seattle2023/
🤔What is Data Mesh: the essence of the concept
Data Mesh is a decentralized, flexible approach to how distributed teams work with data and share information. It emerged as a response to the dominant concepts of data work in data-driven organizations: the Data Warehouse and the Data Lake. Both are built around the idea of centralization: all data flows into a central repository, from which different teams take it for their own purposes. However, this repository has to be supported by a team of data engineers with a specialized skill set, and as the number of sources and the variety of data grow, it becomes ever harder to maintain data quality, while transformation pipelines grow ever more complex.
Data Mesh proposes to solve these and other problems based on four main principles:
1. Domain-oriented ownership - domain teams own data, not a centralized Data team. A domain is a part of an organization that performs a specific business function, for example, it can be product domains (mammography, fluorography, CT scan of the chest) or a domain for working with scribes.
2. Data as a product - data is treated not as a static dataset but as a dynamic product with its own users, quality metrics, and development backlog, overseen by a dedicated product owner.
3. Self-serve data platform. The main function of the data platform in Data Mesh is to eliminate unnecessary cognitive load. This allows developers in domain teams (data product developers and data product consumers) who are not data scientists to conveniently create Data products, build, deploy, test, update, access and use them for their own purposes.
4. Federated computational governance - instead of centralized data management, a special federated body is created, consisting of representatives of domain teams, data platforms and experts (for example, lawyers and doctors), which sets global policies in the field of working with data and discusses the development of the data platform.
💥YTsaurus: a system for storing and processing Yandex's Big Data has become open-source
YTsaurus is an open source distributed platform for storing and processing big data. The system is based on MapReduce, a distributed file system, and a NoSQL key-value database.
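The MapReduce model the platform builds on can be illustrated with a generic word-count sketch (this is the textbook pattern, not YTsaurus's actual SDK):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Map: emit (key, value) pairs -- here (word, 1) for a word count.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big tables", "big data platform"]
counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, docs))))
print(counts)  # {'big': 3, 'data': 2, 'tables': 1, 'platform': 1}
```

In a real distributed system each phase runs across many machines over the distributed file system; the programming model, however, stays this simple.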
YTsaurus is built on top of Cypress - a fault-tolerant tree-based storage that provides features such as:
a tree namespace whose nodes are directories, tables (structured or semi-structured data), and files (unstructured data);
support for columnar and row mechanisms for storing tabular data;
expressive data schematization with support for hierarchical types and features of data sorting;
background replication and repair of erasure data that do not require any manual actions;
transactions that can affect many objects and last indefinitely.
In general, YTsaurus is a fairly powerful computing platform that involves running arbitrary user code. Currently, YTsaurus dynamic tables store petabytes of data, and a large number of interactive services are built on top of them.
The GitHub repository contains the YTsaurus server code, deployment infrastructure using k8s, as well as the system's web interface and client SDKs for common programming languages - C++, Java, Go and Python. All of this is under the Apache 2.0 license, which allows anyone to deploy it on their own servers and modify it to suit their needs.
⛑⛑⛑Medical data: what to observe when working with the health service
The main problem with health data is its sensitivity: it contains confidential information protected by the Health Insurance Portability and Accountability Act (HIPAA) and may not be used without express consent. In the medical field, such sensitive details are referred to as protected health information (PHI). Here are a few factors to consider when working with medical datasets:
Protected Health Information (PHI) is contained in various medical documents: emails, clinical notes, test results, or CT scans. While diagnoses or medical prescriptions are not considered sensitive information in and of themselves, they are subject to HIPAA when matched against so-called identifiers: names, dates, contacts, social security or account numbers, photographs of individuals, or other elements that can be used to locate or identify a particular patient, as well as contact him.
Anonymization of medical data and removal of personal information. Personal identifiers and even parts of them (such as initials) must be removed before medical data can be used for research or business purposes. There are two ways to do this - anonymization and de-identification. Anonymization is the permanent elimination of all sensitive data. De-identification, by contrast, strips identifiers from the records and stores them, encrypted, in separate datasets; the identifiers can later be re-associated with the health information.
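The identifier-stripping step can be sketched with a few regular expressions. This is only an illustration with an invented clinical note: real de-identification pipelines rely on curated identifier lists and NLP models, not three regexes.

```python
import re

# Toy clinical note; the name, date, and SSN are invented for illustration.
note = "Patient John Smith, DOB 03/14/1961, SSN 123-45-6789, presents with cough."

# Minimal regex-based masking of a few HIPAA identifier patterns.
patterns = {
    r"\b\d{3}-\d{2}-\d{4}\b": "[SSN]",   # social security numbers
    r"\b\d{2}/\d{2}/\d{4}\b": "[DATE]",  # dates
    r"\bJohn Smith\b": "[NAME]",         # known patient name
}

masked = note
for pattern, tag in patterns.items():
    masked = re.sub(pattern, tag, masked)

print(masked)
# Patient [NAME], DOB [DATE], SSN [SSN], presents with cough.
```

For de-identification (as opposed to anonymization), the matched values would additionally be stored, encrypted, in a separate dataset so they can be re-linked later.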
Medical data markup. Any unstructured data (texts, images or audio files) for training machine learning models requires markup or annotation. This is the process of adding descriptive elements (labels or tags) to data blocks so that the machine can understand what is in the image or text. When working with healthcare data, healthcare professionals should perform the markup. The hourly cost of their services is much higher than that of annotators who do not have domain knowledge. This creates another barrier to the generation of high-quality medical datasets.
In summary, preparing medical data for machine learning typically requires more money and time than the average for other industries due to strict regulation and the involvement of highly paid subject matter experts. Consequently, we are seeing a situation where public medical datasets are relatively rare and are attracting serious attention from researchers, data scientists, and companies working on AI solutions in the field of medicine.
📝How to improve medical datasets: 4 small tips
Getting the right data in the right amount - before ordering datasets of medical images, project managers need to coordinate with teams of machine learning, data science and clinical researchers. This will help overcome the difficulty of getting “bad” data or having annotation teams filter through thousands of irrelevant or low quality images and videos when annotating training data, which is costly and time consuming.
Empowering annotator teams with AI-based tools - annotating medical images for machine learning models requires precision, efficiency, high levels of quality, and security. With AI-based image annotation tools, medical annotators and specialists can save hours of work and generate more accurately labeled medical images.
Ensuring ease of data transfer - clinical data should be delivered and communicated in a format that is easy to parse, annotate, port, and after annotation, quickly and efficiently transfer to an ML model.
Overcoming the complexities of storage and transmission - medical image data often runs to hundreds or thousands of terabytes that cannot simply be mailed. Project managers need to ensure the end-to-end security and efficiency of purchasing or retrieving, cleaning, storing and transferring medical data.
💥Top sources of various datasets for data visualization
FiveThirtyEight is a journalism site that makes its datasets from its stories available to the public. These provide researched data suitable for visualization and include sets such as airline safety, election predictions, and U.S. weather history. The sets are easily searchable, and the site continually updates.
Earth Data offers science-related datasets for researchers in open access formats. Information comes from NASA data repositories, and users can explore everything from climate data to specific regions like oceans, to environmental challenges like wildfires. The site also includes tutorials and webinars, as well as articles. The rich data offers environmental visualizations and contains data from scientific partners as well.
The GDELT Project collects events at a global scale. It offers one of the biggest data repositories for human civilization. Researchers can explore people, locations, themes, organizations, and other types of subjects. Data is free, and users can also download RAW data sets for unique use cases. The site also offers a variety of tools as well for users with less experience doing their own visualizations.
Singapore Public Data - another civic source of data, the Singapore government makes these datasets available for research and exploration. Users can search by subject through the navigation bar or enter search terms themselves. Datasets cover subjects like the environment, education, infrastructure, and transport.