Data Standardization Using Hidden Markov Model

Record linkage is the process of joining records that relate to the same entity or event in one or more data collections. In the absence of a shared, unique key, record linkage involves comparing ensembles of partially identifying, non-unique data items between pairs of records. Data items with variable formats, such as names and addresses, need to be transformed and normalized before these comparisons can be carried out validly. Traditionally, deterministic rule-based data processing systems have been used for this pre-processing, which is commonly referred to as “standardization”. This project describes an alternative approach to standardization that combines lexicon-based tokenization with probabilistic Hidden Markov Models (HMMs).

The project is developed in Visual Studio using C# .NET. The system has a single user role, the admin, who must log in with valid credentials before accessing the system. After a successful login, the admin can add training data by filling in all the registration fields. To analyse data, the admin enters values into the information fields in arbitrary order, so the input is a collection of unstructured data that needs to be structured. The admin then runs the analysis: the algorithm processes the input and displays the structured result in the analysed-data section, with each value placed in its respective field.
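To make the idea of lexicon-based tokenization followed by HMM decoding more concrete, here is a minimal sketch. It is written in Python for brevity and is not the project's C# code. The lexicon, the state names, and every probability are made-up assumptions for illustration; in practice they would be estimated from the training data the admin enters. The decoder is the standard Viterbi algorithm, which finds the most likely sequence of hidden states (output fields) for the observed token classes.

```python
import math

# Hypothetical lexicon mapping known words to coarse token classes.
LEXICON = {
    "mr": "TITLE", "mrs": "TITLE", "dr": "TITLE",
    "john": "GIVENNAME", "mary": "GIVENNAME", "peter": "GIVENNAME",
    "smith": "SURNAME", "jones": "SURNAME", "miller": "SURNAME",
}

# Hidden states correspond to the structured output fields.
STATES = ["title", "givenname", "surname"]

# Illustrative (assumed) probabilities; a real system would learn these.
START = {"title": 0.3, "givenname": 0.5, "surname": 0.2}
TRANS = {
    "title": {"title": 0.05, "givenname": 0.65, "surname": 0.30},
    "givenname": {"title": 0.05, "givenname": 0.25, "surname": 0.70},
    "surname": {"title": 0.05, "givenname": 0.60, "surname": 0.35},
}
EMIT = {
    "title": {"TITLE": 0.90, "GIVENNAME": 0.03, "SURNAME": 0.03, "UNKNOWN": 0.04},
    "givenname": {"TITLE": 0.01, "GIVENNAME": 0.70, "SURNAME": 0.14, "UNKNOWN": 0.15},
    "surname": {"TITLE": 0.01, "GIVENNAME": 0.14, "SURNAME": 0.65, "UNKNOWN": 0.20},
}


def tokenize(text):
    """Split free text into (word, token_class) pairs using the lexicon."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return [(w, LEXICON.get(w, "UNKNOWN")) for w in words]


def viterbi(observations):
    """Return the most likely hidden-state path for a non-empty observation list."""
    # Work in log space to avoid underflow on long inputs.
    best = [{s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
             for s in STATES}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + math.log(TRANS[p][s]))
            scores[s] = (best[-1][prev] + math.log(TRANS[prev][s])
                         + math.log(EMIT[s][obs]))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def standardize(text):
    """Assign each word of the input to an output field via the HMM."""
    tokens = tokenize(text)
    if not tokens:
        return {}
    path = viterbi([token_class for _, token_class in tokens])
    record = {}
    for (word, _), state in zip(tokens, path):
        record.setdefault(state, []).append(word)
    return {field: " ".join(words) for field, words in record.items()}
```

With these assumed probabilities, `standardize("Smith, John")` assigns "smith" to the surname field and "john" to the given-name field even though the words arrive in reversed order, which is the kind of restructuring the analysis step performs on data entered in arbitrary order.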



Advantages
  • Strong statistical foundation.
  • Efficient learning algorithms: learning can take place directly from raw sequence data.
  • Consistent treatment of insertion and deletion penalties in the form of locally learnable probabilities.
  • Can handle inputs of variable length; they are the most flexible generalization of sequence profiles.
  • Wide variety of applications including multiple alignment, data mining and classification, structural analysis, and pattern discovery.
Disadvantages
  • HMMs often have a large number of unstructured parameters.
  • First-order HMMs are limited by the first-order Markov property.
  • They cannot express dependencies between hidden states beyond adjacent ones, so long-range correlations are hard to model.
  • Only a small fraction of distributions over the space of possible sequences can be represented by a reasonably constrained HMM.
  • It may provide inaccurate results if the data is not entered correctly.
